
Divide and conquer broadphase


Hello everyone,

I just wanted to share what I believe to be a novel broadphase collision detection technique that I stumbled upon while working on my physics engine. If anyone knows of related work, please refer me to it!

The code will be at the end of this post, but first some discussion.

Briefly, the goal of a broadphase is to determine potentially colliding pairs of objects. There exist many well-known techniques, such as uniform / hierarchical grids, bounding volume hierarchies, sweep and prune, and quadtrees. They all have their respective pros and cons, which I won't go into here. One thing they all have in common, however, is that they require building and maintaining a data structure which facilitates the process of finding pairs.

Divide and conquer broadphase is based on the relatively new "Divide and conquer ray-tracing" technique. The idea is to recursively subdivide space, partitioning objects into these spaces, until there are sufficiently few objects in a space. When there are few enough objects, use a brute force method to find the potential pairs. 

The great thing about this technique is that it requires no data structures to be built and uses a fixed amount of additional memory. In addition, the code is surprisingly simple.

The code shown works in 2D. It is trivially extended to 3D or beyond by changing only the partitioning method to consider the additional dimensions.
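For example, the only part that changes is the axis selection. A sketch of the 3D version (assuming a float3 type with operator[], analogous to the float2 used in the code below):

// Sketch: choosing the split axis in 3D, taking the longest of three
// extents instead of two. Everything else in the algorithm stays the same.
int longest_axis(const float3& size)
{
	int axis = 0;
	if (size[1] > size[axis]) axis = 1;
	if (size[2] > size[axis]) axis = 2;
	return axis;
}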

 

Performance analysis:

Subjectively speaking, I have tested this method against many others in my own physics engine on a variety of scenes. It performs very nicely, surpassing the dBVT in many cases (especially when many objects are in motion). It rivals the speed of my current best, the uniform grid.

From an analytical point of view, the best case is O(n log n) when the objects are not overlapping, and the worst case is O(n^2) when all objects occupy the same space. However, in typical physics simulation scenes, objects are nicely distributed and do not overlap much (assuming the constraint solver is doing its job). O(n log n) doesn't sound that great considering BVHs achieve the same on paper. But unlike BVHs, no structure needs to be maintained or updated, which causes it to perform better in practice.

Memory usage, as you will see, is minimal. The algorithm is run from scratch at the beginning of a time step.

 

Potential optimizations:

There are a few avenues for optimization. The first and most obvious being, unroll the recursion into a loop with an explicit stack in order to reduce the overhead of the recursive calls.
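A hedged sketch of that transformation, reusing the aabb / float2 / pair helpers from the implementation at the end of this post. Each node is revisited once (phase 1) to do the right-hand partition, which has to wait until the left subtree has finished reordering the shared index range:

// Sketch: the same recursion unrolled into a loop with an explicit stack.
void dac_bp_iterative(const vector<aabb>& bounds, vector<int>::iterator start,
                      vector<int>::iterator end, set<pair>& pairs)
{
	const int BRUTE_FORCE_THRESH = 32;

	struct frame { vector<int>::iterator start, end; int phase, axis; float split; };
	vector<frame> stack = { { start, end, 0, 0, 0.0f } };

	while (!stack.empty())
	{
		frame f = stack.back();
		stack.pop_back();

		if ((f.end - f.start) < BRUTE_FORCE_THRESH)
		{
			// Brute force this interval, exactly as in the recursive version
			for (auto i = f.start; i != f.end; i++)
				for (auto j = i + 1; j != f.end; j++)
					if (bounds[*i].intersects(bounds[*j]))
						pairs.insert(pair(*i, *j));
		}
		else if (f.phase == 0)
		{
			// Compute bounds and split as in the recursive version
			float2 pmin(FLT_MAX, FLT_MAX), pmax(-FLT_MAX, -FLT_MAX);
			for (auto i = f.start; i != f.end; i++)
			{
				pmin = min(pmin, bounds[*i].min);
				pmax = max(pmax, bounds[*i].max);
			}
			float2 size = pmax - pmin;
			f.axis = (size[1] > size[0]);
			f.split = (pmin[f.axis] + pmax[f.axis]) * 0.5f;

			auto mid_left = std::partition(f.start, f.end,
				[&](int i) { return bounds[i].min[f.axis] <= f.split; });

			// Revisit this node for the right partition after the left
			// subtree is done (LIFO order: left child is popped first)
			f.phase = 1;
			stack.push_back(f);
			stack.push_back({ f.start, mid_left, 0, 0, 0.0f });
		}
		else
		{
			auto mid_right = std::partition(f.start, f.end,
				[&](int i) { return bounds[i].max[f.axis] >= f.split; });
			stack.push_back({ f.start, mid_right, 0, 0, 0.0f });
		}
	}
}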

Another is to use a more intricate partitioning method, rather than the simple one shown here.
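For instance, a hypothetical variant that splits at the median object center (this assumes an aabb::center() helper, which is not in the code below) keeps the two halves balanced even for lopsided distributions. It would be a drop-in replacement for the split computation inside dac_bp:

// Sketch: median split via std::nth_element instead of the spatial midpoint.
// Assumes an aabb::center() helper returning a point with operator[].
auto mid = start + (end - start) / 2;
std::nth_element(start, mid, end, [&bounds, axis](int a, int b) {
	return bounds[a].center()[axis] < bounds[b].center()[axis];
});
float split = bounds[*mid].center()[axis];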

Note that I use a set to track pairs, since this method can result in duplicate pairs. The set is actually rather slow. You can easily swap this for a vector, then sort and remove duplicates at the end for a nice performance boost.
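Concretely, that swap looks something like this (assuming std::pair<int,int>, stored with the smaller index first so duplicates compare equal):

// Sketch: collect into a vector, then sort and deduplicate once at the end.
vector<std::pair<int, int>> pairs;

// ... inside the brute force loop, instead of inserting into a set:
// pairs.emplace_back(std::min(*i, *j), std::max(*i, *j));

std::sort(pairs.begin(), pairs.end());
pairs.erase(std::unique(pairs.begin(), pairs.end()), pairs.end());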

There exists a parallel version of "Divide and conquer ray-tracing". I'm sure the ideas presented there could be used to parallelize this technique as well, though I have not looked into it.

Lastly, one could potentially cache the partition locations, and use those as a starting point for computing the partition locations in the next frame. This exploits the temporal coherence of physics simulations. This could theoretically divide the cost of the whole process by 2, since computing the partition location is half of the work in each call.

 

Here is my implementation (feel free to use this however you wish):


#include <vector>
#include <set>
#include <algorithm>
#include <cfloat> // FLT_MAX

// Divide and conquer broadphase.
// aabb, float2, pair, min, and max are assumed user-defined math/geometry helpers.
void dac_bp(const vector<aabb>& bounds, const vector<int>::iterator start, const vector<int>::iterator end, set<pair>& pairs)
{
	const int BRUTE_FORCE_THRESH = 32;

	if ((end - start) < BRUTE_FORCE_THRESH)
	{
		// Brute force check the remaining pairs in this interval
		for (auto i = start; i != end; i++)
		{
			for (auto j = i + 1; j != end; j++)
			{
				if (bounds[*i].intersects(bounds[*j]))
				{
					pairs.insert(pair(*i, *j));
				}
			}
		}
	}
	else
	{
		// Compute bounds of all boxes in this interval
		float2 pmin(FLT_MAX, FLT_MAX), pmax(-FLT_MAX, -FLT_MAX);
		for (auto i = start; i != end; i++)
		{
			pmin = min(pmin, bounds[*i].min);
			pmax = max(pmax, bounds[*i].max);
		}

		// Choose the partition axis and partition location
		float2 size = pmax - pmin;
		int axis = (size[1] > size[0]);
		float split = (pmin[axis] + pmax[axis]) * 0.5f;

		// Partition boxes left and recurse
		auto mid_left = partition(start, end, [split, axis, &bounds](int i) { return bounds[i].min[axis] <= split; });

		dac_bp(bounds, start, mid_left, pairs);

		// Partition boxes right and recurse
		auto mid_right = partition(start, end, [split, axis, &bounds](int i) { return bounds[i].max[axis] >= split; });

		dac_bp(bounds, start, mid_right, pairs);
	}
}


// USAGE
// Build a vector containing bounds of all objects
vector<aabb> bounds = { /* fill with bounds of objects */ };

// Initialize indices
vector<int> indices;
for (size_t i = 0; i < bounds.size(); i++)
	indices.push_back((int)i);

// Get all colliding pairs!
set<pair> pairs;
dac_bp(bounds, indices.begin(), indices.end(), pairs);

 

Thanks for reading, and I hope someone finds this of use! If you have any further ideas for extending this, please share!


In your code you have a variable named partition and also a function; that won't compile, so I assume the function is std::partition?

Very neat and simple! So it's basically just recursively sorting the index array into something similar to a BVH... 

To make it parallel, you'd 'just' have to make a copy of the index array when partitioning instead of modifying the indices in place. Then you could run the left and right tasks in parallel. You'd also need one output buffer per thread and sort/merge at the end. 
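A rough sketch of that scheme (hypothetical, reusing the aabb / float2 / min / max helpers from the code above; duplicates are still possible, so the merged output would still need the sort/dedup pass at the end):

#include <future>
#include <utility>

// Sketch: instead of partitioning in place twice, copy the indices that fall
// on each side, so the two halves can run as independent tasks, each with its
// own output buffer. 'depth' limits how many levels spawn new tasks.
void dac_bp_parallel(const vector<aabb>& bounds, vector<int> indices,
                     vector<std::pair<int, int>>& out, int depth)
{
	const size_t BRUTE_FORCE_THRESH = 32;

	if (indices.size() < BRUTE_FORCE_THRESH)
	{
		for (size_t i = 0; i < indices.size(); i++)
			for (size_t j = i + 1; j < indices.size(); j++)
				if (bounds[indices[i]].intersects(bounds[indices[j]]))
					out.emplace_back(indices[i], indices[j]);
		return;
	}

	// Compute bounds and the split plane exactly as in the serial version
	float2 pmin(FLT_MAX, FLT_MAX), pmax(-FLT_MAX, -FLT_MAX);
	for (int i : indices)
	{
		pmin = min(pmin, bounds[i].min);
		pmax = max(pmax, bounds[i].max);
	}
	float2 size = pmax - pmin;
	int axis = (size[1] > size[0]);
	float split = (pmin[axis] + pmax[axis]) * 0.5f;

	// Copy indices to each side; a straddling box lands in both copies
	vector<int> left, right;
	for (int i : indices)
	{
		if (bounds[i].min[axis] <= split) left.push_back(i);
		if (bounds[i].max[axis] >= split) right.push_back(i);
	}

	// Guard against a degenerate split (everything straddles the plane)
	if (left.size() == indices.size() || right.size() == indices.size())
	{
		for (size_t i = 0; i < indices.size(); i++)
			for (size_t j = i + 1; j < indices.size(); j++)
				if (bounds[indices[i]].intersects(bounds[indices[j]]))
					out.emplace_back(indices[i], indices[j]);
		return;
	}

	if (depth > 0)
	{
		// Separate output buffer for the left task, merged once both finish
		vector<std::pair<int, int>> left_out;
		auto task = std::async(std::launch::async, [&]
			{ dac_bp_parallel(bounds, std::move(left), left_out, depth - 1); });
		dac_bp_parallel(bounds, std::move(right), out, depth - 1);
		task.wait();
		out.insert(out.end(), left_out.begin(), left_out.end());
	}
	else
	{
		dac_bp_parallel(bounds, std::move(left), out, 0);
		dac_bp_parallel(bounds, std::move(right), out, 0);
	}
}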

Ah yes, that should be std::partition ("cleaned" up my implementation for the post but missed that :P)

Yeah, it's sort of like a top-down BVH build with pair checking at the same time. The main difference from a BVH is that an object can potentially appear in both subtrees, hence the need to partition twice.

And yeah makes sense, you could copy the index array then run left and right in parallel. Wouldn't be fixed memory usage anymore, but we've got plenty of memory these days so no biggie :)

I'm unsure how this would be faster than a well-implemented BVH, since it is essentially a BVH construction algorithm. Do you have a BVH implementation to compare against and relative performance numbers on a variety of benchmarks? Those would be necessary to draw any real conclusions. I feel like there could only be a performance win here if the scene is highly dynamic and would require the BVH to be rebuilt each frame. In most cases, the BVH could be built every 10 frames, then only refitted on the other frames for much less cost than rebuilding every frame. BVH also allows better SIMD utilization using 4-ary or 8-ary trees (intersect single ray or AABB against 4 or 8 nodes at once), because the data can be pre-packed into a form that is fastest for the intersection query. BVH can also apply heuristics (e.g. SAH) during construction to greatly improve performance by choosing a better splitting plane. In theory, you could use the same heuristics in your algorithm, but would have to apply them every frame, rather than every 10+ frames, which would slow things down.

You should try an iterative implementation (with a manually implemented stack); it would probably be faster (all fast ray tracing algorithms use a loop + stack instead of recursive function calls).

The fastest broadphase I have personally implemented used a dynamic octree. Each object stored a pointer to the node it was in, and nodes had pointers to their parents/children. This allowed the tree to be updated really fast when no objects are moving (just check to see if the object is entirely contained in current node AABB). If not contained, then walk up the hierarchy toward root. If the object is completely contained in any children, push the object down into the subtree. This was old code and didn't use any SIMD, so I don't know how it would fare against a proper BVH implementation, but it was still significantly faster than a spatial hash.
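In sketch form, the update walk might look like this (hypothetical types and names; assumes each node stores its bounds, parent, and children, plus an aabb::contains() test):

// Sketch of the dynamic octree update described above.
struct octree_node
{
	aabb bounds;
	octree_node* parent;
	octree_node* children[8]; // null when the node is a leaf
};

// Returns the node the object should now live in.
octree_node* update_node(octree_node* node, const aabb& object_bounds)
{
	// Walk up until the object is entirely contained (or we hit the root)
	while (node->parent && !node->bounds.contains(object_bounds))
		node = node->parent;

	// Push the object down while some child fully contains it
	for (bool descended = true; descended; )
	{
		descended = false;
		for (int c = 0; c < 8; c++)
		{
			if (node->children[c] && node->children[c]->bounds.contains(object_bounds))
			{
				node = node->children[c];
				descended = true;
				break;
			}
		}
	}
	return node;
}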

Consider using a hash set for the output pairs, but only after the narrowphase step. Use a pre-allocated vector until then. (a few duplicate collision checks is probably less expensive than building a set of broadphase pairs each frame, since there will be many more broadphase pairs than narrowphase colliding pairs). The hash key is an order-invariant hash (e.g. xor) of the objects that are potentially colliding.
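For example, something along these lines (a sketch; the xor of the two hashed indices gives the same key for (a, b) and (b, a), and the equality predicate must match):

#include <functional>
#include <unordered_set>
#include <utility>

// Sketch: order-invariant hash and equality for an unordered pair of indices.
struct pair_hash
{
	size_t operator()(const std::pair<int, int>& p) const
	{
		// xor is commutative, so (a, b) and (b, a) hash identically
		return std::hash<int>()(p.first) ^ std::hash<int>()(p.second);
	}
};

struct pair_equal
{
	bool operator()(const std::pair<int, int>& a, const std::pair<int, int>& b) const
	{
		return (a.first == b.first && a.second == b.second) ||
		       (a.first == b.second && a.second == b.first);
	}
};

std::unordered_set<std::pair<int, int>, pair_hash, pair_equal> colliding_pairs;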

Thanks for the feedback, Aressera.

 

5 hours ago, Aressera said:

I'm unsure how this would be faster than a well-implemented BVH, since it is essentially a BVH construction algorithm. Do you have a BVH implementation to compare against and relative performance numbers on a variety of benchmarks? Those would be necessary to draw any real conclusions. I feel like there could only be a performance win here if the scene is highly dynamic and would require the BVH to be rebuilt each frame. In most cases, the BVH could be built every 10 frames, then only refitted on the other frames for much less cost than rebuilding every frame. BVH also allows better SIMD utilization using 4-ary or 8-ary trees (intersect single ray or AABB against 4 or 8 nodes at once), because the data can be pre-packed into a form that is fastest for the intersection query. BVH can also apply heuristics (e.g. SAH) during construction to greatly improve performance by choosing a better splitting plane. In theory, you could use the same heuristics in your algorithm, but would have to apply them every frame, rather than every 10+ frames, which would slow things down.

Yeah, it's a tricky comparison. I guess it comes down to whether or not you are tracking pairs between frames. I have a benchmark setup in which about half my scenes have everything moving. When I am comparing methods, I use the total time required to simulate all scenes for some number of steps. So, it's not a very scientific comparison, to be sure. For my BVH, I refit each frame and apply internal tree rotations based on the SAH cost (I used the method presented here: https://www.cs.utah.edu/~aek/research/tree.pdf). It's very likely that I have a sub-optimal BVH implementation though (no SIMD, poor cache locality).

Indeed, if I track pairs between frames and do not query objects which haven't left their bounds, it costs very little. That requires me to use a fattened AABB so that I don't re-test an object unless it has moved out of its bounds. However, the fattened AABB generates a bunch of extra pairs that my narrowphase has to check, which is particularly expensive in my case. I think that's why I'm seeing lower than normal overall performance. One day I will do a more scientific comparison with some existing BVH implementations. Overall, the divide and conquer method is just a novelty due to the memory savings and simplicity. I don't foresee using it in my own engine, because I do want to optimize for the case where nothing is moving.
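For reference, fattening just means inflating the stored box by a margin. A trivial sketch, assuming the same aabb / float2 types as the code above:

// Sketch: inflate an AABB by a fixed margin so slow-moving objects don't
// invalidate their cached broadphase entry every frame.
aabb fatten(const aabb& b, float margin)
{
	aabb fat;
	fat.min = b.min - float2(margin, margin);
	fat.max = b.max + float2(margin, margin);
	return fat;
}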

5 hours ago, Aressera said:

The fastest broadphase I have personally implemented used a dynamic octree. Each object stored a pointer to the node it was in, and nodes had pointers to their parents/children. This allowed the tree to be updated really fast when no objects are moving (just check to see if the object is entirely contained in current node AABB). If not contained, then walk up the hierarchy toward root. If the object is completely contained in any children, push the object down into the subtree. This was old code and didn't use any SIMD, so I don't know how it would fare against a proper BVH implementation, but it was still significantly faster than a spatial hash.

That's very interesting. Octree is one which I haven't tried yet. I will have to give it a go! I'm still trying to beat my uniform hash grid. 

For the octree, do you only insert an object into the node in which it entirely fits, or into all nodes which it overlaps? I'm guessing just the one it fits in, since the object has a pointer to the node it's in. What metric did you use for determining when to subdivide a node? I suppose when testing for collisions, you need to check the node an object is in, as well as traverse down any intersecting children, right?

Very good discussion and code. All broadphase algorithms are either O(n) if hash based, or O(n log n) if based on sorting (SAP) or a hierarchy. Depending on the use case, some do better than others. If all objects have similarly sized bounds, then hash-based approaches are a better choice. dac_bp is very similar to building an AABB tree from a polygon soup for static collision. If all objects move and have different sizes, then this is a great choice, because it doesn't maintain any data structures and cuts straight to the meat; its actual running time is likely to beat that of other O(n log n) algorithms.

However, if game logic requires you to run spatial queries, then the lack of a fast lookup structure is a minus. Also, in many use cases few objects move, and those are bounded with inflated boxes so that you don't have to update the structure every sim step. In practice, the dynamic hierarchy is never rebuilt each frame. Imagine a world with 100,000 barrels where the player can interact only with nearby objects; it is not very optimal to run dac_bp on all of them.

For reference, Box2D and Bullet have different DBVHs, with good analyses of their choices. Box2D uses an AVL tree, which in theory is faster for reads and slower for updates, while Bullet has a runtime incremental balancer for the tree. PhysX has a fast implementation of SAP, but it then has to rebuild the tree from the sorted objects for scene queries.

For keeping unique collision pairs, sorting and removing duplicates is good enough (better than inserting into a set), but I find an open-hashing implementation slightly faster (a single table, no memory fragmentation).

Inflated AABBs do cause more narrowphase checks, and ideally the narrowphase should have a fast rejection path.

