Introduction

It has been quite a while that I wanted to make a post on the topic of BVHs. I have written many implementations of BVH construction and traversal for CPUs and GPUs, and I thought most of the material online was either too basic, too detailed, incomplete, or just wrong. So there we are: I am going to give an introduction on how to write a fast and correct BVH implementation. Doing so is surprisingly difficult, even if you write non-vectorized scalar code for a CPU.

The end goal of this series of posts is that you, the reader, can feel confident writing your own BVH implementation. Ultimately, I cannot cover absolutely everything on the topic, but I will try to give good references for the topics that I do not present here. For a complete C++ implementation of BVH construction and traversal for CPUs, with multiple algorithms for BVH construction, optimization, and traversal, you can refer to my BVH library on GitHub.

I will not give an overview on SIMD or GPU implementations of BVH traversal and construction here. There are many blog posts and good references online for that, but more importantly, I believe that SIMD implementations can benefit from many of the techniques that I describe here. Moreover, a good scalar single-ray implementation of BVH traversal can easily come close to Embree’s vectorized single-ray implementation (see the benchmarks of my BVH library). An explanation for this is that vectorized single-ray implementations perform in fact more work than scalar versions (in particular, more ray-box tests). This means that, even though vector instructions theoretically could give almost an 8x speedup against scalar code, a good vectorized implementation of BVH traversal will only be around 30% faster than a good scalar one.

Now that the goals are laid out, let us start with the basics: a memory layout for BVH nodes.

BVH layout

There are many possibilities out there, but I am only going to present my personal favorite layout here. The first technique is to drop the idea of using pointers: That will help for serialization to disk, but more importantly, it allows to save space by using 32-bit indices instead of 64-bit pointers. Since we deal with binary BVHs here, another optimization is to store a node’s children next to each other in memory, which means that a node only needs one index to the first child. This gives the final layout:

struct node {
    float bounds[6];
    uint32_t first_child_or_primitive;
    uint32_t primitive_count;
};

With this structure, a leaf is encoded with primitive_count != 0, and in this case first_child_or_primitive is an index into the array of primitives: This gives a range of primitives covered by the leaf. Internal nodes, on the other hand, are encoded with primitive_count == 0, and first_child_or_primitive is then the index of the first child into the array of nodes (the second child is just first_child_or_primitive + 1). A nice property of this layout is that a Node is exactly 32 bytes, which is half the size of a cache line on most platforms. Since this layout also enforces that the two children of the node are next to each other, this makes sure they both exactly fit in one cache line, a property that is very useful during traversal.

If you are new to BVHs and are wondering why we do not use pointers here, please note that a property of BVHs is that they do not have empty nodes. Thus, either a BVH node is a leaf with no children, or it is an internal node with exactly two children. It is impossible for a BVH node to only have one child. If your BVH implementation uses pointers for that reason, you are doing something wrong.

Another important detail is how primitives should be represented. Usually, the application already has primitives stored in an array, in the order in which they come from the scene file or 3D model. Since BVHs are by definition object partitioning schemes, which means that they partition the set of objects (as opposed to space partitioning schemes, which partition space), they are only representing a recursive partition of the original data set. Thus, there are two way to efficiently represent that information:

  1. Use indices: first_child_or_primitive is then an index into the array of indices, which then indices the final array of primitives. This introduces an extra indirection, but allows to keep the order of the original primitives unchanged.
  2. Permute the original primitive array according to the order defined by the BVH. This is by far the most efficient way to proceed, but comes with the price of having to change the order in which the original primitives are laid out. An alternative is of course to have a copy of the primitive data in the order required by the BVH: This is often needed anyway, since primitive intersection routines may require different data to work optimally (e.g. Embree and my BVH library both use a vertex-edge-edge-normal representation instead of a three vertex triangle representation, in order to speed up the Möller-Trumbore test).

In any case, construction algorithms often work with primitive indices initially, and then perform the permutation at the last stage of the construction, if required. This is mainly to minimize data movement and simplify the algorithm, since indices take less space than the primitive data.

Finally, note that the node bounds are stored as one array that interleaves the minimum and maximum coordinates of the bounding box in the following way: min_x, max_x, min_y, max_y, min_z, max_z. This helps during single-ray traversal, since now the ray octant can be precomputed and used to speed up the ray-box test. To help accessing the correct bounds, it might be useful to introduce accessors like this:

struct bbox {
    float min[3];
    float max[3];
};
struct bbox get_node_bbox(const struct node* node) {
    return (struct bbox) {
        .min = { node->bounds[0], node->bounds[2], node->bounds[4] },
        .max = { node->bounds[1], node->bounds[3], node->bounds[5] }
    };
}

Building a BVH

I was initially a fan of Ingo Wald’s binned BVH construction algorithm, since it produces high-quality BVHs relatively fast, and is rather simple to implement. However, I no longer recommend that approach: It is difficult to parallelize, and it still requires a fallback strategy when binning cannot find a good split. A much more modern technique is described in Parallel Locally-Ordered Clustering for Bounding Volume Hierarchy Construction by D. Meister and J. Bittner. Their algorithm (which we will refer to as PLOC in the remainder of this text) is trivially parallelizable, produces high-quality BVHs, and is also relatively simple to implement, albeit a little harder than binned construction. The core idea of the paper is as follows:

  1. Sort primitives according to a space filling curve (e.g. Morton encoding).
  2. Create the initial set of nodes (leaves) by assigning each primitive to its own leaf.
  3. For each node A, find the nearby node B that minimizes the surface area of A U B within the search radius R (parameter of the algorithm). In terms of C code, this search procedure will look like this:
    struct bbox bbox_union(struct bbox bbox1, struct bbox bbox2) {
        return (struct bbox) {
            .min = {
                bbox1.min[0] < bbox2.min[0] ? bbox1.min[0] : bbox2.min[0],
                bbox1.min[1] < bbox2.min[1] ? bbox1.min[1] : bbox2.min[1],
                bbox1.min[2] < bbox2.min[2] ? bbox1.min[2] : bbox2.min[2],
            },
            .max = {
                bbox1.max[0] < bbox2.max[0] ? bbox1.max[0] : bbox2.max[0],
                bbox1.max[1] < bbox2.max[1] ? bbox1.max[1] : bbox2.max[1],
                bbox1.max[2] < bbox2.max[2] ? bbox1.max[2] : bbox2.max[2],
            }
        };
    }
    float bbox_half_area(struct bbox bbox) {
        float extents[] = {
            bbox.max[0] - bbox.min[0],
            bbox.max[1] - bbox.min[1],
            bbox.max[2] - bbox.min[2]
        };
        return extents[0] * (extents[1] + extents[2]) + extents[1] * extents[2];
    }
    size_t find_best_node(size_t node_id, const struct node* nodes, size_t node_count, size_t search_radius) {
        size_t best_id = node_id;
        float best_distance = FLT_MAX;
    
        size_t begin = node_id > search_radius ? node_id - search_radius : 0;
        size_t end   = node_id + search_radius < node_count ? node_id + search_radius : node_count - 1;
        for (size_t other_id = begin; other_id < end; ++other_id) {
            float distance = bbox_half_area(bbox_union(get_node_bbox(nodes[node_id]), get_node_bbox(other_id)));
            if (distance < best_distance) {
                best_distance = distance;
                best_id = other_id;
            }
        }
        return other_id;
    }
    

    It is possible to optimize that further by decomposing the search in two parts: the backward part, and the forward one. The backward search iterates through nodes that are before the current one, and it relies on the distances computed by the previous iterations stored in the distance matrix. The forward part iterates through nodes that are after the current one, and thus cannot use cached distances: It operates like the code given above, except that the computed distances are cached for the following iterations.

  4. For each node, determine if it should be merged with its minimum node (if B is the minimum node for A and vice-versa).
  5. Merge nodes that should be merged and loop back to step 3 if more than one node remains after merging.

The search radius R can be tuned to control the performance/quality ratio of the algorithm, and in my experience a value of 14 is nearly optimal in terms of the quality of the generated BVH. In fact, increasing it further is not guaranteed to produce better trees, a phenomenon that the authors of the paper also report in their results.

Computing Morton Codes

You will find many pieces of code on the internet that generate morton codes, but most of them are only for a fixed number of bits. Some also depend on PDEP (from the BMI extension of modern Intel/AMD CPUs), and I strongly recommend against them: Not only are those instructions slow on some Ryzen processors, but they also do not have vector equivalents, which means that your compiler’s auto-vectorization will produce faster code if you use regular bit manipulation instructions. So what should you use instead? The following code:

uint64_t split(uint64_t x, int log_bits) {
    uint64_t mask = (UINT64_C(1) << (1 << log_bits)) - 1;
    for (int i = log_bits, n = 1 << log_bits; i > 0; --i, n >>= 1) {
        mask = (mask | (mask << n)) & ~(mask << (n / 2));
        x = (x | (x << n)) & mask;
    }
    return x;
}

uint64_t encode(uint64_t x, uint64_t y, uint64_t z, int log_bits) {
    return split(x, log_bits) | (split(y, log_bits) << 1) | (split(z, log_bits) << 2);
}

How does this work? Well there is this split() function that splits the bits of a value apart by interspersing 2 zeros after each bit (for instance, split(1101) = 001001000001). It’s easy to see how with that the final, Morton-encoded value is obtained: We just split the bits of the x, y and z coordinates, and shift the y and z by 1 and 2 bits on the left, respectively, and finally OR the three results. Here is an example:

x = 1101, y = 0110, z = 1011
 split(x)                                    = 001001000001
(split(y) << 1)                              = 000010010000
(split(z) << 2)                              = 100000100100
split(x) | (split(y) << 1) | (split(z) << 2) = 101011110101

But how does this split() function work exactly? It is a divide and conquer algorithm: Initially, we have 1 << log_bits to split. Let’s say we have log_bits == 2, for instance. This means that we have 4 bits to split. To understand the process, consider splitting the 4-digit binary number abcd, whose result should be 00a00b00c00d. This means that a needs to be shifted by 6 bits, b by 4, c by 2, and d by 0. We can get there by first shifting a and b by 4, and keeping c and d where they are, obtaining 00ab0000cd. In that value, b and d are at their correct position, and only c and a need to be shifted by 2 bits. To obtain that value, we can shift the current value abcd and OR it with itself shifted by 4, which gets us 00abcdabcd. We still need to remove the lower two bits of the high part and the higher two bits of the low part (i.e. the inner cdab). To do that, we create the mask 0011000011, which is simply ((00001111 << 4) | 00001111) & ~(00001111 < 2). After AND’ing that mask with the value 00abcdabcd, we indeed obtain the value we wanted 00ab0000cd = 0011000011 & 00abcdabcd. Then, we repeat the process once more, this time shifting only by 2 bits. We obtain abab00cdcd, which we AND with the mask 1001001001, obtained with a formula similar to the one before ((11000011 << 2) | 11000011) & ~(11000011 << 1). The final result is a00b00c00d, as expected.

The nice property about this code is that you can choose the number of bits you need, and any good compiler with optimizations turned on will be able to unroll the loop if log_bits is known at compile-time.

Radix Sort

Radix sort is a really efficient sorting algorithm and is crucial to get the best performance out of PLOC. A simple parallel implementation works in the following way:

  1. Allocate a pair of buffers to hold the copies of the keys and values, and for each thread, allocate 1 << bits_per_iteration buckets.
  2. For each group of bits_per_iteration bits in the key, do, in each thread:
    • Fill the buckets with 0.
    • Iterate over the range of keys covered by the thread and for each of them increment the corresponding bucket: buckets[(key >> current_bit) & ((1 << bits_per_iteration) - 1)]++
    • Perform a prefix sum over the buckets to compute the number of keys that fall into each bucket for each thread For instance, if the buckets look like this:

      Buckets 0 1 2 3
      Thread 1 1 0 1 2
      Thread 2 2 1 0 1
      Thread 3 0 2 1 0

      They should look like this after the prefix sum:

      Buckets 0 1 2 3
      Thread 1 0 3 6 8
      Thread 2 1 3 7 10
      Thread 3 3 4 7 11

      In parallel, this can be done in two steps: first summing vertically, then horizontally.

    • Copy the keys and values to into their destination position into the copy buffers, based on the prefix sum of the buckets, for each thread.
    • Swap the copy buffers with the current ones.

You can find an implementation of that algorithm here.

Implementing PLOC

The steps outlined above describe the general procedure to build a BVH using PLOC. However, there are multiple things that are omitted. First off, how do you allocate BVH nodes? In PBRT and other books on rendering, you sometimes find the use of memory pools. It may sound nice and cool, but it is really the exact opposite of that: terrible and misinformed.

You see, BVHs are trees, which means we can easily find the number of nodes given the number of leaves. For a binary tree, the number of nodes is 2 * leaf_count - 1. It is easy to verify that by induction on the structure of trees. Hence, if you have primitive_count primitives in your BVH, you have at most 2 * primitive_count - 1 nodes. Just use malloc() once to allocate your nodes before building, and if you have sometimes more than one primitive per node (meaning that you end up having fewer nodes than initially planned), then just do a realloc() once construction is done to adjust the size of your buffer. It is highly unlikely a memory pool implementation can beat that, and it is also way easier to implement.

Another issue that is not mentioned is how nodes should be merged in memory. The layout I use is very simple: Start by placing the leaves (made of only one primitive, according to step 2) at the end of the array of nodes. Then, set the initial active range of nodes as [end - primitive_count, end).

The way I implement the algorithm is then as follows: Each iteration of PLOC computes the minimum node for each node in the active range. That gives me an array neighbors[] that contains, for each node, the index of the node that realizes the minimum of the merged surface area (as described in step 3). Then, I need to compute the index where the merged nodes are going to be inserted, and for that I have an array called merged_index[] which contains 1 for nodes which should be merged, and 0 otherwise. At this point it is worth noting that I only want to create one node out of A and B for each pair of nodes A and B that should be merged. I therefore set merged_index[] to 1 for A if the index of A is lower than that of B. Summing things up, for a given node index i, we have:

merged_index[i] = i < neighbors[i] && neighbors[neighbors[i]] == i ? 1 : 0;

Once that array is filled, I can just run a prefix sum over it and that will give me the position where I should start inserting the merged nodes.

Since the merged nodes have to be inserted into a another buffer, to avoid data races, the algorithm performs the merge in a copy of the node buffer, and the two buffers are swapped at the end of each merging iteration.

An implementation of PLOC following this design can be found here.

Collapsing Leaves

Once the BVH is built using PLOC, an easy optimization is to collapse leaves using the Surface Area Heuristic (SAH). Recall that as it stands, the algorithm creates one leaf per primitive. That might be sub-optimal: The parent node may introduce useless bounding-box tests in the case where the two children have similar bounding boxes. To prevent that, one can use a bottom-up algorithm that collapses leaves based on the SAH. The SAH-inspired criterion for collapsing leaves is the following:

\[(N_L+ N_R - C_t) \cdot SA_P <= N_L \cdot SA_L + N_R \cdot SA_R\]

In this equation, \(N_L\) and \(N_R\) are the number of primitives in the left and right leaves, respectively. Similarly, \(SA_L\), \(SA_R\), and \(SA_P\) are the surface area of the left leaf, right leaf, and parent, respectively. Finally \(C_t\) is the cost of traversing a node, expressed with respect to the cost of intersecting a primitive. This means for instance that \(C_t = 1\) means that intersecting a node is as expensive (computationally speaking) as intersecting a primitive.

To implement bottom-up traversal in parallel, one can use the technique described in Maximizing Parallelism in the Construction of BVHs, Octrees, and k-d Trees by T. Karras. The algorithm first computes an array parents[] mapping each node to its parent, and initializes an array of integers called flags[] to zero (for synchronization). Then, the algorithm processes each node in parallel, and for each of them:

  1. Tests if it is a leaf, and if it is not one then skips the node. This ensures that iteration starts from the leaves.
  2. Collapse the current node. Once the node has been collapsed, go to the parent of that node (using parents[]).
  3. Use atomics to increment the flag of the parent. If the previous value returned by the atomic operation is different than 1, then stop processing. Otherwise, go back to step 2. This ensures that the parent is processed only when both children have been processed.

An implementation of that algorithm can be found here (for a generic bottom-up traversal) and here (for the actual leaf collapsing logic).

Summary

This post has looked into building BVHs using PLOC, and gave implementation details that should help you implement your own version of the algorithm. In the next post, I explain how to write a good traversal kernel, that is at the same time robust and fast.