# BVHs (part II)

# Introduction

The last post gave a quick review of a modern BVH construction algorithm, along with implementation details. This time, we are going to look into BVH traversal, in the context of ray tracing. If you use BVHs for something else than ray tracing, you may still want to stick around, since many ideas explained here can be applied to other traversal algorithms.

# Basic Algorithm

Given a ray and a BVH, we want to find the closest intersection between that ray and the primitives contained in the BVH. Sometimes, we can simplify the algorithm a little when we only care about whether there is an intersection at all (when tracing shadow rays, for instance). In that case, the traversal algorithm can terminate immediately once an intersection is found, which in my experience can make the algorithm around 20% faster, depending on the scene.

This means our traversal algorithm should be able to operate in two different modes:
*closest*- and *any*-intersection.
In closest-intersection mode, it behaves as expected, returning the closest intersection
along the ray, and in any-intersection mode, it stops once the first intersection is found,
regardless of whether it is the closest or not.

Before explaining the algorithm, let us first define a ray:

```
struct ray {
float org[3];
float dir[3];
float tmin, tmax;
};
```

This structure encodes the ray origin (a point), the ray direction (a vector), and the minimum and maximum intersection distances (scalar values). In mathematical terms, the ray is defined by the 3D segment:

\[O + t \cdot D \, \text{with} \, O, D \in \mathrm{R}^3, t \in [t_{max}, t_{min}]\]In this equation, \(O\) and \(D\) are the ray origin and direction, respectively, and
\(t\), \(t_{min}\) and \(t_{max}\) are what we will call *intersection distances*.
In such terms, finding the intersection between the ray and the BVH is just finding the minimum
\(t\in[t_{min}, t_{max}]\) such that \(O + t \cdot D\) is on a primitive of the scene.

The basic algorithm that solves this problem is recursive: For each node, we test if the ray intersects the bounding box of that node, and if the intersection is within the current range \([t_{min}, t_{max}]\). If an intersection is found within that range, we recurse over the children of the node if the node is an internal node, or intersect the primitives contained in the node if the node is a leaf.

Thus, here is a basic, naive and inefficient version of the algorithm, using the data structures and BVH layout of the previous post:

```
struct hit {
uint32_t primitive_index;
// Additional info (depends on the primitive type)
};
bool intersect_bvh_recursive(
const struct node* nodes,
const struct primitive* primitives,
size_t node_index,
struct ray* ray, const struct ray_data* ray_data,
struct hit* hit, bool any)
{
const struct node* node = nodes[node_index];
// 1. Test the ray against the node's bounding box
if (!intersect_node_simple(ray, ray_data, node))
return false;
// 2. Intersect primitives or recurse
if (node->primitive_count > 0) {
// The node is a leaf: Intersect primitives contained in it
bool found = false;
for (size_t i = node->first_child_or_primitive, n = i + node->primitive_count; i < n; ++i) {
if (intersect_primitive(ray, &primitives[i], hit)) {
hit->primitive_index = i;
if (any)
return true;
found = true;
}
}
return found;
} else {
// The node is internal: Recurse
bool hit_left = intersect_bvh_recursive(
nodes, primitives,
node->first_child_or_primitive + 0,
ray, ray_data, hit, any);
if (any && hit_left)
return true;
return hit_left |
intersect_bvh_recursive(
nodes, primitives,
node->first_child_or_primitive + 1,
ray, ray_data, hit, any);
}
}
bool intersect_bvh(
const struct node* nodes,
const struct primitive* primitives,
size_t node_index,
struct ray* ray, struct hit* hit, bool any)
{
struct ray_data ray_data;
compute_ray_data(ray, &ray_data);
return intersect_bvh_recursive(nodes, primitives, node_index, ray, &ray_data, hit, any);
}
```

This `intersect_bvh()`

function takes an array of nodes and an array of primitives.
The primitives should already be permutted according to the order defined by the BVH such that
`node->first_child_or_primitive`

is the index into the primitive array if the node is a leaf
(see the previous post for more details).

This code also assumes the existence of `intersect_node_simple()`

and `intersect_primitive()`

.
These will be explained a bit later in this post.
Currently, it is enough to know that they return a boolean indicating whether or not an intersection was found,
and that, if one was found, `intersect_primitive()`

sets the `tmax`

field of `ray`

and the additional info for the intersection in `hit`

.
The `ray_data`

parameter passed to `intersect_node_simple()`

is some precomputed data that speeds up the ray-node test.
The function `compute_ray_data()`

performs the precomputation, given the ray to traverse the BVH with.

Finally, note that `any`

is a boolean flag that controls whether the algorithm operates in closest- or any-intersection mode.
When it is set to true, the algorithm exits immediately once it an intersection is found.
Importantly, when `any`

is false, the recursive case cannot be simplified if an intersection is found in the first child.
This is because the two children of the current node may overlap, in which case the intersection found in the first one may not
be the closest (this is also true when applying the traversal order optimization described later in this post).

# Basic Optimizations

The first optimization is to get a rid of recursion: This will make the code run faster, because we only really need to push node indices on the stack, the rest of the arguments are the same in each recursive call. Another bonus of this transformation is that we no longer need to pop stack frames if an intersection is found in any-intersection mode: We can just exit the traversal loop. Here is the code with this optimization applied:

```
bool intersect_bvh(
const struct node* nodes,
const struct primitive* primitives,
size_t root_index, size_t max_stack_size,
struct ray* ray, struct hit* hit, bool any)
{
uint32_t stack[max_stack_size];
size_t stack_size = 0;
struct ray_data ray_data;
compute_ray_data(ray, &ray_data);
bool found = false;
stack[stack_size++] = root_index;
while (stack_size != 0) {
const struct node* node = nodes[stack[--stack_size]];
// 1. Test the ray against the node's bounding box
if (!intersect_node_simple(ray, &ray_data, node))
continue;
// 2. Intersect primitives or recurse
if (node->primitive_count > 0) {
// The node is a leaf: Intersect primitives contained in it
for (size_t i = node->first_child_or_primitive, n = i + node->primitive_count; i < n; ++i) {
if (intersect_primitive(ray, &primitives[i], hit)) {
hit->primitive_index = i;
if (any)
return true;
found = true;
}
}
} else {
// The node is internal: Recurse
stack[stack_size++] = node->first_child_or_primitive + 0;
stack[stack_size++] = node->first_child_or_primitive + 1;
}
}
return found;
}
```

Interestingly, the code is simpler in that form, mainly due to the fact that the any-intersection mode is much simpler to implement in that framework.
The code as written here uses C99’s VLA to allocate the stack.
The maximum size of that stack is given as the parameter `max_stack_size`

.
It can either be a “large enough” constant, or just be the depth of the BVH (computed at runtime).

Even though that version is leaner and simpler than the previous one, it still lacks a very important optimization: The traversal order (the order in which we push nodes on the stack) might not be optimal if we want to find the closest intersection.

# Optimizing the Traversal Order

As stated above, the `intersect_primitive()`

function changes `ray->tmax`

once we find an intersection,
which means that nodes are not processed when they are further away than the latest intersection.
In the case where the node is internal, we could take advantage of that and push the closest node first,
in the hope that an intersection is found in that subtree, such that the other subtree is culled afterwards.
We refer to that optimization as the *traversal order* optimization.

To my knowledge, there are two ways to go about this.

## Axis-based Traversal Order Optimization

The first is linked to the BVH construction algorithm, and requires to store the axis used during node splitting (in top-down methods) in the BVH node.
I found this idea when reading I. Wald’s *Interactive Rendering With Coherent Ray Tracing* article, but it could potentially be older than that.
Something like this:

```
struct node {
float bounds[6];
uint32_t first_child_or_primitive;
uint32_t primitive_count : 30;
uint32_t split_axis : 2; // 0 for X, 1 for Y, 2 for Z
};
```

If the method used during construction is not split-based (like PLOC, described in the previous post),
then the split axis can be approximated by taking the axis on which the two children bounding boxes differ the most.
Then, we can order the children of each node such that the first child is the one that appears first on the split axis.
This means that `first_child_or_primitive`

points to the child that is the most on the left if `split_axis`

is `0`

, for instance.

During traversal, we first create the following array that contains `1`

if the ray direction is positive or `0`

otherwise, for each axis:

```
int dir[3] = { ray->dir[0] > 0, ray->dir[1] > 0, ray->dir[2] > 0 };
```

Then, we can use that array to change the order in which we push nodes on the stack:

```
int order = dir[node->split_axis];
stack[stack_size++] = node->first_child_or_primitive + order;
stack[stack_size++] = node->first_child_or_primitive + 1 - order;
```

That is all that is needed to apply this optimization using the first method.

## Distance-based Traversal Order Optimization

Let us look now at the second method, which is more general, more accurate, and as an added bonus increases Instruction-Level Parallelism (ILP). ILP, in general, is an expression of how much parallelism is present in a piece of assembly code. Modern processors take advantage of ILP to execute independent instructions in parallel, which makes the code run faster.

The idea is very simple: If the ray-box intersection routine can give you the distance at which the ray enters the box,
additionally to the boolean value, then you can use that to order children when pushing them.
However, you need this entry distance for *both* children to be able to do that.
The solution is to intersect both children together, instead of intersecting the current node.
In other terms, the traversal loop intersects the bounding boxes of the two children of a node, instead of intersecting the bounding box of that node.
There is of course a special case if the BVH is only a leaf, which needs to be handled separately,
either by making sure this never happens and creating dummy nodes that have empty bounding boxes,
or by just having an `if`

statement before entering the traversal loop, which is the approach we are going to take here.

Here is the traversal loop, modified to intersect the two children of a node together:

```
bool intersect_bvh(
const struct node* nodes,
const struct primitive* primitives,
size_t root_index, size_t max_stack_size,
struct ray* ray, struct hit* hit, bool any)
{
uint32_t stack[max_stack_size];
size_t stack_size = 0;
struct ray_data ray_data;
compute_ray_data(ray, &ray_data);
bool found = false;
// Special case when the root is a leaf
const struct node* root = nodes + root_index;
if (root->primitive_count > 0) {
return
intersect_node_simple(ray, &ray_data, root) &&
intersect_leaf(root, primitives, ray, hit, any);
}
// General case
const struct node* left = nodes + root->first_child_or_primitive;
while (true) {
const struct node* right = left + 1;
// Intersect the two children together
float t_entry[2];
bool hit_left = intersect_node_distance(ray, &ray_data, left, t_entry + 0);
bool hit_right = intersect_node_distance(ray, &ray_data, right, t_entry + 1);
#define INTERSECT_CHILD(child) \
if (hit_##child) { \
if (child->primitive_count > 0) { \
found |= intersect_leaf(child, primitives, ray, hit, any); \
if (found && any) \
break; \
child = NULL; \
} \
} else \
child = NULL;
INTERSECT_CHILD(left)
INTERSECT_CHILD(right)
#undef INTERSECT_CHILD
if (left) {
// The left child was intersected
if (right) {
// Both children were intersected, we need to sort them based
// on their distances (only in closest-intersection mode).
if (!any && t_entry[0] > t_entry[1]) {
const struct node* tmp = left;
left = right;
right = tmp;
}
stack[stack_size++] = right->first_child_or_primitive;
}
left = nodes + left->first_child_or_primitive;
} else if (right) {
// Only the right child was intersected
left = nodes + right->first_child_or_primitive;
} else {
// No intersection was found
if (stack_size == 0)
break;
left = nodes + stack[--stack_size];
}
}
return found;
}
```

As you can already see, it is much more complicated than the previous version,
and it also requires two additional functions: `intersect_node_distance()`

and `intersect_leaf()`

.
The first one is, as indicated earlier, a modification of the original `intersect_node_distance()`

that returns
the distance between the ray and the bounding box via its additional parameter.
The second one, `intersect_leaf()`

, is just the part of the traversal that intersects primitives wrapped in its own function:

```
bool intersect_leaf(const struct node* leaf, const struct primitive* primitives, struct ray* ray, struct hit* hit, bool any) {
bool found = false;
for (size_t i = leaf->first_child_or_primitive, n = i + leaf->primitive_count; i < n; ++i) {
if (intersect_primitive(ray, &primitives[i], hit)) {
hit->primitive_index = i;
if (any)
return true;
found = true;
}
}
return found;
}
```

It is important to note that the behavior of the traversal loop above is different from the original one: Leaves will be intersected immediately after being seen (since only nodes that have children can be pushed on the stack). This behavior is in general desirable, since, with it, primitives are found faster and thus the traversal loop culls more nodes. However, if intersecting primitives is extremely expensive, this might be a problem, since that may in fact cause more ray-primitive intersections. If that ever happens, a solution is to push indices instead of pointers on the traversal stack, and to use negative indices to denote leaves. That way, leaves do not have to be processed immediately. Since this is rarely needed, it is left as an exercise to the reader.

# Ray-Node Intersection Routines

The final topic of this post is, of course, the intersection routines.
There are many implementations of ray-box intersection routines out there, but few are correct, and even fewer are efficient.
Therefore, I am going to present the routines I use in my implementation, which are based on T. Ize’s excellent *Robust BVH Ray Traversal* paper.

The ray-node intersection routines I implemented take advantage of the knowledge of the ray octant: They use the sign of the ray direction to know which sides of the bounding box are going to be intersected first. This saves 3 min/max calls, which save a lot of computation since most of the time is typically spent intersecting nodes.

```
struct ray_data {
int octant[3];
// Other fields used in `intersect_axis_min()`/`intersect_axis_max()`
};
void compute_octant(const struct ray* ray, int* octant) {
octant[0] = signbit(ray->dir[0]) ? 1 : 0;
octant[1] = signbit(ray->dir[1]) ? 1 : 0;
octant[2] = signbit(ray->dir[2]) ? 1 : 0;
}
// These two functions guarantee that if the right hand side is not a NaN,
// then the returned value is not an NaN. In C++, as noted by T. Ize,
// using std::min or std::max *does not* give you that guarantee.
float robust_min(float x, float y) { return x < y ? x : y; }
float robust_max(float x, float y) { return x > y ? x : y; }
bool intersect_node_distance(const struct ray* ray, struct ray_data* ray_data, const struct node* node, float* t_entry) {
float tmin_x = intersect_axis_min(0, node->bounds[0 + ray_data->octant[0]], ray, ray_data);
float tmin_y = intersect_axis_min(1, node->bounds[2 + ray_data->octant[1]], ray, ray_data);
float tmin_z = intersect_axis_min(2, node->bounds[4 + ray_data->octant[2]], ray, ray_data);
float tmax_x = intersect_axis_max(0, node->bounds[0 + 1 - ray_data->octant[0]], ray, ray_data);
float tmax_y = intersect_axis_max(1, node->bounds[2 + 1 - ray_data->octant[1]], ray, ray_data);
float tmax_z = intersect_axis_max(2, node->bounds[4 + 1 - ray_data->octant[2]], ray, ray_data);
float tmin = robust_max(robust_max(tmin_x, tmin_y), robust_max(tmin_z, ray->tmin));
float tmax = robust_min(robust_min(tmax_x, tmax_y), robust_min(tmax_z, ray->tmax));
*t_entry = tmin;
return tmin <= tmax;
}
bool intersect_node_simple(const struct ray* ray, const struct node* node) {
float t_entry;
return intersect_node_distance(ray, node, &t_entry);
}
```

The ray octant is represented as an array of three integers, equal to `0`

or `1`

, and with it,
we can compute the index of the minimum or maximum coordinate of the node.
Importantly, we use `signbit()`

and not a comparison with `0.0f`

, since `x < 0.0f`

would return `false`

for `-0.0f`

(because `-0.0f == 0.0f`

):
If we were using this comparison instead of `signbit()`

, we would get `tmin > tmax`

for the components of the direction that are equal to `-0.0f`

.

To complete this implementation, we need to provide a definition for `intersect_axis_min()`

and `intersect_axis_max()`

.
Depending on whether the focus is on performance or correctness, we can implement those differently.

## Robust Ray-Node Intersection

If the focus is to be correct with no false misses for really small ray directions, we can use T. Ize’s approach:

```
struct ray_data {
int octant[3];
float inv_dir[3];
float padded_inv_dir[3];
};
uint32_t float_to_bits(float x) {
uint32_t u;
memcpy(&u, &x, sizeof(float));
return u;
}
float bits_to_float(uint32_t u) {
float x;
memcpy(&x, &u, sizeof(float));
return x;
}
float add_ulp_magnitude(float x, unsigned ulps) {
return isfinite(x) ? bits_to_float(float_to_bits(x) + ulps) : x;
}
void compute_ray_data(const struct ray* ray, struct ray_data* ray_data) {
compute_octant(ray, ray_data->octant);
ray_data->inv_dir[0] = 1.0f / ray->dir[0];
ray_data->inv_dir[1] = 1.0f / ray->dir[1];
ray_data->inv_dir[2] = 1.0f / ray->dir[2];
ray_data->padded_inv_dir[0] = add_ulp_magnitude(ray_data->inv_dir[0], 2);
ray_data->padded_inv_dir[1] = add_ulp_magnitude(ray_data->inv_dir[1], 2);
ray_data->padded_inv_dir[2] = add_ulp_magnitude(ray_data->inv_dir[2], 2);
}
float intersect_axis_min(int axis, float p, const struct ray* ray, const struct ray_data* ray_data) {
return (p - ray->org[axis]) * ray_data->inv_dir[axis];
}
float intersect_axis_max(int axis, float p, const struct ray* ray, const struct ray_data* ray_data) {
return (p - ray->org[axis]) * ray_data->padded_inv_dir[axis];
}
```

For a detailed explanation as to why it is necessary to add 2 ULPs (Unit in the Last Place) to the inverse ray direction, please refer to T. Ize’s paper.
The short story is that the expression `(p - ray->org[axis]) * ray_data->inv_dir[axis]`

has some error that can be bounded to 3/2 ULPs.
Thus, adding 2 ULPs to the maximum intersection distance makes sure that we do not have false misses.

## Fast Ray-Node Intersection

Now, if your meshes are reasonable and if your ray directions are not close to zero, you may want to use a faster algorithm. For that, you can take advantage of the FMA instruction available on modern machines. The portable way to do that is to write a simple wrapper function:

```
#include <math.h>
#if defined(__clang__)
#define ENABLE_FP_CONTRACT \
_Pragma("clang diagnostic push") \
_Pragma("clang diagnostic ignored \"-Wunknown-pragmas\"") \
_Pragma("STDC FP_CONTRACT ON") \
_Pragma("clang diagnostic pop")
#else
#define ENABLE_FP_CONTRACT
#endif
inline float fast_multiply_add(float x, float y, float z) {
#ifdef FP_FAST_FMAF
return fmaf(x, y, z);
#else
ENABLE_FP_CONTRACT
return x * y + z;
#endif
}
```

The C standard defines the preprocessor macro `FP_FAST_FMAF`

when calling `fmaf()`

(both in `math.h`

) is as fast or faster than doing the addition and multiplication separately.
Thus, when this symbol is defined, it is reasonable to expect that a call to `fmaf()`

translates to one FMA instruction on your hardware.
When using clang as a compiler, the macro `FP_FAST_FMAF`

is never defined, and the code relies on a standard pragma `STDC FP_CONTRACT ON`

, which enables
contraction (transforming `a * b + c`

into a fused multiply-add).
That pragma is not supported in gcc, which is why there is this special path for clang.
Note that, unless you compile with dangerous floating point optimizations enabled (using `-ffast-math`

or similar),
then your compiler *will not*, in general, perform contractions, which is why we need this machinery.

With that function, we can write the following `intersect_axis_min()`

and `intersect_axis_max()`

functions:

```
struct ray_data {
int octant[3];
float inv_dir[3];
float scaled_org[3];
};
float safe_inverse(float x) {
return 1.0f / (fabsf(x) <= FLT_EPSILON ? copysignf(FLT_EPSILON, x) : x);
}
void compute_ray_data(const struct ray* ray, struct ray_data* ray_data) {
compute_octant(ray, ray_data->octant);
ray_data->inv_dir[0] = safe_inverse(ray->dir[0]);
ray_data->inv_dir[1] = safe_inverse(ray->dir[1]);
ray_data->inv_dir[2] = safe_inverse(ray->dir[2]);
ray_data->scaled_org[0] = -ray_data->inv_dir[0] * ray->org[0];
ray_data->scaled_org[1] = -ray_data->inv_dir[1] * ray->org[1];
ray_data->scaled_org[2] = -ray_data->inv_dir[2] * ray->org[2];
}
float intersect_axis_min(int axis, float p, const struct ray* ray, const struct ray_data* ray_data) {
(void)ray; // Silence warnings that we are not using `ray`
return fast_multiply_add(p, ray_data->inv_dir[axis], ray_data->scaled_org[axis]);
}
float intersect_axis_max(int axis, float p, const struct ray* ray, const struct ray_data* ray_data) {
return intersect_axis_min(axis, p, ray, ray_data);
}
```

Note that in this case we do need to take care of really small ray direction values,
since the use of FMA makes the result `intersect_axis_min()`

(and `intersect_axis_max()`

)
return a NaN if the ray direction is zero.

# Ray-Triangle Intersection Routine

Now that we have the necessary ray-node intersection routines, the only missing piece to write a traversal kernel is the ray-primitive intersection. In this last section, I will present a super simple and efficient version of the classic Möller-Trumbore ray-triangle intersection algorithm. It uses the following data layout for triangles:

```
struct tri {
float p0[3];
float e1[3];
float e2[3];
float n[3];
};
```

This triangle representation, as you probably guessed, uses a point (`p0`

) and two edges (`e1`

and `e2`

) to define a triangle.
The normal `n`

is the non-normalized cross-product of `e1`

and `e2`

, and is only here to speed up the intersection routine.
Here is the code that translates a triangle defined by three points into that representation:

```
void vsub(float* x, const float* y, const float* z) {
x[0] = y[0] - z[0];
x[1] = y[1] - z[1];
x[2] = y[2] - z[2];
}
void vcross(float* x, const float* y, const float* z) {
x[0] = y[1] * z[2] - y[2] * z[1];
x[1] = y[2] * z[0] - y[0] * z[2];
x[2] = y[0] * z[1] - y[1] * z[0];
}
struct tri make_tri(const float* p0, const float* p1, const float* p2) {
float e1[3], e2[3], n[3];
vsub(e1, p0, p1);
vsub(e2, p2, p0);
vcross(n, e1, e2);
return (struct tri) {
.p0 = { p0[0], p0[1], p0[2] },
.e1 = { e1[0], e1[1], e1[2] },
.e2 = { e2[0], e2[1], e2[2] },
.n = { n[0], n[1], n[2] }
};
}
```

With that data layout, here is the intersection function:

```
float vdot(const float* x, const float* y) {
return x[0] * y[0] + x[1] * y[1] + x[2] * y[2];
}
bool intersect_tri(struct ray* ray, const struct tri* tri, struct hit* hit) {
float c[3], r[3];
vsub(c, tri->p0, ray->org);
vcross(r, ray->dir, c);
float inv_det = 1.0f / vdot(tri->n, ray->dir);
float u = vdot(r, tri->e2) * inv_det;
float v = vdot(r, tri->e1) * inv_det;
float w = 1.0f - u - v;
// These comparisons are designed to return false
// when one of t, u, or v is a NaN
if (u >= 0 && v >= 0 && w >= 0) {
float t = vdot(tri->n, c) * inv_det;
if (t >= ray->tmin && t < ray->tmax) {
ray->tmax = t;
// Set additional info of hit here (e.g. `u` and `v`)
return true;
}
}
return false;
}
```

As an added bonus, this function also computes the barycentric coordinates on the triangle (`u`

and `v`

),
which can then later be used for texture mapping.

# Summary

This post explained how to implement a robust and efficient BVH traversal algorithm. In combination with the previous post on the topic, it should give enough information to start your own implementation. If you already have an existing BVH implementation, I hope that it has given you some ideas on how to optimize your traversal kernels. The next post in the series will show how to apply low-level optimizations to the traversal kernel.