BVHs (part III)
Introduction
The last post presented a simple and robust traversal algorithm for BVHs.
This time, we will look into low-level and hardware-agnostic optimizations for this traversal code.
Usually, low-level optimizations are not portable, as they are tied to a specific hardware architecture,
and they often require writing assembly code directly, or relying on hardware-specific intrinsics.
Here, we are going to see how to change the assembly generated by the compiler without actually changing
the code too much: We are mostly adding compiler annotations and attributes.
While some of these attributes are compiler-specific, they can always be wrapped with #defines that disable
them when they are not supported.
A Healthy Dose of Inlining
In the current version of the traversal kernel, we have several function calls. In general, function calls are costly, so we would like to remove them from the assembly, but still keep them in the code, as they serve as documentation and improve readability.
Before I go on, let me pause for a bit and explain why these function calls are costly.
Too often I read blog posts or articles that repeat that claim without actually going into details,
and the result is that programmers think that function calls are slow because you need to jump to some memory address,
or because the call instruction itself is too slow.
Unless your function is very small (only a couple of instructions long), the penalty you pay when calling a function does not
come from the call or ret instructions, or from the fact that the compiler has to save registers to obey the calling convention.
It comes from the fact that the compiler does not optimize across function boundaries.
To illustrate that point, imagine the following code:
#include <stdbool.h>
#include <stdio.h>

void f(bool c) {
    for (int i = 0; i < 100000; ++i) {
        if (c) printf("%d\n", i);
    }
}
If the compiler inlines a call to f(false), the result is extremely fast,
since it will completely remove the loop and the resulting code will essentially do nothing.
On the other hand, if that call is not inlined, then the result will be slow, because even though the loop does nothing,
the compiler has no way of knowing that inside f(), since at that point c is an unknown parameter and not the constant false.
Granted, this example is a little extreme, but it shows that inlining can give you a lot more than just removing a couple of call
or ret instructions.
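To make this more concrete, here is roughly what the compiler ends up with after inlining and constant propagation. The function names below are hypothetical and only serve as an illustration:
#include <stdio.h>

/* What an inlined call to f(false) effectively becomes:
   the branch is always false, so the whole loop is removed as dead code. */
void call_f_false(void) {
    /* nothing left to do */
}

/* What an inlined call to f(true) effectively becomes:
   the branch is always true, so only the loop and the print remain. */
void call_f_true(void) {
    for (int i = 0; i < 100000; ++i)
        printf("%d\n", i);
}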
Going back to the traversal kernel, it might make sense to annotate every function it calls
with __attribute__((always_inline)) (for gcc or clang), or __forceinline (for MSVC).
In practice, I like to apply a little more restraint, so the rules I follow are:
- If a function is local to a compilation unit and inlining should be determined by the compiler, use static.
- If a function is local to a compilation unit and small, use static inline.
- If a function is local to a compilation unit and should always be inlined, use static inline __attribute__((always_inline)) (or the equivalent for MSVC; a portable wrapper macro is sketched below).
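Here is one way such a wrapper could look. This is only a sketch, and the macro name ALWAYS_INLINE is my own choice rather than something taken from the traversal kernel:
#if defined(_MSC_VER)
#define ALWAYS_INLINE __forceinline
#elif defined(__GNUC__) || defined(__clang__)
#define ALWAYS_INLINE inline __attribute__((always_inline))
#else
#define ALWAYS_INLINE inline
#endif

/* Example: a small helper that should always be inlined. */
static ALWAYS_INLINE float min_float(float a, float b) {
    return a < b ? a : b;
}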
Assuming those rules are followed in the source code, a problem still remains:
The any parameter of intersect_bvh() is not a constant, so the compiler will emit runtime tests every time it is mentioned.
A remedy to this problem is the following:
- Annotate intersect_bvh() with static inline __attribute__((always_inline)).
- Create two new publicly visible functions:
struct bvh {
    struct node* nodes;
    primitive_t* primitives;
    index_t root_index;
    size_t depth;
};

bool intersect_bvh_closest(const struct bvh* bvh, struct ray* ray, hit_t* hit) {
    return intersect_bvh(
        bvh->nodes, bvh->primitives,
        bvh->root_index, bvh->depth,
        ray, hit, false);
}

bool intersect_bvh_any(const struct bvh* bvh, struct ray* ray, hit_t* hit) {
    return intersect_bvh(
        bvh->nodes, bvh->primitives,
        bvh->root_index, bvh->depth,
        ray, hit, true);
}
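For reference, the declaration of the annotated intersect_bvh() would then look roughly like this. The parameter list is inferred from the wrappers above, so treat the exact types as an assumption on my part:
/* Annotated traversal kernel; its body was shown in the previous post. */
static inline __attribute__((always_inline)) bool intersect_bvh(
    struct node* nodes, primitive_t* primitives,
    index_t root_index, size_t depth,
    struct ray* ray, hit_t* hit, bool any);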
With this, we effectively have two optimized versions of the code where the any parameter
has been eliminated and replaced by a constant (true for intersect_bvh_any() and false for intersect_bvh_closest()).
This simple change alone should noticeably improve performance, with little to no impact on code readability!
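As a small usage sketch, a shadow-ray query only needs to know whether something is hit at all, so it would go through the any variant. The trace_shadow_ray() helper below is hypothetical and only illustrates the intent:
/* Hypothetical helper: returns true if the shadow ray is blocked by anything.
   It assumes the bvh and the ray have already been set up. */
static inline bool trace_shadow_ray(const struct bvh* bvh, struct ray* shadow_ray) {
    hit_t hit;
    /* Any intersection is enough to know the point is in shadow,
       so the traversal can terminate as soon as one is found. */
    return intersect_bvh_any(bvh, shadow_ray, &hit);
}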
Marking Branches as Cold or Hot
If you run the traversal kernel in the profiler, you will notice that most of the time is spent somewhere around here
(this is compiled with -march=native -O3
on a machine with hardware FMA, using the fast ray-node intersection routine and the closest-intersection mode):
.L66:
vmovss -56(%rbp), %xmm7
vmovss -76(%rbp), %xmm13
vmovss -72(%rbp), %xmm12
vmovss -68(%rbp), %xmm14
vmovss -80(%rbp), %xmm8
vmovaps %xmm7, %xmm1
vmovaps %xmm13, %xmm5
vfmadd132ss (%rdx,%r15,4), %xmm12, %xmm1
vfmadd231ss (%rdx,%r12,4), %xmm15, %xmm5
vmovaps %xmm7, %xmm0
vmovaps %xmm13, %xmm6
vfmadd132ss (%rdx,%r9,4), %xmm12, %xmm0
vfmadd231ss (%rdx,%r11,4), %xmm15, %xmm6
vmovaps %xmm14, %xmm3
vmovaps %xmm14, %xmm2
vfmadd132ss (%rdx,%r14,4), %xmm8, %xmm3
vfmadd132ss (%rdx,%rbx,4), %xmm8, %xmm2
vmovaps %xmm7, %xmm11
vfmadd132ss 32(%rdx,%r9,4), %xmm12, %xmm11
vmaxss -52(%rbp), %xmm3, %xmm3
vminss %xmm5, %xmm1, %xmm1
vmovaps %xmm7, %xmm5
vfmadd132ss 32(%rdx,%r15,4), %xmm12, %xmm5
vmovaps %xmm13, %xmm7
vmaxss %xmm6, %xmm0, %xmm0
vmovaps %xmm13, %xmm6
vfmadd231ss 32(%rdx,%r11,4), %xmm15, %xmm6
vminss %xmm4, %xmm2, %xmm2
vmaxss %xmm3, %xmm0, %xmm0
vminss %xmm2, %xmm1, %xmm1
vmovaps %xmm14, %xmm3
vmovaps %xmm14, %xmm2
vfmadd132ss 32(%rdx,%r14,4), %xmm8, %xmm3
vfmadd132ss 32(%rdx,%rbx,4), %xmm8, %xmm2
vcomiss %xmm0, %xmm1
vmaxss -52(%rbp), %xmm3, %xmm3
vmovaps %xmm5, %xmm13
vmovaps %xmm7, %xmm5
vfmadd231ss 32(%rdx,%r12,4), %xmm15, %xmm5
vmaxss %xmm6, %xmm11, %xmm11
vminss %xmm4, %xmm2, %xmm2
vmaxss %xmm3, %xmm11, %xmm11
vminss %xmm5, %xmm13, %xmm13
vminss %xmm2, %xmm13, %xmm13
jb .L38
First off, it is important to realize what we are looking at here.
There are a bunch of vfmadd…ss instructions, along with some vminss/vmaxss,
so it is easy to figure out that we are in fact in the ray-node intersection part of the traversal.
It is no surprise that we spend most of the time traversing rays in that part of the code,
since we have to intersect several inner nodes before reaching a single leaf.
If you are familiar with assembly, you may have noticed that this code in fact contains reloads from spilled registers.
Register spills are essentially register values being temporarily stored on the stack, so that the registers are free to hold some other data.
This happens during register allocation, when the compiler cannot find any more free registers to store data.
Register reloads, conversely, happen when the compiler needs to restore some data that was stored during a spill.
To identify spills or reloads on most architectures, you have to look for occurrences of a stack-related register in memory transfers.
In x86 or amd64 (a.k.a. x86-64), that means the frame pointer rbp or ebp (or the stack pointer rsp or esp when the frame pointer is omitted),
and memory transfers can be either regular move instructions (e.g. an instruction that begins with mov),
or folded stores/loads (as the destination or source of an instruction that computes something).
In the listing above, there are 7 reloads:
- 5 regular reloads:
vmovss -56(%rbp), %xmm7
vmovss -76(%rbp), %xmm13
vmovss -72(%rbp), %xmm12
vmovss -68(%rbp), %xmm14
vmovss -80(%rbp), %xmm8
- 2 folded reloads:
vmaxss -52(%rbp), %xmm3, %xmm3
vmaxss -52(%rbp), %xmm3, %xmm3
As you can imagine, it is better to spill and reload in cold parts of the code (those executed less often), since spilling and reloading is expensive. Clearly, we would like to remove them from here and make sure that registers other than those needed for the ray-node intersection are spilled. So, how do we do this exactly?
The answer is simple: We have to tell the compiler somehow that our code is hot (executed often). The compiler will then know that it is better to spill registers from cold parts of the code instead of those of the ray-node intersection. To understand how this works, let us look again at the code:
float t_entry[2];
bool hit_left = intersect_node_distance(ray, left, t_entry + 0);
bool hit_right = intersect_node_distance(ray, right, t_entry + 1);
#define INTERSECT_CHILD(child) \
    if (hit_##child) { \
        if (child->primitive_count > 0) { \
            found |= intersect_leaf(child, primitives, ray, hit, any); \
            if (found && any) \
                break; \
            child = NULL; \
        } \
    } else \
        child = NULL;
INTERSECT_CHILD(left)
INTERSECT_CHILD(right)
#undef INTERSECT_CHILD
Here, we see that the compiler has to allocate registers for the ray-node intersection,
but we also see that it needs registers for the ray-primitive intersections in intersect_leaf().
We know, as a matter of fact, that the traversal kernel will traverse more inner nodes than leaves,
so we could mark the branch that tests if the child is a leaf as unlikely (in other words, cold).
With gcc and clang, it is easy to do that with the built-in function __builtin_expect():
#if defined(__GNUC__) || defined(__clang__)
#define likely(x) __builtin_expect((x), true)
#define unlikely(x) __builtin_expect((x), false)
#else
#define likely(x) (x)
#define unlikely(x) (x)
#endif
Then, we can simply make the following change:
if (unlikely(child->primitive_count > 0)) {
    /* ... */
}
This will mark the branch as cold and will thus inform the compiler that the registers used in intersect_leaf()
should probably be spilled instead of those of the ray-node intersection.
The resulting assembly, when compiling with gcc, is this:
.L66:
vmovaps %xmm10, %xmm0
vmovaps %xmm14, %xmm5
vfmadd132ss (%rax,%r9,4), %xmm15, %xmm0
vfmadd132ss (%rax,%r11,4), %xmm13, %xmm5
vmovaps %xmm11, %xmm3
vmovaps %xmm10, %xmm1
vfmadd132ss (%rax,%r14,4), %xmm9, %xmm3
vfmadd132ss (%rax,%r15,4), %xmm15, %xmm1
vmovaps %xmm14, %xmm4
vmovaps %xmm11, %xmm2
vfmadd132ss (%rax,%r12,4), %xmm13, %xmm4
vfmadd132ss (%rax,%rbx,4), %xmm9, %xmm2
vmovss -52(%rbp), %xmm12
vmovaps %xmm14, %xmm8
vmovaps %xmm14, %xmm7
vfmadd132ss 32(%rax,%r11,4), %xmm13, %xmm8
vfmadd132ss 32(%rax,%r12,4), %xmm13, %xmm7
vmaxss %xmm5, %xmm0, %xmm0
vmovaps %xmm11, %xmm5
vfmadd132ss 32(%rax,%r14,4), %xmm9, %xmm5
vmaxss %xmm12, %xmm3, %xmm3
vminss %xmm6, %xmm2, %xmm2
vminss %xmm4, %xmm1, %xmm1
vmovaps %xmm10, %xmm4
vfmadd132ss 32(%rax,%r15,4), %xmm15, %xmm4
vmaxss %xmm3, %xmm0, %xmm0
vmovaps %xmm10, %xmm3
vfmadd132ss 32(%rax,%r9,4), %xmm15, %xmm3
vminss %xmm2, %xmm1, %xmm1
vmovaps %xmm11, %xmm2
vfmadd132ss 32(%rax,%rbx,4), %xmm9, %xmm2
vcomiss %xmm0, %xmm1
vmaxss %xmm12, %xmm5, %xmm5
vminss %xmm7, %xmm4, %xmm4
vmaxss %xmm8, %xmm3, %xmm3
vminss %xmm6, %xmm2, %xmm2
vmaxss %xmm5, %xmm3, %xmm3
vminss %xmm2, %xmm4, %xmm4
jb .L38
As you can see, only one reload remains, which is much better than before, when there were 7 of them.
In terms of performance, this change amounts to a couple of percent (around +5% faster on the scenes I tested),
and it comes at very little cost in terms of code complexity.
To remove the final reload, we have to tweak the probability used by __builtin_expect().
To do that, we use __builtin_expect_with_probability() instead, and specify a probability of 95%
(the default used by __builtin_expect() is 90%).
#define unlikely(x) __builtin_expect_with_probability((x), false, 0.95)
After this change, the final reload is gone.
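Note that __builtin_expect_with_probability() is a relatively recent addition (it first appeared in gcc 9), so if the code also has to build with older compilers, the macro can be guarded with __has_builtin. Here is a defensive sketch, assuming we are inside the gcc/clang branch of the #if shown earlier:
#if defined(__has_builtin)
#if __has_builtin(__builtin_expect_with_probability)
#define unlikely(x) __builtin_expect_with_probability((x), false, 0.95)
#endif
#endif
/* Fall back to plain __builtin_expect when the probability variant is unavailable. */
#ifndef unlikely
#define unlikely(x) __builtin_expect((x), false)
#endif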
For completeness, we can also use unlikely() in the special case where the root node is a leaf:
if (unlikely(root->primitive_count > 0)) {
    /* ... */
}
It is important to understand that this optimization is a trade-off: We do not get free registers out of thin air, but rather take them from another part of the code. In the cases presented above, though, we know statically that the branches do indeed have a low probability of being taken, so having spills and reloads in those branches will not cause a large performance drop, and the benefit of having free registers for hotter parts of the code more than makes up for it anyway.
All in all, if you have to optimize the performance of your programs at this level,
it might take a bit of trial-and-error to find the best place to write annotations,
or the best function to inline.
In order to help me go through that tedious process, I usually use gcc -S -o assembly.s ...
to get the generated assembly for one file,
and I use perf record .../perf report to get information on which parts of the code are the bottleneck.
At this point it might make sense to say that if you are using MSVC, you should probably switch to a decent compiler like clang or gcc
if you plan on doing such low-level optimizations.
My experience with MSVC has been terrible, not only on the language-support side, where it does not support all the features of C99,
a standard that is now more than 20 years old at the time of writing, but also in terms of generated assembly quality,
where even simple things such as using std::fill on pointers generate abysmally bad code.
Summary
This post covered two simple but effective ways to drive the output of your compiler. The built-ins and attributes we used are all hardware-agnostic, but they might not be supported by all C or C++ compilers. An alternative is to use PGO (Profile-Guided Optimization), but there is no guarantee that the compiler will use the profiling information correctly and do a better job at register allocation.