Introduction

The last post presented a simple and robust traversal algorithm for BVHs. This time, we will look into low-level and hardware-agnostic optimizations for this traversal code. Usually, low-level optimizations are not portable, as they are tied to a specific hardware architecture, and they often require writing assembly code directly, or relying on hardware-specific intrinsics. Here, we are going to see how to change the assembly generated by the compiler without actually changing the code too much: we will mostly be adding compiler annotations and attributes. While some of these attributes are compiler-specific, they can always be wrapped with #defines that disable them when they are not supported.

A Healthy Dose of Inlining

In the current version of the traversal kernel, we have several function calls. In general, function calls are costly, so we would like to remove them from the assembly, but still keep them in the code, as they serve as documentation and improve readability.

Before I go on, let me pause for a bit and explain why these function calls are costly. Too often, I read blog posts or articles that repeat that claim without actually going into details, and the result is that programmers believe calls are slow because you need to jump to some memory address, or because the call instruction itself is slow. Unless your function is very small (only a couple of instructions long), the penalty you pay when calling a function does not come from the call or ret instructions, or from the fact that the compiler has to save registers to obey the calling convention. It comes from the fact that the compiler does not optimize across function boundaries. To illustrate that point, consider the following code:

#include <stdbool.h>
#include <stdio.h>

void f(bool c) {
    for (int i = 0; i < 100000; ++i) {
        if (c) printf("%d\n", i);
    }
}

If the compiler inlines a call to f(false), the result is extremely fast, since the compiler will completely remove the loop and the resulting code will essentially do nothing. On the other hand, if that call is not inlined, the result will be slow, because even though the loop does nothing, the compiler has no way of knowing that inside f(): at that point, c is an unknown parameter and not the constant false. Granted, this example is a little extreme, but it shows that inlining can give you a lot more than just removing a couple of call or ret instructions.
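
To make the reasoning step explicit, here is a small sketch of what the optimizer effectively sees when a call to f(false) is inlined. It assumes the f() above is in scope, and the function names are made up for illustration only:

void g(void) {
    f(false); /* before inlining: an opaque call into f() */
}

/* What g() effectively becomes once the call has been inlined and the
 * constant false has been propagated into the branch condition: */
void g_after_inlining(void) {
    for (int i = 0; i < 100000; ++i) {
        /* if (false) printf("%d\n", i);  -- never taken, so it is removed */
    }
    /* dead-code elimination then deletes the now-empty loop as well */
}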

Going back to the traversal kernel, it might make sense to annotate every function it calls with __attribute__((always_inline)) (for gcc or clang), or __forceinline (for MSVC). In practice, I like to apply a little more restraint, so the rules I follow are:

  • If a function is local to a compilation unit and inlining should be determined by the compiler, use static.
  • If a function is local to a compilation unit and small, use static inline.
  • If a function is local to a compilation unit and should always be inlined, use static inline __attribute__((always_inline)) (or the MSVC equivalent, __forceinline; a possible portability wrapper is sketched right after this list).
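
Since these attributes are compiler-specific, they can be hidden behind a #define that degrades gracefully on compilers that do not support them. The wrapper below is only a sketch; the FORCE_INLINE name is an invention for this example and not something used in the actual code:

#if defined(__GNUC__)   /* gcc and clang */
#define FORCE_INLINE static inline __attribute__((always_inline))
#elif defined(_MSC_VER) /* MSVC */
#define FORCE_INLINE static __forceinline
#else                   /* fallback: let the compiler decide */
#define FORCE_INLINE static inline
#endif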

Assuming those rules are followed in the source code, a problem still remains: The any parameter of intersect_bvh() is not a constant, so the compiler will emit a runtime test every time it is used. A remedy to this problem is the following:

  1. Annotate intersect_bvh() with static inline __attribute__((always_inline))
  2. Create two new publicly visible functions:
    struct bvh {
        struct node* nodes;
        primitive_t* primitives;
        index_t root_index;
        size_t depth;
    };
    
    bool intersect_bvh_closest(const struct bvh* bvh, struct ray* ray, hit_t* hit) {
        return intersect_bvh(
            bvh->nodes,
            bvh->primitives,
            bvh->root_index,
            bvh->depth,
            ray, hit, false);
    }
    
    bool intersect_bvh_any(const struct bvh* bvh, struct ray* ray, hit_t* hit) {
        return intersect_bvh(
            bvh->nodes,
            bvh->primitives,
            bvh->root_index,
            bvh->depth,
            ray, hit, true);
    }
    

With this, we now effectively have two optimized versions of the code in which the any parameter has been replaced by a constant (true for intersect_bvh_any() and false for intersect_bvh_closest()). This simple change should noticeably improve performance, with little to no impact on code readability!
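
For reference, the first step of the remedy above would make the kernel itself look roughly like this. This is only a sketch: the parameter list is inferred from the wrappers, and the body is the traversal loop from the previous post, which is elided here.

static inline __attribute__((always_inline))
bool intersect_bvh(
    struct node* nodes,
    primitive_t* primitives,
    index_t root_index,
    size_t depth,
    struct ray* ray,
    hit_t* hit,
    bool any)
{
    /* ... traversal loop from the previous post ... */
    return false; /* placeholder so that the sketch compiles on its own */
}

If MSVC support matters, the raw attribute can be replaced by a portability wrapper like the FORCE_INLINE macro sketched earlier.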

Marking Branches as Cold or Hot

If you run the traversal kernel through a profiler, you will notice that most of the time is spent somewhere around here (this is compiled with -march=native -O3 on a machine with hardware FMA, using the fast ray-node intersection routine and the closest-intersection mode):

.L66:
    vmovss      -56(%rbp), %xmm7
    vmovss      -76(%rbp), %xmm13
    vmovss      -72(%rbp), %xmm12
    vmovss      -68(%rbp), %xmm14
    vmovss      -80(%rbp), %xmm8
    vmovaps     %xmm7, %xmm1
    vmovaps     %xmm13, %xmm5
    vfmadd132ss (%rdx,%r15,4), %xmm12, %xmm1
    vfmadd231ss (%rdx,%r12,4), %xmm15, %xmm5
    vmovaps     %xmm7, %xmm0
    vmovaps     %xmm13, %xmm6
    vfmadd132ss (%rdx,%r9,4), %xmm12, %xmm0
    vfmadd231ss (%rdx,%r11,4), %xmm15, %xmm6
    vmovaps     %xmm14, %xmm3
    vmovaps     %xmm14, %xmm2
    vfmadd132ss (%rdx,%r14,4), %xmm8, %xmm3
    vfmadd132ss (%rdx,%rbx,4), %xmm8, %xmm2
    vmovaps     %xmm7, %xmm11
    vfmadd132ss 32(%rdx,%r9,4), %xmm12, %xmm11
    vmaxss      -52(%rbp), %xmm3, %xmm3
    vminss      %xmm5, %xmm1, %xmm1
    vmovaps     %xmm7, %xmm5
    vfmadd132ss 32(%rdx,%r15,4), %xmm12, %xmm5
    vmovaps     %xmm13, %xmm7
    vmaxss      %xmm6, %xmm0, %xmm0
    vmovaps     %xmm13, %xmm6
    vfmadd231ss 32(%rdx,%r11,4), %xmm15, %xmm6
    vminss      %xmm4, %xmm2, %xmm2
    vmaxss      %xmm3, %xmm0, %xmm0
    vminss      %xmm2, %xmm1, %xmm1
    vmovaps     %xmm14, %xmm3
    vmovaps     %xmm14, %xmm2
    vfmadd132ss 32(%rdx,%r14,4), %xmm8, %xmm3
    vfmadd132ss 32(%rdx,%rbx,4), %xmm8, %xmm2
    vcomiss     %xmm0, %xmm1
    vmaxss      -52(%rbp), %xmm3, %xmm3
    vmovaps     %xmm5, %xmm13
    vmovaps     %xmm7, %xmm5
    vfmadd231ss 32(%rdx,%r12,4), %xmm15, %xmm5
    vmaxss      %xmm6, %xmm11, %xmm11
    vminss      %xmm4, %xmm2, %xmm2
    vmaxss      %xmm3, %xmm11, %xmm11
    vminss      %xmm5, %xmm13, %xmm13
    vminss      %xmm2, %xmm13, %xmm13
    jb          .L38

First off, it is important to realize what we are looking at here. There are a bunch of vfmaddxxxss instructions, along with some vminss/vmaxss, so it is easy to figure out that we are in fact in the ray-node intersection part of the traversal. It is no surprise that we spend most of the time in that part of the code, since we have to intersect several inner nodes before reaching a single leaf.

If you are familiar with assembly, you may have noticed that this code in fact contains reloads from spilled registers. Register spills happen when the contents of a register are temporarily stored on the stack so that the register is free to hold some other data. This happens during register allocation, when the compiler cannot find any more free registers. Register reloads, conversely, happen when the compiler needs to restore data that was spilled earlier. To identify spills or reloads on most architectures, you have to look for occurrences of the stack or frame pointer in memory transfers. In x86 or amd64 (a.k.a. x86-64), the register to look for is rbp or ebp (or rsp/esp when the frame pointer is omitted), and memory transfers can be either regular move instructions (e.g. an instruction that begins with mov), or folded stores/loads (where the memory operand is the destination or source of an instruction that computes something).

In the listing above, there are 7 reloads:

  • 5 regular reloads:
    vmovss -56(%rbp), %xmm7
    vmovss -76(%rbp), %xmm13
    vmovss -72(%rbp), %xmm12
    vmovss -68(%rbp), %xmm14
    vmovss -80(%rbp), %xmm8
    
  • 2 folded reloads:
    vmaxss -52(%rbp), %xmm3, %xmm3
    vmaxss -52(%rbp), %xmm3, %xmm3
    

As you can imagine, it is better to spill and reload in cold parts of the code (those executed less often), since these are expensive operations. Clearly, we would like to remove them from this hot loop and make sure that registers other than those needed for the ray-node intersection are spilled instead. So, how do we do this exactly?

The answer is simple: We have to tell the compiler somehow that our code is hot (executed often). The compiler will then know that it is better to spill registers in cold parts of the code rather than those used by the ray-node intersection. To understand how this works, let us look at the code again:

        float t_entry[2];
        bool hit_left  = intersect_node_distance(ray, left,  t_entry + 0);
        bool hit_right = intersect_node_distance(ray, right, t_entry + 1);

/* For each child: if the ray hit it and it is a leaf, intersect its
 * primitives (stopping early in any-hit mode) and clear the pointer so the
 * child is not traversed further; if the ray missed it, clear the pointer
 * as well. Only inner nodes that were hit keep a non-NULL pointer. */
#define INTERSECT_CHILD(child) \
        if (hit_##child) { \
            if (child->primitive_count > 0) { \
                found |= intersect_leaf(child, primitives, ray, hit, any); \
                if (found && any) \
                    break; \
                child = NULL; \
            } \
        } else \
            child = NULL;

        INTERSECT_CHILD(left)
        INTERSECT_CHILD(right)

#undef INTERSECT_CHILD

Here, we see that the compiler has to allocate registers for the ray-node intersection, but we also see that it needs registers for the ray-primitive intersections in intersect_leaf(). We know, as a matter of fact, that the traversal kernel will traverse more inner nodes than leaves, so we could mark the branch that tests if the child is a leaf as unlikely (in other words, cold).

With gcc and clang, it is easy to do that with the built-in function __builtin_expect():

#if defined(__GNUC__) || defined(__clang__)
#define likely(x)   __builtin_expect((x), true)
#define unlikely(x) __builtin_expect((x), false)
#else
#define likely(x)   (x)
#define unlikely(x) (x)
#endif

Then, we can simply make the following change:

if (unlikely(child->primitive_count > 0)) {
    /* ... */
}

This will mark the branch as cold and will thus inform the compiler that the registers used in intersect_leaf() should probably be spilled instead of those of the ray-node intersection. The resulting assembly, when compiling with gcc, is this:

.L66:
    vmovaps     %xmm10, %xmm0
    vmovaps     %xmm14, %xmm5
    vfmadd132ss (%rax,%r9,4), %xmm15, %xmm0
    vfmadd132ss (%rax,%r11,4), %xmm13, %xmm5
    vmovaps     %xmm11, %xmm3
    vmovaps     %xmm10, %xmm1
    vfmadd132ss (%rax,%r14,4), %xmm9, %xmm3
    vfmadd132ss (%rax,%r15,4), %xmm15, %xmm1
    vmovaps     %xmm14, %xmm4
    vmovaps     %xmm11, %xmm2
    vfmadd132ss (%rax,%r12,4), %xmm13, %xmm4
    vfmadd132ss (%rax,%rbx,4), %xmm9, %xmm2
    vmovss      -52(%rbp), %xmm12
    vmovaps     %xmm14, %xmm8
    vmovaps     %xmm14, %xmm7
    vfmadd132ss 32(%rax,%r11,4), %xmm13, %xmm8
    vfmadd132ss 32(%rax,%r12,4), %xmm13, %xmm7
    vmaxss      %xmm5, %xmm0, %xmm0
    vmovaps     %xmm11, %xmm5
    vfmadd132ss 32(%rax,%r14,4), %xmm9, %xmm5
    vmaxss      %xmm12, %xmm3, %xmm3
    vminss      %xmm6, %xmm2, %xmm2
    vminss      %xmm4, %xmm1, %xmm1
    vmovaps     %xmm10, %xmm4
    vfmadd132ss 32(%rax,%r15,4), %xmm15, %xmm4
    vmaxss      %xmm3, %xmm0, %xmm0
    vmovaps     %xmm10, %xmm3
    vfmadd132ss 32(%rax,%r9,4), %xmm15, %xmm3
    vminss      %xmm2, %xmm1, %xmm1
    vmovaps     %xmm11, %xmm2
    vfmadd132ss 32(%rax,%rbx,4), %xmm9, %xmm2
    vcomiss     %xmm0, %xmm1
    vmaxss      %xmm12, %xmm5, %xmm5
    vminss      %xmm7, %xmm4, %xmm4
    vmaxss      %xmm8, %xmm3, %xmm3
    vminss      %xmm6, %xmm2, %xmm2
    vmaxss      %xmm5, %xmm3, %xmm3
    vminss      %xmm2, %xmm4, %xmm4
    jb          .L38

As you can see, only one reload remains, which is much better than before, when there were 7 of them. In terms of performance, this change amounts to a few percent (around +5% on the scenes I tested), and it comes at very little cost in terms of complexity. To remove the final reload, we have to tweak the probability used by __builtin_expect(). To do that, we use __builtin_expect_with_probability() instead, and specify a probability of 95% (the default assumed by __builtin_expect() is 90%):

#define unlikely(x) __builtin_expect_with_probability((x), false, 0.95)
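
As a side note, __builtin_expect_with_probability() is a relatively recent addition (it appeared in gcc 9), so if older compilers must still build the code, the definition can be guarded. The sketch below shows one conservative way to do it, in the spirit of the fallback macros above (compilers without __has_builtin simply fall back to plain __builtin_expect()); it is not taken from the actual code:

#if defined(__has_builtin)
#  if __has_builtin(__builtin_expect_with_probability)
#    define unlikely(x) __builtin_expect_with_probability((x), false, 0.95)
#  else
#    define unlikely(x) __builtin_expect((x), false)
#  endif
#else
#  define unlikely(x) __builtin_expect((x), false)
#endif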

After this change, the final reload is gone. For completeness, we can also use unlikely() in the special case where the root node is a leaf:

if (unlikely(root->primitive_count > 0)) {
    /* ... */
}

It is important to understand that this optimization is a trade-off: We do not get free registers out of thin air; we take them away from another part of the code. In the cases presented above, though, we know statically that the branches indeed have a low probability of being taken, so having spills and reloads in those branches will not cause a large performance drop, and the benefit of having free registers in hotter parts of the code more than makes up for it anyway.

All in all, if you have to optimize the performance of your programs at this level, it might take a bit of trial and error to find the best place to put annotations, or the best function to inline. To help me go through that tedious process, I usually use gcc -S -o assembly.s ... to get the generated assembly for one file, and perf record ... followed by perf report to find out which parts of the code are the bottleneck.

At this point, it might make sense to say that if you are using MSVC, you should probably get a decent compiler like clang or gcc if you plan on doing such low-level optimizations. My experience with MSVC has been terrible, not only on the language-support side, where it does not support all the features of C99, a standard that is now more than 20 years old at the time of writing, but also in terms of generated assembly quality, where even simple things such as using std::fill on pointers generate abysmally bad code.

Summary

This post covered two simple but really effective ways to drive the output of your compiler. The built-ins and attributes we used are all hardware-agnostic, but they might not be supported by every C or C++ compiler. An alternative is to use PGO (Profile-Guided Optimization), but there is no guarantee that the compiler will use the profiling information correctly and do a better job at register allocation.