SIMD for faster computing - The Edition Guide

The basics of SIMD are now available! SIMD stands for “single instruction, multiple data.” Consider a function like this:


# #![allow(unused_variables)]
#fn main() {
pub fn foo(a: &[u8], b: &[u8], c: &mut [u8]) {
    for ((a, b), c) in a.iter().zip(b).zip(c) {
        *c = *a + *b;
    }
}
#}

Here, we’re taking two slices, and adding the numbers together, placing the result in a third slice. The simplest possible way to do this would be to do exactly what the code does, and loop through each set of elements, add them together, and store it in the result. However, compilers can often do better. LLVM will usually “autovectorize” code like this, which is a fancy term for “use SIMD.” Imagine that a and b were both 16 elements long. Each element is a u8, and so that means that each slice would be 128 bits of data. Using SIMD, we could put both a and b into 128 bit registers, add them together in a single instruction, and then copy the resulting 128 bits into c. That’d be much faster!

While stable Rust has always been able to take advantage of autovectorization, sometimes, the compiler just isn’t smart enough to realize that we can do something like this. Additionally, not every CPU has these features, and so LLVM may not use them so your program can be used on a wide variety of hardware. The std::arch module allows us to use these kinds of instructions directly, which means we don’t need to rely on a smart compiler. Additionally, it includes some features that allow us to choose a particular implementation based on various criteria. For example:

#[cfg(all(any(target_arch = "x86", target_arch = "x86_64"),
      target_feature = "avx2"))]
fn foo() {
    #[cfg(target_arch = "x86")]
    use std::arch::x86::_mm256_add_epi64;
    #[cfg(target_arch = "x86_64")]
    use std::arch::x86_64::_mm256_add_epi64;

    unsafe {
        _mm256_add_epi64(...);
    }
}

Here, we use cfg flags to choose the correct version based on the machine we’re targeting; on x86 we use that version, and on x86_64 we use its version. We can also choose at runtime:

fn foo() {
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        if is_x86_feature_detected!("avx2") {
            return unsafe { foo_avx2() };
        }
    }

    foo_fallback();
}

Here, we have two versions of the function: one which uses AVX2, a specific kind of SIMD feature that lets you do 256-bit operations. The is_x86_feature_detected! macro will generate code that detects if your CPU supports AVX2, and if so, calls the foo_avx2 function. If not, then we fall back to a non-AVX implementation, foo_fallback. This means that our code will run super fast on CPUs that support AVX2, but still work on ones that don’t, albeit slower.

If all of this seems a bit low-level and fiddly, well, it is! std::arch is specifically primitives for building these kinds of things. We hope to eventually stabilize a std::simd module with higher-level stuff in the future. But landing the basics now lets the ecosystem experiment with higher level libraries starting today. For example, check out the faster crate. Here’s a code snippet with no SIMD:

let lots_of_3s = (&[-123.456f32; 128][..]).iter()
    .map(|v| {
        9.0 * v.abs().sqrt().sqrt().recip().ceil().sqrt() - 4.0 - 2.0
    })
    .collect::<Vec<f32>>();

To use SIMD with this code via faster, you’d change it to this:

let lots_of_3s = (&[-123.456f32; 128][..]).simd_iter()
    .simd_map(f32s(0.0), |v| {
        f32s(9.0) * v.abs().sqrt().rsqrt().ceil().sqrt() - f32s(4.0) - f32s(2.0)
    })
    .scalar_collect();

It looks almost the same: simd_iter instead of iter, simd_map instead of map, f32s(2.0) instead of 2.0. But you get a SIMD-ified version generated for you.

Beyond that, you may never write any of this yourself, but as always, the libraries you depend on may. For example, the regex crate contains these SIMD speedups without you needing to do anything at all!