Verifying Rust Zeroize with Assembly...including portable SIMD

When writing code that deals with sensitive information like passwords or payment data, it's important to zeroize memory when you're done with it. Failing to do so can leave sensitive data in memory even after the program has terminated, and it can even end up on disk when the computer uses swap.

In this post, I'll explain what zeroizing is, why and when you should use it and how to implement it correctly.

What is Zeroizing?

When a sensitive value, say an encryption key, is used in a program it must be stored in memory: either on the stack or in the heap. In either case, even after memory is dropped (or freed, garbage collected etc), the contents may still lurk in the computer - even beyond the life of the program. It is therefore important that such data be cleared before the memory is dropped so that secrets are not leaked to unexpected places.

Why is Zeroizing important?

The code below demonstrates that even after it has been dropped, data stored in a given memory location can still be read.

use std::mem;
use std::ptr;

struct SensitiveData {
    data: [u8; 16],  // Representing sensitive data
}

fn main() {
    // Some mock sensitive data
    let sensitive = SensitiveData { data: [42; 16] };

    let data_location = &sensitive.data as *const u8;
    mem::drop(sensitive);

    // Attempt to read the data back
    // after it has been dropped
    let mut recovered_data = [0u8; 16];
    unsafe {
        ptr::copy_nonoverlapping(
          data_location,
          recovered_data.as_mut_ptr(),
          16
        );
    }

    println!("Recovered data: {:?}", recovered_data);
}

The code creates a mock SensitiveData value and then calls mem::drop directly instead of letting Rust do it when the value goes out of scope. Before doing so, it stores the location of the memory that was used for the data as a raw pointer and then uses that location to read back the original contents of the memory.

While this is a very simple example, it illustrates that dropping memory does not erase it: the data still exists in the system even if the program doesn't care about it anymore.

How to Zeroize

Zeroizing memory is surprisingly tricky. Even Rust, famous for memory safety, has no formal built-in way to do this. The main challenge is stopping the compiler from optimizing away code that it thinks is not necessary.

Let's look at an example.

// lib.rs (simd_zeroize)
pub struct SafeArray([u32; 4]);

impl SafeArray {
    pub fn consume_and_sum(self) -> u32 {
        // Careful! This could overflow!
        self.0.into_iter().sum()
    }
}

In this code, I have a type called SafeArray which just wraps a 4-element array of u32. I've created my own type so that I can implement the Drop trait in a moment.

My type has a single function which consumes self and sums all elements as a u32. Because self is consumed but not returned it will be dropped. (Be aware that this code could easily cause an addition overflow but I'm intentionally keeping it very simple to limit how much assembly code is generated).
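As an aside on the overflow caveat, here is a minimal sketch of a sum that reports overflow instead of wrapping or panicking. The function name checked_sum is my own for illustration and is not part of the code in this post.

```rust
/// Sums four `u32` values, returning `None` if the total does not fit in a `u32`.
pub fn checked_sum(values: [u32; 4]) -> Option<u32> {
    // `checked_add` yields `None` on overflow, which stops `try_fold` early.
    values
        .into_iter()
        .try_fold(0u32, |acc, v| acc.checked_add(v))
}
```

I'm sticking with the plain sum() in the rest of the post because the extra branches would add noise to the assembly.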

Inspecting the compiled code

To really understand what's going on here we can look at the compiled assembly code. I'm working on a Mac and can do this using the objdump tool. Compiler Explorer is also a handy tool but doesn't seem to support Arm assembly which is what Rust will use when compiling on Apple Silicon.

Before looking at the assembly, the code must be compiled in release mode as this will ensure that all of the compiler's target optimizations are applied.

cargo build --release

Then I'll use objdump to disassemble the machine code into Arm64 ASM:

objdump -d target/release/libsimd_zeroize.rlib > assembly.s

Here's the assembly.s file:

0000000000000000 <ltmp0>:
       0: 00 00 c0 3d  	ldr	q0, [x0]
       4: 00 b8 b1 4e  	addv.4s	s0, v0
       8: 00 00 26 1e  	fmov	w0, s0
       c: c0 03 5f d6  	ret

Don't worry if you don't know or understand assembly code, we'll focus just on specific instructions for this exercise.

The line starting with 0000000000000000 is the label Rust has given to the consume_and_sum method and the actual machine instructions are contained below it. These steps load the values from a memory address stored in x0 into a register called q0, add all 4 values in one step (using the vectorized addv.4s instruction), move the result into an output register and return.

Registers are what the CPU uses to perform most operations, so this code loads data from memory into a register so that an operation can be performed.

Implementing Drop

Let's see what happens when we try to implement zeroization when our SafeArray is dropped.

impl Drop for SafeArray {
    fn drop(&mut self) {
        // Demonstration only: Don't do this
        self.0 = [0; 4];
    }
}

This is the ASM for the whole program:

0000000000000000 <ltmp0>:
       0: 00 00 c0 3d  	ldr	q0, [x0]
       4: 00 b8 b1 4e  	addv.4s	s0, v0
       8: 01 00 26 1e  	fmov	w1, s0
       c: 1f 7c 00 a9  	stp	xzr, xzr, [x0]
      10: e0 03 01 aa  	mov	x0, x1
      14: c0 03 5f d6  	ret

The important line is shown below. It uses stp, which stores a pair of registers (in this case the special zero register, xzr, twice) to the memory pointed to by x0. In other words, the memory was zeroed! It worked!

       c: 1f 7c 00 a9  	stp	xzr, xzr, [x0]

But let's not get too excited, yet. We should check that it still works for other types. Changing the code to use u8 instead of u32 (and leaving the drop implementation the same), we have:

// Changed to u8
pub struct SafeArray([u8; 4]);

impl SafeArray {
    pub fn consume_and_sum(self) -> u8 {
        // Careful! This could overflow!
        self.0.into_iter().sum()
    }
}

Compiles to the following:

0000000000000000 <ltmp0>:
       0: 08 20 40 0b  	add	w8, w0, w0, lsr #8
       4: 08 41 40 0b  	add	w8, w8, w0, lsr #16
       8: 00 61 40 0b  	add	w0, w8, w0, lsr #24
       c: c0 03 5f d6  	ret

It looks quite different from the earlier version! The compiler is using a totally different approach. This code does a series of additions involving the original value in w0 and progressively right-shifted versions of it. After each shift, the shifted value is added to an accumulating sum. The shifts are by 8, 16, and then 24 bits, effectively breaking w0 into four bytes, adding these bytes together, and storing the final sum back into w0.
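To make the shift-and-add trick concrete, here is a rough Rust equivalent of what those three add instructions compute. It assumes the four bytes arrive packed little-endian in one 32-bit word, which is my reading of the assembly. Both function names are hypothetical.

```rust
/// Mirrors the assembly: add the word to its right-shifted copies and keep
/// only the low byte of the result.
pub fn sum_packed_bytes(word: u32) -> u8 {
    let total = word
        .wrapping_add(word >> 8)
        .wrapping_add(word >> 16)
        .wrapping_add(word >> 24);
    // Carries only travel upwards, so the low byte is the wrapping byte sum.
    total as u8
}

/// Reference version: a plain wrapping sum over the array.
pub fn sum_array(bytes: [u8; 4]) -> u8 {
    bytes.iter().fold(0u8, |acc, b| acc.wrapping_add(*b))
}
```

For any input, sum_packed_bytes(u32::from_le_bytes(bytes)) should match sum_array(bytes), including when the sum wraps past 255.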

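But where is the zeroizing code!? The compiler decided that our code to zeroize was irrelevant and optimized it away. The likely reason is that the 4-byte array now arrives packed in a single register (w0), so the assignment in drop only writes to a value that nothing ever reads again, and the optimizer treats that as a dead store.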

Avoiding unsafe compiler operations

Compilers are complicated pieces of software and are designed to generate code that is optimal for the target architecture. This means their behaviour can sometimes be hard to reason about and, as in the case above, they can remove code that is important to security in the interests of performance.

We need a different approach to ensure our attempts to zeroize data don't get optimized away.

Thankfully, there is already a crate to do this: Zeroize!

I'll add it to my Cargo.toml with the derive feature enabled as we'll use that in a moment. I've also added #[no_mangle] to the drop which retains symbol names in the generated assembly code and will make things a bit easier to read.

# Cargo.toml

[dependencies]
zeroize = { version = "1.7.0", features = ["derive"] }

Now we can derive Zeroize for SafeArray and call zeroize in the Drop implementation:

use zeroize::Zeroize;

#[derive(Zeroize)]
pub struct SafeArray(pub [u8; 4]);

impl Drop for SafeArray {
    #[no_mangle]
    fn drop(&mut self) {
        self.0.zeroize();
    }
}

The compiled assembly is as follows:

0000000000000000 <ltmp0>:
       0: ff 43 00 d1  	sub	sp, sp, #16
       4: 08 7c 08 53  	lsr	w8, w0, #8
       8: e8 2f 00 39  	strb	w8, [sp, #11]
       c: 09 7c 10 53  	lsr	w9, w0, #16
      10: e9 2b 00 39  	strb	w9, [sp, #10]
      14: 0a 7c 18 53  	lsr	w10, w0, #24
      18: ea 27 00 39  	strb	w10, [sp, #9]
      1c: 08 01 00 0b  	add	w8, w8, w0
      20: 29 01 0a 0b  	add	w9, w9, w10
      24: 00 01 09 0b  	add	w0, w8, w9
      28: ff 33 00 39  	strb	wzr, [sp, #12]
      2c: ff 2f 00 39  	strb	wzr, [sp, #11]
      30: ff 2b 00 39  	strb	wzr, [sp, #10]
      34: ff 27 00 39  	strb	wzr, [sp, #9]
      38: ff 43 00 91  	add	sp, sp, #16
      3c: c0 03 5f d6  	ret

0000000000000040 <_drop>:
      40: 1f 00 00 39  	strb	wzr, [x0]
      44: 1f 04 00 39  	strb	wzr, [x0, #1]
      48: 1f 08 00 39  	strb	wzr, [x0, #2]
      4c: 1f 0c 00 39  	strb	wzr, [x0, #3]
      50: c0 03 5f d6  	ret

There is a lot more code now but for the most part it is doing the same thing as before (the addition is done over several instructions this time though).

The important part is that we have a Drop implementation that is correctly zeroizing memory 🎉. As you can see, the implementation of the Drop trait is conveniently labeled <_drop> (thanks to #[no_mangle]), and the zeroizing code has also been included (via inlining) in the summation code above. In this case, the compiler has used the strb instruction to store the zero register (wzr) into each element of our array.

Using ZeroizeOnDrop

The Zeroize crate comes with a marker trait called ZeroizeOnDrop which works for any Zeroize type and means I don't have to implement Drop every time. I can derive ZeroizeOnDrop instead of using my own Drop implementation.

use zeroize::{Zeroize, ZeroizeOnDrop};

#[derive(Zeroize, ZeroizeOnDrop)]
pub struct SafeArray(pub [u8; 4]);

Caution!

Deriving Zeroize alone won't automatically zeroize memory on drop. Zeroize just provides the zeroize method to clear memory. To zeroize automatically when the value is dropped, also derive ZeroizeOnDrop (or write a Drop implementation that calls zeroize).
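The difference is easy to check without the crate. The sketch below is std-only and uses hypothetical types (ManualOnly, WipedOnDrop) and a hypothetical wipe helper in place of the real Zeroize machinery. It runs each destructor in place with ManuallyDrop and then looks at what is left in the storage.

```rust
use core::mem::ManuallyDrop;
use core::ptr;
use core::sync::atomic::{compiler_fence, Ordering};

// Hypothetical helper: volatile zeroing followed by a compiler fence.
fn wipe(bytes: &mut [u8; 4]) {
    // SAFETY: `bytes` is a valid, aligned, exclusive reference.
    unsafe { ptr::write_volatile(bytes, [0u8; 4]) };
    compiler_fence(Ordering::SeqCst);
}

/// Can be wiped by hand, but nothing happens when it is dropped.
pub struct ManualOnly(pub [u8; 4]);

impl ManualOnly {
    pub fn zeroize(&mut self) {
        wipe(&mut self.0);
    }
}

/// Same method, plus a `Drop` impl that calls it.
pub struct WipedOnDrop(pub [u8; 4]);

impl WipedOnDrop {
    pub fn zeroize(&mut self) {
        wipe(&mut self.0);
    }
}

impl Drop for WipedOnDrop {
    fn drop(&mut self) {
        self.zeroize();
    }
}

/// Runs the (empty) destructor in place, then reports the bytes left behind.
pub fn manual_only_after_drop() -> [u8; 4] {
    let mut slot = ManuallyDrop::new(ManualOnly([42; 4]));
    // SAFETY: the value is dropped exactly once and holds only plain bytes.
    unsafe { ManuallyDrop::drop(&mut slot) };
    slot.0
}

/// Same check for the type whose destructor wipes its contents.
pub fn wiped_on_drop_after_drop() -> [u8; 4] {
    let mut slot = ManuallyDrop::new(WipedOnDrop([42; 4]));
    // SAFETY: as above.
    unsafe { ManuallyDrop::drop(&mut slot) };
    slot.0
}
```

Only the second type leaves zeros behind. Having a zeroize method available is not enough unless something calls it.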

But...what about Portable SIMD?

But you may also be asking, what is SIMD!?

...um, what is SIMD?

Single Instruction, Multiple Data (SIMD) is a parallel processing paradigm used in computer architecture to enhance performance by executing the same operation simultaneously on multiple data points. This approach is especially effective for tasks that require the same computation to be repeated over a large data set, such as in digital signal processing, image and video processing, and scientific simulations. In my case, I'm using SIMD for high-performance cryptography implementations.

SIMD architectures achieve this by employing vector processors or SIMD extensions in CPUs, where a single instruction directs the simultaneous execution of operations on multiple data elements within wider registers. For instance, a SIMD instruction could add or multiply pairs of numbers in a single operation, significantly speeding up computations compared to processing each pair sequentially. This method leverages data-level parallelism, different from the traditional sequential execution model, and is a key feature in modern processors to boost computational efficiency and performance.
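As a point of comparison, this is roughly what a single 8-lane SIMD addition computes, written as an ordinary scalar loop on stable Rust. The function name add_lanes is made up for this sketch. I've used wrapping addition, which I believe matches how integer SIMD addition behaves on overflow.

```rust
/// Adds two 8-lane "vectors" one lane at a time, wrapping on overflow.
pub fn add_lanes(a: [u32; 8], b: [u32; 8]) -> [u32; 8] {
    let mut out = [0u32; 8];
    for i in 0..8 {
        // A SIMD add performs all eight of these additions in one instruction.
        out[i] = a[i].wrapping_add(b[i]);
    }
    out
}
```

The scalar version walks the lanes one by one, while the SIMD version does the same work in a single operation on a wide register.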

For example, with SIMD I can sum 8 arrays of 4 integers in parallel.

#![feature(portable_simd)]
use core::simd::prelude::Simd;

let x: [Simd<u32, 8>; 4] = [
    Simd::from_array([1, 1, 1, 1, 1, 1, 1, 1]),
    Simd::from_array([2, 2, 2, 2, 2, 2, 2, 2]),
    Simd::from_array([1, 2, 3, 4, 5, 6, 7, 8]),
    Simd::from_array([0, 0, 0, 0, 0, 0, 0, 0]),
];

let sums = x.into_iter().reduce(|sum, x| sum + x);
dbg!(sums);

This code outputs:

Some(
    [
        4,  // 1 + 2 + 1 + 0
        5,  // 1 + 2 + 2 + 0
        6,  // etc
        7,
        8,
        9,
        10,
        11,
    ],
)

Neat, huh?!

OK, back to Zeroize for SIMD

While the Zeroize crate is awesome, and you should absolutely use it, it doesn't currently have implementations for the forthcoming portable SIMD modules for Rust. Unlike working with SIMD directly, which requires knowledge of the specific CPU architecture you're building for, Portable SIMD abstracts common CPU vectorizations into a universal interface that works on most architectures.

I've created a type which wraps Simd<u16, 8>, a vector of 8 u16 values and a simple method that adds 2 values, consuming both.

pub struct MySimd(Simd<u16, 8>);

impl MySimd {
    #[no_mangle]
    pub fn consume_and_add(self, other: Self) -> Self {
        Self(self.0 + other.0)
    }
}

The generated assembly is as follows:

0000000000000000 <ltmp0>:
       0: 00 00 c0 3d  	ldr	q0, [x0]
       4: 21 00 c0 3d  	ldr	q1, [x1]
       8: 20 84 60 4e  	add.8h	v0, v1, v0
       c: 00 01 80 3d  	str	q0, [x8]
      10: c0 03 5f d6  	ret

We just added 8 pairs of numbers in only 5 instructions! Let's try adding a Drop implementation.

impl Drop for MySimd {
    fn drop(&mut self) {
        // splat is roughly equivalent to `[0u16; 8]`
        self.0 &= Simd::splat(0);
    }
}

But oh no! The generated assembly is identical! My drop code was completely ignored 😫.

0000000000000000 <ltmp0>:
       0: 00 00 c0 3d  	ldr	q0, [x0]
       4: 21 00 c0 3d  	ldr	q1, [x1]
       8: 20 84 60 4e  	add.8h	v0, v1, v0
       c: 00 01 80 3d  	str	q0, [x8]
      10: c0 03 5f d6  	ret

Using unsafe to be safe!?

Ironically, the only way we can make this code safely and correctly zero memory that may contain sensitive data is to use some unsafe operations. The Zeroize crate itself uses two approaches to avoid compiler optimizations removing zeroizing code. I'll use them both here:

use core::{ptr, sync::atomic};

impl Drop for MySimd {
    fn drop(&mut self) {
        unsafe {
          ptr::write_volatile(self, core::mem::zeroed())
        };
        atomic::compiler_fence(atomic::Ordering::SeqCst);
    }
}

Before explaining what's going on, let's first see if it works.

0000000000000000 <ltmp0>:
       0: 00 e4 00 6f  	movi.2d	v0, #0000000000000000
       4: 00 00 80 3d  	str	q0, [x0]
       8: c0 03 5f d6  	ret

000000000000000c <_consume_and_add>:
       c: 00 00 c0 3d  	ldr	q0, [x0]
      10: 21 00 c0 3d  	ldr	q1, [x1]
      14: 20 84 60 4e  	add.8h	v0, v1, v0
      18: 00 01 80 3d  	str	q0, [x8]
      1c: 00 e4 00 6f  	movi.2d	v0, #0000000000000000
      20: 20 00 80 3d  	str	q0, [x1]
      24: 00 00 80 3d  	str	q0, [x0]
      28: c0 03 5f d6  	ret

The two functions represent the consume_and_add method on MySimd and the drop method in the Drop trait. The top function confusingly denoted by ltmp0 (I'm still not sure why) is the Drop code and it contains:

       0: 00 e4 00 6f  	movi.2d	v0, #0000000000000000

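This fills the vector register v0 with zeros. The str instruction that follows then writes it over the memory pointed to by x0, which holds the value being dropped. Because consume_and_add takes both arguments by value and returns a new vector, both arguments are dropped. You can see that the same code has been inlined into the consume_and_add function, where zeros are stored to both [x1] and [x0].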

So, what's going on here?

Firstly, we're using write_volatile to reliably zero the target memory. Volatile writes are treated as observable side effects, so the compiler will not elide them. Unfortunately, the function is unsafe, but it's the only way to reliably zero the data.

Secondly, we're using what's called an atomic compiler fence, which tells the compiler it is not allowed to reorder memory reads and writes across that point. It doesn't prevent the CPU from doing so in hardware, though that is a post for another day.
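The same two ingredients work for plain byte buffers too. Here is a minimal std-only sketch of a helper that wipes a byte slice. The name zero_bytes is my own, and this illustrates the idea rather than reproducing the Zeroize crate's code.

```rust
use core::ptr;
use core::sync::atomic::{compiler_fence, Ordering};

/// Overwrites every byte of `buf` with zero using volatile writes.
pub fn zero_bytes(buf: &mut [u8]) {
    for byte in buf.iter_mut() {
        // SAFETY: `byte` is a valid, aligned, exclusive reference to a `u8`.
        unsafe { ptr::write_volatile(byte, 0) };
    }
    // Stop the compiler from moving later memory accesses ahead of the writes.
    compiler_fence(Ordering::SeqCst);
}
```

Call it on a buffer just before the buffer goes out of scope, for example zero_bytes(&mut key) where key is a [u8; 32].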

Implementing Zeroize

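Instead of writing the zeroing logic directly in Drop, I can put it in a custom Zeroize implementation. Note that ZeroizeOnDrop is only a marker trait: implementing it by hand adds no drop behaviour (earlier, the Drop implementation came from the derive). As far as I can tell the derive can't help here because the inner Simd field doesn't implement Zeroize, so I still need a small Drop implementation that calls zeroize. The marker then documents that guarantee.

impl Zeroize for MySimd {
    fn zeroize(&mut self) {
        unsafe {
          ptr::write_volatile(self, core::mem::zeroed())
        };
        atomic::compiler_fence(atomic::Ordering::SeqCst);
    }
}

// ZeroizeOnDrop is a marker trait: it adds no behaviour by itself,
// so Drop must call zeroize explicitly.
impl Drop for MySimd {
    fn drop(&mut self) {
        self.zeroize();
    }
}

impl ZeroizeOnDrop for MySimd {}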

Better, safer code

While you may not have this exact problem in your day-to-day code, understanding what's happening under the hood can be instructive and will hopefully lead to better and safer code.

:wq