# Verifying Rust Zeroize with Assembly...including portable SIMD

*Published on Tue Jan 09 2024 00:00:00 GMT+0000 (Coordinated Universal Time)*

*By Dan Draper — CEO and Founder*

When writing code that deals with sensitive information like passwords or payment data, it's important to zeroize memory when you're done with it. Failing to do so can leave sensitive data in memory even after the program has terminated, and that data can even end up on disk when the computer uses swap.

## Content

In this post, I'll explain what zeroizing is, why and when you should use it and how to implement it correctly.

## What is _Zeroizing_?

When a sensitive value, say an encryption key, is used in a program it must be stored in memory: either on the stack or in the heap. In either case, even after memory is dropped (or freed, garbage collected etc), the contents may still lurk in the computer - even beyond the life of the program. It is therefore important that such data be cleared before the memory is dropped so that secrets are not leaked to unexpected places.

## Why is Zeroizing important?

The code below demonstrates that even after it has been dropped, data stored in a given memory location can still be read.

```rust
use std::mem;
use std::ptr;

struct SensitiveData {
    data: [u8; 16],  // Representing sensitive data
}

fn main() {
    // Some mock sensitive data
    let sensitive = SensitiveData { data: [42; 16] };

    let data_location = &sensitive.data as *const u8;
    mem::drop(sensitive);

    // Attempt to read the data back
    // after it has been dropped
    let mut recovered_data = [0u8; 16];
    unsafe {
        ptr::copy_nonoverlapping(
          data_location,
          recovered_data.as_mut_ptr(),
          16
        );
    }

    println!("Recovered data: {:?}", recovered_data);
}
```

The code creates a mock `SensitiveData` value and then calls `mem::drop` directly instead of letting Rust drop it when the value goes out of scope. Before doing so, it stores the address of the data as a raw pointer, and afterwards it uses that pointer to read back the original contents of the memory.

While this is a very simple example, it illustrates that dropping memory does not erase it: the data still exists in the system even if the program doesn't care about it anymore.

## How to Zeroize

Zeroizing memory is surprisingly tricky. Even Rust, famous for its memory safety, has no formal built-in way to do this. The main challenge is stopping the compiler from optimizing away code that it _thinks_ is not necessary.

Let's look at an example.

```rust
// lib.rs (simd_zeroize)
pub struct SafeArray([u32; 4]);

impl SafeArray {
    pub fn consume_and_sum(self) -> u32 {
        // Careful! This could overflow!
        self.0.into_iter().sum()
    }
}
```

In this code, I have a type called `SafeArray` which just wraps a 4-element array of `u32`. I've created my own type so that I can implement the `Drop` trait in a moment.

My type has a single function which consumes `self` and sums all elements as a u32. Because `self` is consumed but not returned it will be dropped. (Be aware that this code could easily cause an addition overflow but I'm intentionally keeping it very simple to limit how much assembly code is generated).
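If the overflow caveat matters in your own code, a checked sum is one way to handle it. The following is a minimal sketch rather than part of the `SafeArray` example; `checked_sum` is a hypothetical helper name, and it would produce different assembly from the listings in this post.

```rust
/// Hypothetical helper (not part of the original `SafeArray` API):
/// sums four `u32` values and returns `None` if the total would
/// overflow, instead of panicking (debug) or wrapping (release).
pub fn checked_sum(values: [u32; 4]) -> Option<u32> {
    values
        .iter()
        // `checked_add` yields `None` on overflow, which stops the fold early
        .try_fold(0u32, |acc, &v| acc.checked_add(v))
}
```

Calling `checked_sum([1, 2, 3, 4])` gives `Some(10)`, while any input whose total exceeds `u32::MAX` gives `None`.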

## Inspecting the compiled code

To really understand what's going on here we can look at the compiled assembly code. I'm working on a Mac and can do this using the `objdump` tool. [Compiler Explorer](https://godbolt.org/) is also a handy tool, but it targets x86-64 by default, so getting the Arm assembly that Rust produces on Apple Silicon means passing an explicit `--target` flag (for example `aarch64-unknown-linux-gnu`).

Before looking at the assembly, the code must be compiled in **release** mode as this will ensure that all of the compiler's target optimizations are applied.

```
cargo build --release
```

Then I'll use `objdump` to disassemble the machine code into Arm64 ASM:

```
objdump -d target/release/libsimd_zeroize.rlib > assembly.s
```

Here's the `assembly.s` file:

```armasm
0000000000000000 <ltmp0>:
       0: 00 00 c0 3d  	ldr	q0, [x0]
       4: 00 b8 b1 4e  	addv.4s	s0, v0
       8: 00 00 26 1e  	fmov	w0, s0
       c: c0 03 5f d6  	ret
```

Don't worry if you don't know or understand assembly code, we'll focus just on specific instructions for this exercise.

The line starting with `0000000000000000` is the label Rust has given to the `consume_and_sum` method and the actual machine instructions are contained below it. These steps load the values from the memory address stored in `x0` into a register called `q0`, add all 4 values in one step (using the vectorized `addv.4s` instruction), move the result into an output register and return.

_Registers are what the CPU uses to perform most operations, so this code loads data from memory into a register so that an operation can be performed._

## Implementing Drop

Let's see what happens when we try to implement zeroization when our `SafeArray` is dropped.

```rust
impl Drop for SafeArray {
    fn drop(&mut self) {
        // Demonstration only: Don't do this
        self.0 = [0; 4];
    }
}
```

This is the ASM for the whole program:

```armasm
0000000000000000 <ltmp0>:
       0: 00 00 c0 3d  	ldr	q0, [x0]
       4: 00 b8 b1 4e  	addv.4s	s0, v0
       8: 01 00 26 1e  	fmov	w1, s0
       c: 1f 7c 00 a9  	stp	xzr, xzr, [x0]
      10: e0 03 01 aa  	mov	x0, x1
      14: c0 03 5f d6  	ret
```

The important line is shown below. It uses `stp` which stores a pair of registers, in this case the special *zero* register, `xzr` in the memory pointed to by `x0`. In other words, the memory was zeroed! It worked!

```armasm
       c: 1f 7c 00 a9  	stp	xzr, xzr, [x0]
```

But let's not get too excited, yet. We should check that it still works for other types. Changing the code to use `u8` instead of `u32` (and leaving the drop implementation the same), we have:

```rust
// Changed to u8
pub struct SafeArray([u8; 4]);

impl SafeArray {
    pub fn consume_and_sum(self) -> u8 {
        // Careful! This could overflow!
        self.0.into_iter().sum()
    }
}
```

Compiles to the following:

```armasm
0000000000000000 <ltmp0>:
       0: 08 20 40 0b  	add	w8, w0, w0, lsr #8
       4: 08 41 40 0b  	add	w8, w8, w0, lsr #16
       8: 00 61 40 0b  	add	w0, w8, w0, lsr #24
       c: c0 03 5f d6  	ret
```

It looks quite different from the earlier version! The compiler is using a totally different approach. This code performs a series of additions involving the original value in `w0` and progressively right-shifted versions of it. After each shift, the shifted value is added to an accumulating sum. The shifts are by 8, 16, and then 24 bits, effectively breaking `w0` into four bytes, adding these bytes together, and storing the final sum back into `w0`.
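To make the shift-and-add trick concrete, here is a rough Rust rendering of what those three `add` instructions compute. It is an illustrative sketch only: `sum_packed_bytes` is a made-up name, and `packed` stands in for `w0` on the assumption that the whole `[u8; 4]` is passed in a single 32-bit register.

```rust
/// Sketch of the three `add` instructions. `packed` plays the role of
/// `w0`, holding the four array bytes in one 32-bit register.
pub fn sum_packed_bytes(packed: u32) -> u8 {
    let sum = packed
        .wrapping_add(packed >> 8) // low byte is now byte0 + byte1
        .wrapping_add(packed >> 16) // ... + byte2
        .wrapping_add(packed >> 24); // ... + byte3
    // Only the low byte is the `u8` result; the upper bits are discarded,
    // which gives the same wrap-around a release build produces on overflow.
    sum as u8
}
```

The upper bits of the intermediate sum contain garbage from the other bytes, but they never affect the low byte, which is all the caller reads.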

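_But where is the zeroizing code!?_ The compiler decided that our zeroizing code was irrelevant and optimized it away. As far as the optimizer can tell, nothing ever reads `self` again after `drop`, so the write of zeros is a _dead store_ that can be removed without changing the program's observable behaviour.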

## Avoiding unsafe compiler operations

Compilers are complicated pieces of software designed to generate code that is optimal for the target architecture. This means their behaviour can sometimes be hard to reason about and, as in the case above, they can remove code that is important to security in the interests of performance.

We need a different approach to ensure our attempts to zeroize data don't get optimized away.

Thankfully, there is already a crate to do this: [Zeroize](https://crates.io/crates/zeroize)!

I'll add it to my `Cargo.toml` with the `derive` feature enabled as we'll use that in a moment. I've also added `#[no_mangle]` to the `drop` method, which retains its symbol name in the generated assembly code and will make things a bit easier to read.

```toml
# Cargo.toml

[dependencies]
zeroize = { version = "1.7.0", features = ["derive"] }
```

Now we can derive `Zeroize` for `SafeArray` and call `zeroize` in the `Drop` implementation:

```rust
use zeroize::Zeroize;

#[derive(Zeroize)]
pub struct SafeArray(pub [u8; 4]);

impl Drop for SafeArray {
    #[no_mangle]
    fn drop(&mut self) {
        self.0.zeroize();
    }
}
```

The compiled assembly is as follows:

```armasm
0000000000000000 <ltmp0>:
       0: ff 43 00 d1  	sub	sp, sp, #16
       4: 08 7c 08 53  	lsr	w8, w0, #8
       8: e8 2f 00 39  	strb	w8, [sp, #11]
       c: 09 7c 10 53  	lsr	w9, w0, #16
      10: e9 2b 00 39  	strb	w9, [sp, #10]
      14: 0a 7c 18 53  	lsr	w10, w0, #24
      18: ea 27 00 39  	strb	w10, [sp, #9]
      1c: 08 01 00 0b  	add	w8, w8, w0
      20: 29 01 0a 0b  	add	w9, w9, w10
      24: 00 01 09 0b  	add	w0, w8, w9
      28: ff 33 00 39  	strb	wzr, [sp, #12]
      2c: ff 2f 00 39  	strb	wzr, [sp, #11]
      30: ff 2b 00 39  	strb	wzr, [sp, #10]
      34: ff 27 00 39  	strb	wzr, [sp, #9]
      38: ff 43 00 91  	add	sp, sp, #16
      3c: c0 03 5f d6  	ret

0000000000000040 <_drop>:
      40: 1f 00 00 39  	strb	wzr, [x0]
      44: 1f 04 00 39  	strb	wzr, [x0, #1]
      48: 1f 08 00 39  	strb	wzr, [x0, #2]
      4c: 1f 0c 00 39  	strb	wzr, [x0, #3]
      50: c0 03 5f d6  	ret
```

There is a lot more code now but for the most part it is doing the same thing as before (the addition is done over several instructions this time though).

The **important part** is that we have a `Drop` implementation that is correctly zeroizing memory 🎉. As you can see, the `Drop` implementation is there, conveniently labeled `<_drop>` (thanks to `#[no_mangle]`), and the zeroizing code has also been included (via _inlining_) in the summation code above. In this case, the compiler has used the `strb` instruction to store the zero register (`wzr`) into each element of our array.

## Using ZeroizeOnDrop

The Zeroize crate comes with a marker trait called `ZeroizeOnDrop` which works for any `Zeroize` type and means I don't have to implement `Drop` every time. I can derive `ZeroizeOnDrop` instead of using my own `Drop` implementation.

```rust
use zeroize::{Zeroize, ZeroizeOnDrop};

#[derive(Zeroize, ZeroizeOnDrop)]
pub struct SafeArray(pub [u8; 4]);
```

## Caution!

Implementing `Zeroize` alone won't automatically zeroize memory on drop. `Zeroize` just provides the `zeroize` method to clear memory. To zeroize automatically when the value is dropped, you also need to derive `ZeroizeOnDrop` (or write your own `Drop` implementation that calls `zeroize`).
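The difference between "can be wiped" and "is wiped on drop" can be shown with the standard library alone. The sketch below is not how the Zeroize crate is implemented; `WipeGuard` and `use_secret` are hypothetical names, and the guard borrows a caller-owned buffer purely so that the effect of `Drop` can be observed afterwards.

```rust
use core::ptr;
use core::sync::atomic::{compiler_fence, Ordering};

/// Hypothetical guard (not part of the zeroize crate): borrows a buffer
/// and wipes it when the guard goes out of scope.
pub struct WipeGuard<'a>(pub &'a mut [u8]);

impl Drop for WipeGuard<'_> {
    fn drop(&mut self) {
        for byte in self.0.iter_mut() {
            // Volatile write so the store is not optimized away.
            // SAFETY: `byte` is a valid, aligned, exclusive reference.
            unsafe { ptr::write_volatile(byte, 0) };
        }
        // Stop the compiler from reordering memory accesses across this point.
        compiler_fence(Ordering::SeqCst);
    }
}

/// Fills `buf` with a mock secret, uses it through the guard and
/// returns the byte total. The wipe happens without an explicit call.
pub fn use_secret(buf: &mut [u8]) -> u32 {
    buf.fill(42);
    let guard = WipeGuard(buf);
    let total: u32 = guard.0.iter().map(|&b| u32::from(b)).sum();
    total
    // `guard` is dropped here, so `buf` is zeroed before we return.
}
```

After `use_secret(&mut buf)` returns, every byte of `buf` is zero even though the function never wipes it explicitly; that automatic hook is what `ZeroizeOnDrop` adds on top of `Zeroize`.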

## But...what about Portable SIMD?

But you may also be asking, what is SIMD!?

## ...um, what is SIMD?

Single Instruction, Multiple Data (SIMD) is a parallel processing paradigm used in computer architecture to enhance performance by executing the same operation simultaneously on multiple data points. This approach is especially effective for tasks that require the same computation to be repeated over a large data set, such as in digital signal processing, image and video processing, and scientific simulations. In my case, I'm using SIMD for high-performance cryptography implementations.

SIMD architectures achieve this by employing vector processors or SIMD extensions in CPUs, where a single instruction directs the simultaneous execution of operations on multiple data elements within wider registers. For instance, a SIMD instruction could add or multiply pairs of numbers in a single operation, significantly speeding up computations compared to processing each pair sequentially. This method leverages data-level parallelism, different from the traditional sequential execution model, and is a key feature in modern processors to boost computational efficiency and performance.

For example, with SIMD I can sum 8 arrays of 4 integers in parallel.

```rust
// Portable SIMD is unstable, so this needs a nightly toolchain
#![feature(portable_simd)]
use core::simd::prelude::Simd;

fn main() {
    // 4 vectors, each holding 8 lanes
    let x: [Simd<u32, 8>; 4] = [
        Simd::from_array([1, 1, 1, 1, 1, 1, 1, 1]),
        Simd::from_array([2, 2, 2, 2, 2, 2, 2, 2]),
        Simd::from_array([1, 2, 3, 4, 5, 6, 7, 8]),
        Simd::from_array([0, 0, 0, 0, 0, 0, 0, 0]),
    ];

    // Each `+` adds all 8 lanes at once
    let sums = x.into_iter().reduce(|sum, x| sum + x);
    dbg!(sums);
}
```

This code outputs:

```rust
Some(
    [
        4,  // 1 + 2 + 1 + 0
        5,  // 1 + 2 + 2 + 0
        6,  // etc
        7,
        8,
        9,
        10,
        11,
    ],
)
```

Neat, huh?!
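For comparison, here is roughly what that reduction does when written lane by lane on stable Rust. This is a scalar sketch for illustration only: `lanewise_sum` is a hypothetical name, and the use of `wrapping_add` assumes that integer `Simd` addition wraps on overflow.

```rust
/// Scalar sketch of the lane-wise reduction: each output lane is the
/// sum of the matching lane in every row. Returns `None` for no rows,
/// mirroring `Iterator::reduce`.
pub fn lanewise_sum(rows: &[[u32; 8]]) -> Option<[u32; 8]> {
    rows.iter().copied().reduce(|mut sum, row| {
        for (lane, value) in sum.iter_mut().zip(row) {
            // Assumption: wrap on overflow, like integer SIMD addition
            *lane = lane.wrapping_add(value);
        }
        sum
    })
}
```

The inner loop runs eight scalar additions per row, which is the work a single vector `add` does in one instruction.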

## OK, back to Zeroize for SIMD

While the Zeroize crate is awesome, and you should absolutely use it, it doesn't currently have implementations for the forthcoming [portable SIMD](https://rust-lang.github.io/portable-simd/core_simd/simd/struct.Simd.html#) modules for Rust. Unlike working with SIMD directly, which requires knowledge of the specific CPU architecture you're building for, Portable SIMD abstracts common CPU vectorizations into a universal interface that works on most architectures.

I've created a type which wraps `Simd<u16, 8>`, a vector of 8 `u16` values and a simple method that adds 2 values, consuming both.

```rust
pub struct MySimd(Simd<u16, 8>);

impl MySimd {
    #[no_mangle]
    pub fn consume_and_add(self, other: Self) -> Self {
        Self(self.0 + other.0)
    }
}
```

The generated assembly is as follows:

```armasm
0000000000000000 <ltmp0>:
       0: 00 00 c0 3d  	ldr	q0, [x0]
       4: 21 00 c0 3d  	ldr	q1, [x1]
       8: 20 84 60 4e  	add.8h	v0, v1, v0
       c: 00 01 80 3d  	str	q0, [x8]
      10: c0 03 5f d6  	ret
```

We just added 8 pairs of numbers in only 5 instructions! Let's try adding a Drop implementation.

```rust
impl Drop for MySimd {
    fn drop(&mut self) {
        // splat(0) is roughly equivalent to `[0u16; 8]`
        self.0 &= Simd::splat(0);
    }
}
```

But oh no! The generated assembly is **identical**! My drop code was completely ignored 😫.

```armasm
0000000000000000 <ltmp0>:
       0: 00 00 c0 3d  	ldr	q0, [x0]
       4: 21 00 c0 3d  	ldr	q1, [x1]
       8: 20 84 60 4e  	add.8h	v0, v1, v0
       c: 00 01 80 3d  	str	q0, [x8]
      10: c0 03 5f d6  	ret
```

## Using unsafe to be _safe_!?

Ironically, the only way we can make this code safely and correctly zero memory that may contain sensitive data is to use some `unsafe` operations. The Zeroize crate itself uses two approaches to avoid compiler optimizations removing zeroizing code. I'll use them both here:

```rust
use core::{ptr, sync::atomic};

impl Drop for MySimd {
    fn drop(&mut self) {
        unsafe {
          ptr::write_volatile(self, core::mem::zeroed())
        };
        atomic::compiler_fence(atomic::Ordering::SeqCst);
    }
}
```

Before explaining what's going on, let's first see if it works. 

```armasm
0000000000000000 <ltmp0>:
       0: 00 e4 00 6f  	movi.2d	v0, #0000000000000000
       4: 00 00 80 3d  	str	q0, [x0]
       8: c0 03 5f d6  	ret

000000000000000c <_consume_and_add>:
       c: 00 00 c0 3d  	ldr	q0, [x0]
      10: 21 00 c0 3d  	ldr	q1, [x1]
      14: 20 84 60 4e  	add.8h	v0, v1, v0
      18: 00 01 80 3d  	str	q0, [x8]
      1c: 00 e4 00 6f  	movi.2d	v0, #0000000000000000
      20: 20 00 80 3d  	str	q0, [x1]
      24: 00 00 80 3d  	str	q0, [x0]
      28: c0 03 5f d6  	ret
```

The two functions represent the `consume_and_add` method on `MySimd` and the `drop` method in the `Drop` trait. The top function confusingly denoted by `ltmp0` (I'm still not sure why) is the Drop code and it contains:

```armasm
       0: 00 e4 00 6f  	movi.2d	v0, #0000000000000000
```

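This sets every bit of the vector register `v0` to zero, and the `str` that follows writes that zeroed register over the memory pointed to by `x0`, which is the value being dropped. You can also see that the same code has been inlined into the `consume_and_add` function: after the result is stored via `x8`, the memory behind both `x1` and `x0` is overwritten with zeros, because both `self` and `other` are consumed and dropped.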

## So, what's going on here?

Firstly, we're using [write_volatile](https://doc.rust-lang.org/std/ptr/fn.write_volatile.html) to reliably zero the target memory. Volatile writes are guaranteed not to be elided by the compiler, so it can't optimize them away! Unfortunately, the function is `unsafe`, but it's the only way to _safely_ zero the data.

![Image description](https://dev-to-uploads.s3.amazonaws.com/uploads/articles/uxo9knbjx5gqjap4ivu3.gif)

Secondly, we're using what's called an atomic [compiler fence](https://doc.rust-lang.org/stable/core/sync/atomic/fn.compiler_fence.html), which tells the compiler it is not allowed to reorder memory accesses across that point. It doesn't prevent the CPU from reordering them in hardware, though that is a post for another day.
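The same two ingredients work on stable Rust as well. The sketch below uses a plain `[u16; 8]` as a stand-in for `Simd<u16, 8>` so it compiles without nightly; `wipe_lanes` is a hypothetical name, not something provided by the Zeroize crate.

```rust
use core::ptr;
use core::sync::atomic::{compiler_fence, Ordering};

/// Overwrites all eight lanes with zeros using a volatile write
/// followed by a compiler fence. `[u16; 8]` is a stable stand-in
/// for `Simd<u16, 8>`.
pub fn wipe_lanes(lanes: &mut [u16; 8]) {
    // SAFETY: `lanes` is a valid, aligned, exclusive reference, and an
    // all-zero array is a valid `[u16; 8]`.
    unsafe { ptr::write_volatile(lanes, [0u16; 8]) };
    // Keep the compiler from moving other memory accesses across the wipe.
    compiler_fence(Ordering::SeqCst);
}
```

Writing an explicit `[0u16; 8]` avoids `mem::zeroed()` altogether, which is a reasonable choice whenever the zero value of the type is easy to spell out.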

## Implementing Zeroize

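Instead of zeroing directly inside `Drop`, I can move that logic into a custom `Zeroize` implementation. Note that `ZeroizeOnDrop` is only a marker trait: implementing it by hand does not generate any drop code, so `Drop` still needs to call `zeroize`.

```rust
use zeroize::{Zeroize, ZeroizeOnDrop};

impl Zeroize for MySimd {
    fn zeroize(&mut self) {
        unsafe {
          ptr::write_volatile(self, core::mem::zeroed())
        };
        atomic::compiler_fence(atomic::Ordering::SeqCst);
    }
}

impl Drop for MySimd {
    fn drop(&mut self) {
        self.zeroize();
    }
}

// Marker only: documents that `Drop` zeroizes this type
impl ZeroizeOnDrop for MySimd {}
```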

## Better, safer code

While you may not have this exact problem in your day-to-day code, understanding what's happening under the hood can be instructive and will hopefully lead to better and safer code.

:wq