Verifying Rust Zeroize with Assembly...including portable SIMD
When writing code that deals with sensitive information like passwords or payment data, it's important to zeroize memory when you're done with it. Failing to do so can leave sensitive data in memory even after the program has terminated, and it can even end up on disk if the operating system swaps that memory out.
In this post, I'll explain what zeroizing is, why and when you should use it and how to implement it correctly.
What is Zeroizing?
When a sensitive value, say an encryption key, is used in a program it must be stored in memory: either on the stack or in the heap. In either case, even after memory is dropped (or freed, garbage collected etc), the contents may still lurk in the computer - even beyond the life of the program. It is therefore important that such data be cleared before the memory is dropped so that secrets are not leaked to unexpected places.
Why is Zeroizing important?
The code below demonstrates that even after it has been dropped, data stored in a given memory location can still be read.
use std::mem;
use std::ptr;

struct SensitiveData {
    data: [u8; 16], // Representing sensitive data
}

fn main() {
    // Some mock sensitive data
    let sensitive = SensitiveData { data: [42; 16] };

    let data_location = &sensitive.data as *const u8;
    mem::drop(sensitive);

    // Attempt to read the data back
    // after it has been dropped.
    // Demonstration only: reading memory whose owner
    // has been dropped is undefined behavior.
    let mut recovered_data = [0u8; 16];
    unsafe {
        ptr::copy_nonoverlapping(
            data_location,
            recovered_data.as_mut_ptr(),
            16
        );
    }

    println!("Recovered data: {:?}", recovered_data);
}
The code creates a mock SensitiveData value and then calls mem::drop directly instead of letting Rust do it when the value goes out of scope. Before doing so, it stores the location of the memory that was used for the data as a raw pointer and then uses that location to read back the original contents of the memory.
While this is a very simple example, it illustrates that dropping memory does not erase it: the data still exists in the system even if the program doesn't care about it anymore.
How to Zeroize
Zeroizing memory is surprisingly tricky. Even Rust, famous for memory safety, has no formal built-in way to do this. The main challenge is stopping the compiler from optimizing away code that it thinks is unnecessary.
Let's look at an example.
// lib.rs (simd_zeroize)
pub struct SafeArray([u32; 4]);

impl SafeArray {
    pub fn consume_and_sum(self) -> u32 {
        // Careful! This could overflow!
        self.0.into_iter().sum()
    }
}
In this code, I have a type called SafeArray which just wraps a 4-element array of u32. I've created my own type so that I can implement the Drop trait in a moment.
My type has a single method which consumes self and sums all elements as a u32. Because self is consumed but not returned, it will be dropped when the method returns. (Be aware that this sum could easily overflow, but I'm intentionally keeping the code very simple to limit how much assembly is generated.)
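As an aside on the overflow caveat: the sketch below shows two ways the sum could be made overflow-safe if that mattered. The function names wrapping_sum and checked_sum are made up for illustration and are not part of the post's crate; they take a plain array rather than SafeArray to stay self-contained.

```rust
/// Sums four values, wrapping around on overflow instead of panicking.
pub fn wrapping_sum(values: [u32; 4]) -> u32 {
    values.into_iter().fold(0u32, |acc, v| acc.wrapping_add(v))
}

/// Sums four values, returning `None` if the total does not fit in a `u32`.
pub fn checked_sum(values: [u32; 4]) -> Option<u32> {
    values.into_iter().try_fold(0u32, |acc, v| acc.checked_add(v))
}
```

Either variant would generate more assembly than the plain sum(), which is why the rest of the post sticks with the simple version.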
Inspecting the compiled code
To really understand what's going on here we can look at the compiled assembly code. I'm working on a Mac and can do this using the objdump tool. Compiler Explorer is also a handy tool but doesn't seem to support Arm assembly, which is what Rust will use when compiling on Apple Silicon.
Before looking at the assembly, the code must be compiled in release mode as this will ensure that all of the compiler's target optimizations are applied.
cargo build --release
Then I'll use objdump to disassemble the machine code into Arm64 assembly. Note that a release build ends up under target/release:
objdump -d target/release/libsimd_zeroize.rlib > assembly.s
Here's the assembly.s file:
0000000000000000 <ltmp0>:
    0: 00 00 c0 3d  ldr q0, [x0]
    4: 00 b8 b1 4e  addv.4s s0, v0
    8: 00 00 26 1e  fmov w0, s0
    c: c0 03 5f d6  ret
Don't worry if you don't know or understand assembly code; we'll focus on just a few specific instructions for this exercise.
The line starting with 0000000000000000 is the label Rust has given to the consume_and_sum method and the actual machine instructions are contained below it. These steps load the values from a memory address stored in x0 into a register called q0, add all 4 values in one step (using the vectorized addv.4s instruction), move the result into an output register and return.
Registers are what the CPU uses to perform most operations, so this code loads data from memory into a register so that an operation can be performed.
Implementing Drop
Let's see what happens when we try to implement zeroization when our SafeArray is dropped.
impl Drop for SafeArray {
    fn drop(&mut self) {
        // Demonstration only: Don't do this
        self.0 = [0; 4];
    }
}
This is the ASM for the whole program:
0000000000000000 <ltmp0>:
    0: 00 00 c0 3d  ldr q0, [x0]
    4: 00 b8 b1 4e  addv.4s s0, v0
    8: 01 00 26 1e  fmov w1, s0
    c: 1f 7c 00 a9  stp xzr, xzr, [x0]
   10: e0 03 01 aa  mov x0, x1
   14: c0 03 5f d6  ret
The important line is shown below. It uses stp, which stores a pair of registers (in this case the special zero register, xzr, twice) into the memory pointed to by x0. In other words, the memory was zeroed! It worked!
    c: 1f 7c 00 a9  stp xzr, xzr, [x0]
But let's not get too excited yet. We should check that it still works for other types. Changing the code to use u8 instead of u32 (and leaving the drop implementation the same), we have:
// Changed to u8
pub struct SafeArray([u8; 4]);

impl SafeArray {
    pub fn consume_and_sum(self) -> u8 {
        // Careful! This could overflow!
        self.0.into_iter().sum()
    }
}
This compiles to the following:
0000000000000000 <ltmp0>:
    0: 08 20 40 0b  add w8, w0, w0, lsr #8
    4: 08 41 40 0b  add w8, w8, w0, lsr #16
    8: 00 61 40 0b  add w0, w8, w0, lsr #24
    c: c0 03 5f d6  ret
It looks quite different from the earlier version! The compiler is using a totally different approach. This code performs a series of additions involving the original value in w0 and its progressively right-shifted versions. After each shift, the shifted value is added to an accumulating sum. The shifts are by 8, 16, and then 24 bits, effectively breaking w0 into four bytes, adding these bytes together, and storing the final sum back into w0.
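To make the shift-and-add trick concrete, here is a rough Rust rendering of those three add instructions. It assumes the four array bytes arrive packed into one 32-bit word (as w0 appears to hold them in the listing); sum_packed_bytes and sum_bytes are hypothetical helper names used only for this illustration.

```rust
/// Mirrors the three `add ..., lsr #n` instructions: `packed` holds the four
/// array bytes in a single 32-bit word.
pub fn sum_packed_bytes(packed: u32) -> u8 {
    let total = packed
        .wrapping_add(packed >> 8)
        .wrapping_add(packed >> 16)
        .wrapping_add(packed >> 24);
    // Only the low byte is the u8 result; the upper bits are leftovers.
    total as u8
}

/// Reference implementation: a plain wrapping sum of the four bytes.
pub fn sum_bytes(bytes: [u8; 4]) -> u8 {
    bytes.into_iter().fold(0u8, |acc, b| acc.wrapping_add(b))
}
```

The low byte of each shifted word is one of the original bytes, so the low byte of the total is the sum of all four, modulo 256. Carries only travel upward, so the junk in the higher bits never affects the result.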
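But where is the zeroizing code!? The compiler decided that our code to zeroize was irrelevant and optimized it away. As far as I can tell, that's because nothing reads the array after it is zeroed, which makes the write a "dead store", and here the whole array fits in the w0 register so there isn't even a memory location left to clear.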
Avoiding unsafe compiler operations
Compilers are complicated pieces of software and are designed to generate code that is optimal for the target architecture. This means their behaviour can sometimes be hard to reason about and, as in the case above, they can remove code that is important to security in the interests of performance.
We need a different approach to ensure our attempts to zeroize data don't get optimized away.
Thankfully, there is already a crate to do this: Zeroize!
I'll add it to my Cargo.toml with the derive feature enabled as we'll use that in a moment. I've also added #[no_mangle] to the drop method, which keeps its symbol name readable in the generated assembly code and will make things a bit easier to follow.
# Cargo.toml

[dependencies]
zeroize = { version = "1.7.0", features = ["derive"] }
Now we can derive Zeroize for SafeArray and call zeroize in the Drop implementation:
use zeroize::Zeroize;

#[derive(Zeroize)]
pub struct SafeArray(pub [u8; 4]);

impl Drop for SafeArray {
    #[no_mangle]
    fn drop(&mut self) {
        self.0.zeroize();
    }
}
The compiled assembly is as follows:
0000000000000000 <ltmp0>:
    0: ff 43 00 d1  sub sp, sp, #16
    4: 08 7c 08 53  lsr w8, w0, #8
    8: e8 2f 00 39  strb w8, [sp, #11]
    c: 09 7c 10 53  lsr w9, w0, #16
   10: e9 2b 00 39  strb w9, [sp, #10]
   14: 0a 7c 18 53  lsr w10, w0, #24
   18: ea 27 00 39  strb w10, [sp, #9]
   1c: 08 01 00 0b  add w8, w8, w0
   20: 29 01 0a 0b  add w9, w9, w10
   24: 00 01 09 0b  add w0, w8, w9
   28: ff 33 00 39  strb wzr, [sp, #12]
   2c: ff 2f 00 39  strb wzr, [sp, #11]
   30: ff 2b 00 39  strb wzr, [sp, #10]
   34: ff 27 00 39  strb wzr, [sp, #9]
   38: ff 43 00 91  add sp, sp, #16
   3c: c0 03 5f d6  ret

0000000000000040 <_drop>:
   40: 1f 00 00 39  strb wzr, [x0]
   44: 1f 04 00 39  strb wzr, [x0, #1]
   48: 1f 08 00 39  strb wzr, [x0, #2]
   4c: 1f 0c 00 39  strb wzr, [x0, #3]
   50: c0 03 5f d6  ret
There is a lot more code now but for the most part it is doing the same thing as before (the addition is done over several instructions this time though).
The important part is that we have a Drop implementation that is correctly zeroizing memory 🎉. As you can see, there is the implementation of the Drop trait, conveniently labeled <_drop> (thanks to #[no_mangle]), and the zeroizing code has also been included (via inlining) in the summation code above. In this case, the compiler has used the strb instruction to store the zero register (wzr) into each element of our array.
Using ZeroizeOnDrop
The Zeroize crate comes with a marker trait called ZeroizeOnDrop which works for any Zeroize type and means I don't have to implement Drop every time. I can derive ZeroizeOnDrop instead of using my own Drop implementation.
use zeroize::{Zeroize, ZeroizeOnDrop};

#[derive(Zeroize, ZeroizeOnDrop)]
pub struct SafeArray(pub [u8; 4]);
Caution!
Implementing Zeroize alone won't automatically zeroize memory on drop. Zeroize just provides the zeroize method to clear memory. To zeroize automatically when the value is dropped, derive ZeroizeOnDrop as well (the derive is what generates the Drop implementation) or call zeroize from your own Drop implementation.
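To illustrate the difference between having a method that clears memory and actually running it on drop, here is a small standard-library-only sketch. WipeOnDrop is a hypothetical guard type invented for this example (it is not part of the Zeroize crate). It borrows a buffer, exposes an explicit wipe method, and also calls that method from Drop, using the same volatile write and compiler fence techniques covered later in this post.

```rust
use core::ptr;
use core::sync::atomic::{compiler_fence, Ordering};

/// Hypothetical guard: borrows a byte buffer and wipes it when dropped.
pub struct WipeOnDrop<'a> {
    bytes: &'a mut [u8],
}

impl<'a> WipeOnDrop<'a> {
    pub fn new(bytes: &'a mut [u8]) -> Self {
        Self { bytes }
    }

    pub fn as_bytes(&self) -> &[u8] {
        self.bytes
    }

    /// Explicit wipe: on its own, this only runs if somebody calls it.
    pub fn wipe(&mut self) {
        for byte in self.bytes.iter_mut() {
            // Volatile write, so the compiler may not elide the store.
            unsafe { ptr::write_volatile(byte, 0) };
        }
        compiler_fence(Ordering::SeqCst);
    }
}

impl Drop for WipeOnDrop<'_> {
    // This is the part that makes the wipe automatic.
    fn drop(&mut self) {
        self.wipe();
    }
}
```

Because the guard only borrows the buffer, the caller can check the buffer after the guard goes out of scope and see that it has been cleared. Without the Drop implementation, the buffer would keep its contents unless wipe was called by hand.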
But...what about Portable SIMD?
But you may also be asking, what is SIMD!?
...um, what is SIMD?
Single Instruction, Multiple Data (SIMD) is a parallel processing paradigm used in computer architecture to enhance performance by executing the same operation simultaneously on multiple data points. This approach is especially effective for tasks that require the same computation to be repeated over a large data set, such as in digital signal processing, image and video processing, and scientific simulations. In my case, I'm using SIMD for high-performance cryptography implementations.
SIMD architectures achieve this by employing vector processors or SIMD extensions in CPUs, where a single instruction directs the simultaneous execution of operations on multiple data elements within wider registers. For instance, a SIMD instruction could add or multiply pairs of numbers in a single operation, significantly speeding up computations compared to processing each pair sequentially. This method leverages data-level parallelism, different from the traditional sequential execution model, and is a key feature in modern processors to boost computational efficiency and performance.
For example, with SIMD I can add up 4 vectors of 8 integers each, producing all 8 lane-wise sums in parallel.
#![feature(portable_simd)]
use core::simd::prelude::Simd;

fn main() {
    // Requires a nightly toolchain (portable_simd is unstable)
    let x: [Simd<u32, 8>; 4] = [
        Simd::from_array([1, 1, 1, 1, 1, 1, 1, 1]),
        Simd::from_array([2, 2, 2, 2, 2, 2, 2, 2]),
        Simd::from_array([1, 2, 3, 4, 5, 6, 7, 8]),
        Simd::from_array([0, 0, 0, 0, 0, 0, 0, 0]),
    ];

    let sums = x.into_iter().reduce(|sum, x| sum + x);
    dbg!(sums);
}
This code outputs:
Some(
    [
        4, // 1 + 2 + 1 + 0
        5, // 1 + 2 + 2 + 0
        6, // etc
        7,
        8,
        9,
        10,
        11,
    ],
)
Neat, huh?!
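If you don't have a nightly toolchain handy, the following stable-Rust sketch shows what the lane-wise addition computes, written as ordinary scalar code. The names add_lanes and sum_rows are invented for this illustration, and wrapping_add is used simply so the sketch never panics on overflow. It is a model of the semantics only; it makes no promise that the compiler will emit vector instructions for it.

```rust
/// Lane-wise addition of two 8-lane "vectors", written as scalar code.
pub fn add_lanes(a: [u32; 8], b: [u32; 8]) -> [u32; 8] {
    core::array::from_fn(|i| a[i].wrapping_add(b[i]))
}

/// Adds all rows lane by lane; returns `None` for empty input,
/// just like `Iterator::reduce`.
pub fn sum_rows(rows: &[[u32; 8]]) -> Option<[u32; 8]> {
    rows.iter().copied().reduce(add_lanes)
}
```

Feeding it the same four rows as the SIMD example gives the same eight sums; the difference with real SIMD is that each add_lanes step can be a single instruction.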
OK, back to Zeroize for SIMD
While the Zeroize crate is awesome, and you should absolutely use it, it doesn't currently have implementations for the forthcoming portable SIMD modules for Rust. Unlike working with SIMD directly, which requires knowledge of the specific CPU architecture you're building for, Portable SIMD abstracts common CPU vectorizations into a universal interface that works on most architectures.
I've created a type which wraps Simd<u16, 8>, a vector of 8 u16 values, and a simple method that adds 2 values, consuming both.
pub struct MySimd(Simd<u16, 8>);

impl MySimd {
    #[no_mangle]
    pub fn consume_and_add(self, other: Self) -> Self {
        Self(self.0 + other.0)
    }
}
The generated assembly is as follows:
0000000000000000 <ltmp0>:
    0: 00 00 c0 3d  ldr q0, [x0]
    4: 21 00 c0 3d  ldr q1, [x1]
    8: 20 84 60 4e  add.8h v0, v1, v0
    c: 00 01 80 3d  str q0, [x8]
   10: c0 03 5f d6  ret
We just added 8 pairs of numbers in only 5 instructions! Let's try adding a Drop implementation.
impl Drop for MySimd {
    fn drop(&mut self) {
        // splat(0) is roughly equivalent to `[0u16; 8]`
        self.0 &= Simd::splat(0);
    }
}
But oh no! The generated assembly is identical! My drop code was completely ignored 😫.
0000000000000000 <ltmp0>:
    0: 00 00 c0 3d  ldr q0, [x0]
    4: 21 00 c0 3d  ldr q1, [x1]
    8: 20 84 60 4e  add.8h v0, v1, v0
    c: 00 01 80 3d  str q0, [x8]
   10: c0 03 5f d6  ret
Using unsafe to be safe!?
Ironically, the only way we can make this code safely and correctly zero memory that may contain sensitive data is to use some unsafe operations. The Zeroize crate itself uses two approaches to avoid compiler optimizations removing zeroizing code. I'll use them both here:
use core::{ptr, sync::atomic};

impl Drop for MySimd {
    fn drop(&mut self) {
        unsafe {
            ptr::write_volatile(self, core::mem::zeroed())
        };
        atomic::compiler_fence(atomic::Ordering::SeqCst);
    }
}
Before explaining what's going on, let's first see if it works.
0000000000000000 <ltmp0>:
    0: 00 e4 00 6f  movi.2d v0, #0000000000000000
    4: 00 00 80 3d  str q0, [x0]
    8: c0 03 5f d6  ret

000000000000000c <_consume_and_add>:
    c: 00 00 c0 3d  ldr q0, [x0]
   10: 21 00 c0 3d  ldr q1, [x1]
   14: 20 84 60 4e  add.8h v0, v1, v0
   18: 00 01 80 3d  str q0, [x8]
   1c: 00 e4 00 6f  movi.2d v0, #0000000000000000
   20: 20 00 80 3d  str q0, [x1]
   24: 00 00 80 3d  str q0, [x0]
   28: c0 03 5f d6  ret
The two functions represent the consume_and_add method on MySimd and the drop method in the Drop trait. The top function, confusingly labeled ltmp0 (I'm still not sure why), is the Drop code and it contains:
    0: 00 e4 00 6f  movi.2d v0, #0000000000000000
    4: 00 00 80 3d  str q0, [x0]
The first instruction fills the vector register v0 with zeros and the second stores that register (q0 is the 128-bit view of v0) into the memory pointed to by x0, which is the value being dropped. You can also see that the same code has been inlined into the consume_and_add function: the sum is written to the return location in x8 and then zeros are stored to both x1 and x0, because both arguments are consumed by the method and dropped when it returns.
So, what's going on here?
Firstly, we're using write_volatile to reliably zero the target memory. Volatile writes are guaranteed not to be elided by the compiler or reordered relative to other volatile operations, so the compiler can't mess with it! Unfortunately, the function is unsafe, but it's the only way to reliably zero the data.
Secondly, we're using what's called an atomic compiler fence, which tells the compiler it is not allowed to reorder memory reads and writes across that point. It doesn't prevent the CPU from doing so in hardware, though; that is a post for another day.
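The same two steps can be tried on stable Rust without portable SIMD. The sketch below uses a hypothetical Lanes type (a plain [u16; 8] wrapper standing in for Simd<u16, 8>) and a made-up wipe method; it is an illustration of the technique under those assumptions, not the Zeroize crate's actual implementation.

```rust
use core::ptr;
use core::sync::atomic::{compiler_fence, Ordering};

/// Hypothetical stable stand-in for `Simd<u16, 8>`: eight 16-bit lanes.
pub struct Lanes(pub [u16; 8]);

impl Lanes {
    /// Overwrites every lane with zero using a single volatile write.
    pub fn wipe(&mut self) {
        // Step 1: a volatile write the compiler is not allowed to elide.
        // An all-zero bit pattern is a valid `[u16; 8]`, so this is sound.
        unsafe { ptr::write_volatile(&mut self.0, [0u16; 8]) };
        // Step 2: stop the compiler moving memory accesses across this point.
        compiler_fence(Ordering::SeqCst);
    }
}
```

Writing the whole value in one volatile operation mirrors the write_volatile(self, core::mem::zeroed()) call above. Note that core::mem::zeroed is only appropriate for types where all-zero bytes are a valid value, which holds for integer arrays and SIMD vectors but not for references, for example.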
Implementing Zeroize
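Instead of putting the zeroing logic directly in Drop, I can move it into a custom Zeroize implementation and call that from Drop. Note that ZeroizeOnDrop is only a marker trait: implementing it by hand does not generate a Drop implementation (only the derive we used earlier does that), so a Drop implementation is still needed here.
impl Zeroize for MySimd {
    fn zeroize(&mut self) {
        unsafe {
            ptr::write_volatile(self, core::mem::zeroed())
        };
        atomic::compiler_fence(atomic::Ordering::SeqCst);
    }
}

impl Drop for MySimd {
    fn drop(&mut self) {
        self.zeroize();
    }
}

// Marker only: signals that MySimd zeroizes itself on drop
impl ZeroizeOnDrop for MySimd {}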
Better, safer code
While you may not have this exact problem in your day-to-day code, understanding what's happening under the hood can be instructive and will hopefully lead to better and safer code.
:wq