Rust without the async (hard) part

May 09, 2022

Last week, two blog posts, titled Rust Is Hard, Or: The Misery of Mainstream Programming and (async) Rust doesn't have to be hard, sparked a lot of discussion on Hacker News and Reddit.

This topic is close to my heart. After all, at lunatic we are building a Rust runtime with the performance characteristics of async Rust, but without its pain points. It brings Rust back to its pre-1.0 days, when it still used user-space threads as its concurrency model.

In this blog post, I would like to compare the two approaches and show the differences in their programming models and runtime characteristics.

What problem are we solving with async Rust?

Massive concurrency. Your first reaction might be: WAIT, I don't even need massive concurrency. And you might be completely right. If you don't need it, you are even better off without it (you will be able to achieve higher throughput).

However, if you are building web apps or doing any kind of networking, the benefits of massive concurrency are almost always too important to ignore. Look at a web server: it accepts an incoming request, opens a connection to the database and waits until the database returns some data. Once it has the data, it writes it back to the client. Putting a faster CPU into the server will not help here. Most of the time we are not doing any compute, just waiting. But if we can wait on more database queries concurrently, without queuing them, we can answer more requests in less time.

This becomes even more important once we introduce real-time components to our websites. Now we have a permanent connection to the server that mostly sits idle, but occasionally pushes a notification to the user. More users looking at our site simultaneously means more concurrency on the backend.

That’s why async is almost the default choice for everyone starting a new web project.

Why can’t we “just wait”?

What is the big deal with waiting? If we aren't doing any work, why is waiting so hard? The main issue is that while we are waiting, we need to hold onto all the data required to continue the computation, like local variables.

There is a concurrency primitive that ships with every Operating System: the thread. When the Operating System switches out a thread, it preserves all CPU registers for us. Let's look at an example.

use std::io::Write;

// Blocking version: the thread simply waits inside `write_all` until
// the Operating System has accepted all the bytes.
fn write_number(mut stream: std::net::TcpStream, number: u64) {
    let number_as_bytes = number.to_le_bytes();
    stream.write_all(&number_as_bytes).unwrap();
    println!(
        "Number `{}` written as `{:?}`",
        number, number_as_bytes
    );
}

Once the execution reaches the line

stream.write_all(&number_as_bytes).unwrap();

the Operating System may take some time to write the bytes to the networking interface and is free to schedule other work on this CPU core. However, after the data is written and the computation is resumed, the local variables number and number_as_bytes are still used in the println! statement and need to “survive” the wait. So, where are they saved?

Sometimes these variables live in CPU registers, which means they are automatically saved by the OS together with all the other registers belonging to the thread. Other times there are too many variables to fit into registers, and they are saved on the stack. Each thread gets a separate stack: a memory region where the thread can put local variables and even control-flow data, like information about where to return to from the currently executing function.

How big the stack is depends on the Operating System: on Windows it's usually 1 MB by default, on Linux 8 MB. This may seem like a lot; with 1,000 threads we would reserve 8 GB just for stack space. Luckily, this is mostly virtual memory, meaning the Operating System only backs it with "real" memory in 4 KB increments (the page size) once data is actually written to it. If we ever fill up the available space, we hit the famous stack overflow panic, but in the majority of applications we never come close to it.
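For the curious, the default is only a reservation, and the standard library lets you request a different stack size per thread. A minimal sketch:

use std::thread;

fn main() {
    // Reserve a much smaller stack than the platform default.
    // 64 KB is an arbitrary illustrative value.
    let handle = thread::Builder::new()
        .stack_size(64 * 1024)
        .spawn(|| println!("running on a 64 KB stack"))
        .unwrap();
    handle.join().unwrap();
}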

So, can we use OS threads?

The problem is that threads just don't work in practice for massive concurrency. On my old MacBook Pro, the whole machine would reboot if you spawned more than 2,000 threads inside one process, and others have reported similar issues. Linux does a bit better, but I have never heard of anyone pulling off massive concurrency with Operating System threads alone.
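If you want to probe the limit on your own machine, here is a rough sketch; be warned that it may make your system unresponsive, so lower the count before running it:

use std::{thread, time::Duration};

fn main() {
    // Spawn many sleeping threads and see where the OS gives up.
    let handles: Vec<_> = (0..10_000)
        .map(|i| {
            thread::Builder::new()
                .spawn(|| thread::sleep(Duration::from_secs(60)))
                .unwrap_or_else(|e| panic!("thread {} failed to spawn: {}", i, e))
        })
        .collect();
    println!("spawned {} threads", handles.len());
}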

In the end, that's why we have so many solutions in user space: to address the shortcomings of OS threads.

What are our options?

If the Operating System is so bad at this, can we somehow do it on our own in user space: preserve local variables (the context) and schedule different tasks while we are waiting on I/O?

In the rest of this post I’m going to focus on two approaches of solving this problem in user space:

  1. Async Rust
  2. Virtual threads (lunatic)

Let's look at the previous example in async Rust and the equivalent lunatic version with virtual threads. First, the async Rust (tokio) version:

use tokio::io::AsyncWriteExt;

async fn write_number(
    mut stream: tokio::net::TcpStream,
    number: u64
) {
    let number_as_bytes = number.to_le_bytes();
    // `.await` marks the point where a context switch may happen.
    stream.write_all(&number_as_bytes).await.unwrap();
    println!(
        "Number `{}` written as `{:?}`",
        number, number_as_bytes
    );
}

And the lunatic equivalent, using virtual threads:

use std::io::Write;

fn write_number(
    mut stream: lunatic::net::TcpStream,
    number: u64
) {
    let number_as_bytes = number.to_le_bytes();
    stream.write_all(&number_as_bytes).unwrap();
    println!(
        "Number `{}` written as `{:?}`",
        number, number_as_bytes
    );
}

Both examples look similar. The lunatic one is almost indistinguishable from the original example using threads, except for the lunatic::net::TcpStream type. It also works similarly to the threaded one: it allocates a memory region for a virtual stack, where the local variables are preserved during context switches.

The async example may look syntactically similar to the threaded one, but it works completely differently under the hood. I think this is the main reason why many people consider async Rust hard: it gives us a similar syntax, but the execution model is different, making it much harder to reason about what is happening.

What happens here is that the Rust compiler analyzes the async function, concludes that there are two variables that need to be preserved if a context switch happens at the .await point, and creates a special struct with just enough space for those two variables.

The benefit of this approach is clear right away: instead of pre-reserving 1 MB of stack space to store local variables, we get a tightly packed structure using only the space required to preserve two variables. And the Rust compiler is smart: if we have multiple .await points, it will reuse this struct to preserve the minimal amount of information, depending on the access patterns after each .await point.
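You can even observe this yourself. The future returned by an async fn is an ordinary value whose size the compiler has computed, so you can print it; the exact numbers are compiler-dependent, but the difference is telling. A small sketch:

async fn nothing_held() {
    // No locals survive an .await, so the future stays tiny.
}

async fn buffer_held() {
    let buffer = [0u8; 1024];
    std::future::ready(()).await;
    println!("{}", buffer[0]); // `buffer` must survive the .await
}

fn main() {
    println!("{} bytes", std::mem::size_of_val(&nothing_held()));
    println!("{} bytes", std::mem::size_of_val(&buffer_held()));
}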

An async function has the syntax of a regular function, but it doesn't "execute" in the same way. We don't have a regular stack that grows as we call more functions; instead, the compiler creates a state machine that an async executor can drive forward.
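To make "state machine" more concrete, here is a hand-written approximation of what the compiler generates for our write_number example. This is a simplified sketch, not the actual compiler output; the pending write is stood in for by a generic future:

use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll};

// Roughly what the compiler derives: an enum with one variant per
// "paused at an .await" state, holding exactly the locals that must
// survive that pause.
enum WriteNumber<F: Future<Output = ()>> {
    // Paused at `stream.write_all(..).await`.
    Writing { write: F, number: u64, number_as_bytes: [u8; 8] },
    Done,
}

impl<F: Future<Output = ()> + Unpin> Future for WriteNumber<F> {
    type Output = ();

    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        let this = &mut *self;
        if let WriteNumber::Writing { write, number, number_as_bytes } = this {
            // Still waiting: the "locals" stay parked inside the enum,
            // no megabytes of stack required.
            if Pin::new(write).poll(cx).is_pending() {
                return Poll::Pending;
            }
            println!("Number `{}` written as `{:?}`", number, number_as_bytes);
        }
        // (In this sketch, polling after completion just returns Ready again.)
        *this = WriteNumber::Done;
        Poll::Ready(())
    }
}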

This becomes obvious in the following example, where we introduce a recursive call to the write function whenever the number is lower than 5.

use tokio::io::AsyncWriteExt;

async fn write_number_inc(
    stream: &mut tokio::net::TcpStream,
    number: u64
) {
    if number < 5 {
        // The compiler can't know how deep this recursion goes,
        // so it can't size the state machine upfront.
        write_number_inc(stream, number + 1).await;
    }

    let number_as_bytes = number.to_le_bytes();
    stream.write_all(&number_as_bytes).await.unwrap();
    println!(
        "Number `{}` written as `{:?}`",
        number, number_as_bytes
    );
}

This fails to compile with error[E0733]: recursion in an "async fn" requires boxing.

Instead, we need to do this:

Box::pin(async move {
    write_number_inc(stream, number + 1).await;
}).await;

Of course, this also fails to compile, because now we are moving the stream into a new place. How do we resolve this? It's not obvious, as TcpStream is not Clone. We probably need an Arc<Mutex> :)
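For completeness, the workaround that actually compiles is to turn the async fn into a plain function returning a boxed future, giving the recursion a known size; the async-recursion crate generates essentially this. A sketch:

use std::future::Future;
use std::pin::Pin;
use tokio::io::AsyncWriteExt;

fn write_number_inc<'a>(
    stream: &'a mut tokio::net::TcpStream,
    number: u64,
) -> Pin<Box<dyn Future<Output = ()> + 'a>> {
    Box::pin(async move {
        if number < 5 {
            // Reborrow so `stream` remains usable after the call.
            write_number_inc(&mut *stream, number + 1).await;
        }
        let number_as_bytes = number.to_le_bytes();
        stream.write_all(&number_as_bytes).await.unwrap();
        println!(
            "Number `{}` written as `{:?}`",
            number, number_as_bytes
        );
    })
}

It works, but we gave up the async fn syntax, pay for an allocation on every call, and had to reason about reborrows just to write a recursive function.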

For comparison, here is the recursive lunatic version:

use std::io::Write;

fn write_number_inc(
    stream: &mut lunatic::net::TcpStream,
    number: u64
) {
    if number < 5 {
        write_number_inc(stream, number + 1);
    }

    let number_as_bytes = number.to_le_bytes();
    stream.write_all(&number_as_bytes).unwrap();
    println!(
        "Number `{}` written as `{:?}`",
        number, number_as_bytes
    );
}

Works as you would expect.

It's easy to see why you can't have plain recursive calls in async functions: the compiler can't know in advance how many recursive calls there will be, so it can't reserve the "perfect" amount of space to preserve all local variables during context switches.

On the lunatic side, we just keep pushing new local variables onto the virtual stack until we run out of space. We don't need to determine our maximum stack usage at compile time; we can fail at runtime if we end up with too many recursive calls, similar to how OS threads work.

It was easy to explain why async functions can't be recursive, but some other limitations of async Rust are much harder to explain. Let's say we want to serialize the number we are writing to the TcpStream with the popular serialization library serde and the high-performance serializer bincode.

use tokio::io::AsyncWriteExt;

async fn write_number(
    stream: &mut tokio::net::TcpStream,
    number: u64
) {
    // bincode's serializer wants a std::io::Write sink, which async
    // streams don't provide, so we serialize into a temporary buffer.
    let temp_buff = bincode::serialize(&number).unwrap();
    stream.write_all(&temp_buff).await.unwrap();
    println!(
        "Number `{}` written as bincode.",
        number
    );
}

Serde doesn't play nicely with async Rust, and you are forced to create temporary buffers.

The same thing in lunatic:

use std::io::Write;

fn write_number(
    stream: &mut lunatic::net::TcpStream,
    number: u64
) {
    // Serialize directly into the stream, no temporary buffer needed.
    bincode::serialize_into(stream, &number).unwrap();
    println!(
        "Number `{}` written as bincode.",
        number
    );
}

Lunatic's TcpStream implements the Read & Write traits, so bincode can serialize the number directly into the stream without producing intermediate copies.
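This composability extends to the rest of the std ecosystem. As a sketch, since lunatic's TcpStream implements std::io::Write (as described above), you can wrap it in a BufWriter to batch many small writes:

use std::io::{BufWriter, Write};

fn write_numbers(stream: lunatic::net::TcpStream, numbers: &[u64]) {
    // Any std::io::Write adapter works, because it's just the Write trait.
    let mut writer = BufWriter::new(stream);
    for number in numbers {
        bincode::serialize_into(&mut writer, number).unwrap();
    }
    writer.flush().unwrap();
}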

Explaining why async Rust types don't implement the Read and Write traits, or why we can't have async functions in traits, would require a blog post of its own, so I won't go there. The complexity of this compiler transformation, and its implications for the execution model and for lifetimes inside async functions, are just hard to comprehend. There are many awesome posts written about it; go read them instead.

Workarounds?

The truth is, once we step into the async Rust world we run into a lot of limitations, and the general advice is: just work around them. Can't do zero-copy serialization? Allocate an in-between buffer. Don't have async traits? Just use a macro that allocates some memory every time the function is called. Can't satisfy the borrow checker? Just wrap the value in an Arc<Mutex>. Rust is fast enough that this usually doesn't matter.

I just think that there is a better way. If we are already sacrificing so much, why do we insist on having "perfectly" sized stacks in async Rust? An approach using virtual stacks seems to work just fine.

It's important to point out that lunatic still schedules its virtual threads in user space. As a matter of fact, it uses an async Rust executor to do so. The decisions about when a task gets switched out and which tasks are ready to run are made the same way in lunatic and in async Rust. The only differences are the per-task "stack" size and the fact that lunatic can't determine the maximum stack size upfront.

(This is not completely true: lunatic also inserts preemption points into the code, so that long-running compute workloads don't block the scheduler. That is something the developer has to handle manually when working with async Rust.)
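In async Rust, that manual handling means sprinkling yield points into compute-heavy loops yourself. A sketch using tokio's yield_now:

async fn checksum(data: &[u64]) -> u64 {
    let mut sum = 0u64;
    for (i, x) in data.iter().enumerate() {
        sum = sum.wrapping_add(*x);
        // Periodically hand control back to the scheduler so that
        // other tasks on this executor thread can make progress.
        if i % 10_000 == 0 {
            tokio::task::yield_now().await;
        }
    }
    sum
}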

Conclusion

Massive concurrency is a necessary part of modern web and networking apps. With this post I wanted to present an alternative to async Rust for achieving it. It's not really a novel idea; Rust used to work this way in the past, but ultimately went in another direction.

The effort that went into making async Rust was huge. It pushes the type system to the limit, and I believe it's pretty well designed. In the end, that's what actually allowed me to build the lunatic runtime on top of it. When you work on such low-level, performance-critical code, you need to completely understand what is happening in the background, and the errors thrown at you by the compiler can be great at guiding you in the right direction.

However, there is another side to Rust: a beautifully designed, somewhat high-level language with an awesome type system and an ecosystem of composable libraries. I feel that this ecosystem would benefit much more from a virtual-thread approach to massive concurrency, where you still get to use the Read & Write traits to compose things together and don't need to build a complex mental model of execution.

