Async Rust never left the MVP state

Original link: https://tweedegolf.nl/en/blog/237/async-rust-never-left-the-mvp-state

## Async Rust Code Bloat and Compiler Optimization

This article takes a deep look at the problem of "async code bloat" in Rust: the binaries generated for async code are unexpectedly large, which is especially problematic in resource-constrained environments such as microcontrollers. While async Rust offers powerful concurrency, it does not always deliver the zero-cost abstraction it promises. The author proposes fixing this at the compiler level and has submitted a Project Goal to fund the work.

The current bloat comes from unnecessarily complex state machines being generated even for simple async functions. Specifically, the generated state machine includes Returned and Panicked states, which introduce overhead in Future::poll. The author suggests replacing the panic with Poll::Pending in release mode, yielding a 2-5% size reduction in embedded firmware. Further optimizations include eliminating the state machine for async blocks with no await calls (saving 0.2%) and inlining futures to avoid creating redundant state machines. Collapsing identical code paths within a state machine also offers a significant opportunity for size reduction. Combined, these changes show potential for improving performance and shrinking binaries. The author is seeking funding (estimated at €30,000) to implement these optimizations directly in the Rust compiler and welcomes interested organizations to get in touch. ([https://rust-lang.github.io/rust-project-goals/2026/async-statemachine-optimisation.html](https://rust-lang.github.io/rust-project-goals/2026/async-statemachine-optimisation.html))

Hacker News: Async Rust never left the MVP state (tweedegolf.nl), 30 points, submitted by pjmlp, 1 comment.

hmry: Good article! I love this kind of deep dive into optimizations. I hope the Project Goal works out! My impression has been that the compiler often doesn't put much effort into optimizing the "simple" cases. The title is a bit clickbaity, though; I would also have clicked on something like "Async Rust optimizations the compiler still misses".

Original article

I've previously explained async bloat and some work-arounds for it, but would much prefer to solve the issue at the root, in the compiler. I've submitted a Project Goal, and am looking for help to fund the effort.

I love me some async Rust! It's amazing how we can write executor agnostic code that can run concurrently on huge servers and tiny microcontrollers.

But especially on those tiny microcontrollers we notice that async Rust is far from the zero-cost abstraction we were promised. That's because every byte of binary size counts, and async introduces a lot of bloat. This bloat exists on desktops and servers as well, but it's much less noticeable when you have substantially more memory and compute available.

I've previously explained some work-arounds for this issue, but would much prefer to get to the root of the problem, and work on improving async bloat in the compiler. As such I have submitted a Project Goal.

This is part 2 of my blog series on this topic. See part 1 for the initial exploration of the topic and what you can do when writing async code to avoid some of the bloat. In this second part we'll dive into the internals and translate the methods of blog 1 into optimizations for the compiler.

What I won't be talking about is the often-discussed problem of futures becoming bigger than necessary and doing a lot of copying. People are aware of that already. In fact, there is an open PR that tackles part of it: https://github.com/rust-lang/rust/pull/135527

Anatomy of a generated future

We're going to be looking at this code:

fn foo() -> impl Future<Output = i32> {
    async { 5 }
}

fn bar() -> impl Future<Output = i32> {
    async {
        foo().await + foo().await
    }
}

godbolt

We're using the desugared syntax for futures because it's easier to see what's happening.
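
For comparison, the same two functions written with the sugared syntax would look like this (my own restatement; an async fn is sugar for a function returning impl Future whose body is wrapped in an async block):

async fn foo() -> i32 {
    5
}

async fn bar() -> i32 {
    foo().await + foo().await
}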

So what does the bar future look like?

There are two await points, so the state machine must have at least two states, right?

Well, yes. But there's more.

Luckily we can ask the compiler to dump MIR for us at various passes. An interesting pass is the coroutine_resume pass. This is the last async-specific MIR pass. Why is this important? Well, async is a language feature that still exists in MIR, but not in LLVM IR. So the transformation of async to state machine happens as a MIR pass.

The bar function generates 360 lines of MIR. Pretty crazy, right? Although this gets optimized somewhat later on, the non-async version uses only 23 lines for this.

The compiler also outputs the CoroutineLayout. It's basically an enum with these states (comments my own):

variant_fields: {
    Unresumed(0): [], // Starting state
    Returned (1): [],
    Panicked (2): [],
    Suspend0 (3): [_s1], // At await point 1, _s1 = the foo future
    Suspend1 (4): [_s0, _s2], // At await point 2, _s0 = result of _s1, _s2 = the second foo future
},
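
To make that layout concrete, here is my own hand-written restatement of it as a Rust enum (not compiler output; FooFut stands in for the anonymous future returned by foo, and the names are illustrative):

struct FooFut; // stand-in for the anonymous future returned by foo()

enum BarState {
    Unresumed,                        // starting state
    Returned,                         // more on this below
    Panicked,                         // more on this below
    Suspend0 { s1: FooFut },          // at await point 1, holding the first foo future
    Suspend1 { s0: i32, s2: FooFut }, // at await point 2, keeping the first result alive
}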

So what are Returned and Panicked?

Well, Future::poll is a safe function. Calling it must not induce any UB, even when the future is done. So after Suspend1 the future returns Ready and is moved to the Returned state. Once polled again in that state, the poll function will panic.

The Panicked state exists so that after the body of an async fn has panicked and that panic was caught (for example with catch_unwind), the future can't be polled anymore. Polling a future in the Panicked state will panic. If this mechanism wasn't there, we could poll the future again after a panic. But the future may be in an incomplete state, so that could cause UB. This mechanism is very similar to mutex poisoning.

(I'm 90% sure I'm correct about the Panicked state, but I can't really find any docs that actually describe this.)

Cool, this seems reasonable.

Why panic?

But is it reasonable? Futures polled in the Returned state will panic. But they don't have to. The only thing we must not do is cause UB.

Panics are relatively expensive. They introduce a path with a side-effect that's not easily optimized out. What if instead, we just return Pending again? Nothing unsafe going on, so we fulfill the contract of the Future type.
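
As an illustration, here is a hand-written future that does exactly that (my own sketch, not what the compiler currently generates): it returns Ready once, and on any later poll it just returns Pending again instead of panicking.

use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll};

enum LenientFut {
    Unresumed,
    Returned,
}

impl Future for LenientFut {
    type Output = i32;

    fn poll(self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<i32> {
        // LenientFut has no self-referential state, so it is Unpin and get_mut is fine
        let this = self.get_mut();
        match this {
            LenientFut::Unresumed => {
                *this = LenientFut::Returned;
                Poll::Ready(5)
            }
            // A compliant executor never polls again after Ready; if one does,
            // returning Pending is sound and needs no panic machinery.
            LenientFut::Returned => Poll::Pending,
        }
    }
}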

I've hacked this in the compiler to try it out and saw a 2%-5% reduction in binary size for async embedded firmware.

So I propose this should be a switch, just like overflow-checks = false is for integer overflow. In debug builds it would still panic so that wrong behavior is immediately visible, but in release builds we get smaller futures.

Similarly, when panic=abort is used, we might be able to get rid of the Panicked state altogether. I want to look into the repercussions of that.

Always a state machine

We've looked at bar, but not yet at foo.

fn foo() -> impl Future<Output = i32> {
    async { 5 }
}

Let's implement it manually, to see what the optimal solution would be.

struct FooFut;
impl Future for FooFut {
    type Output = i32;
    
    fn poll(self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<Self::Output> {
        Poll::Ready(5)
    }
}

Easy right? We don't need any state. We just return the number.

Let's see what the generated MIR is for the version the compiler gives us:

// MIR for `foo::{closure#0}` 0 coroutine_resume
/* coroutine_layout = CoroutineLayout {
    field_tys: {},
    variant_fields: {
        Unresumed(0): [],
        Returned (1): [],
        Panicked (2): [],
    },
    storage_conflicts: BitMatrix(0x0) {},
} */

fn foo::{closure#0}(_1: Pin<&mut {async block@src\main.rs:5:5: 5:10}>, _2: &mut Context<'_>) -> Poll<i32> {
    debug _task_context => _2;
    let mut _0: core::task::Poll<i32>;
    let mut _3: i32;
    let mut _4: u32;
    let mut _5: &mut {async block@src\main.rs:5:5: 5:10};

    bb0: {
        _5 = copy (_1.0: &mut {async block@src\main.rs:5:5: 5:10});
        _4 = discriminant((*_5));
        switchInt(move _4) -> [0: bb1, 1: bb4, otherwise: bb5];
    }

    bb1: {
        _3 = const 5_i32;
        goto -> bb3;
    }

    bb2: {
        _0 = Poll::<i32>::Ready(move _3);
        discriminant((*_5)) = 1;
        return;
    }

    bb3: {
        goto -> bb2;
    }

    bb4: {
        assert(const false, "`async fn` resumed after completion") -> [success: bb4, unwind unreachable];
    }

    bb5: {
        unreachable;
    }
}

Yikes! That's a lot of code!

Notice in the coroutine_layout comment that we still have the 3 default states, and in bb0 that we're still switching on the discriminant. There's a big optimization opportunity here that we're not using, i.e. to have no states and always return Poll::Ready(5) on every poll.

I've also hacked this in the compiler and it saved 0.2% of binary size. Not a lot, but it's quite a simple optimization, so likely still worthwhile.

It does change the behavior a bit, but only for executors that aren't compliant and poll a future after completion: the future then simply keeps returning Ready, whereas the compiler's current behavior is to panic on any subsequent poll.

LLVM to the rescue?

Ok, so the MIR output isn't great. But LLVM will pick up the pieces right?

Well, sometimes, yeah. But only when the futures are simple enough and you're running opt-level=3. If the future grows too complex (which happens fast because futures nest very deeply in idiomatic async Rust code) or you're optimizing for size (which we often do with embedded or wasm), LLVM doesn't optimize this all away.

Here's an example in godbolt: https://godbolt.org/z/58ahb3nne

If you look through the generated assembly, you'll notice that it does know that foo returns 5, but that it doesn't optimize the answer of bar to be 10. The poll function of foo is also still called. This is done because of the potential panic the compiler can't fully account for. It doesn't realize foo is only called once and won't ever panic in practice.

If we comment out the panicking branch in the IR, we see it gets optimized better: https://godbolt.org/z/38KqjsY8E

Sadly, LLVM is not our savior here. We really do need to give it good inputs.

It does better with opt-level=3, but eventually can't keep up either when the code gets less trivial. And that's because we're relying on LLVM to spot that it should optimize out the things we're asking it to do.

Futures aren't (trivially) inlined

Inlining is great since it enables further optimization passes. Sadly, generated futures are never inlined at the MIR level: each future first gets its own state machine, and only afterwards do LLVM and the linker get a chance to inline. But as we've seen above, that's too late.

The prime opportunity for inlining is this:

async fn foo(blah: SomeType) -> OtherType {
    // ...
}

async fn bar(blah: SomeType) -> OtherType {
    foo(blah).await
}

This is a pattern that happens a lot when creating abstractions using traits. With the current compiler, bar gets its own state machine that calls the foo state machine, which is very wasteful. Instead, bar could also become foo by just returning the foo future.
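
Hand-written, that wrapper-free version is trivial (my own illustration; SomeType and OtherType are placeholder types):

use std::future::Future;

struct SomeType;
struct OtherType;

async fn foo(_blah: SomeType) -> OtherType {
    OtherType
}

// What an inlined bar could conceptually lower to: no state machine of its own,
// it just hands back foo's future.
fn bar(blah: SomeType) -> impl Future<Output = OtherType> {
    foo(blah)
}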

Things get a little more difficult when we add a preamble and a postamble to the example.

async fn foo(blah: bool) -> i32 {
    // ...
}

async fn bar(input: u32) -> i32 {
    let blah = input > 10; // Preamble
    let result = foo(blah).await;
    result * 2 // Postamble
}

This pattern is common when translating an async function from one signature to another, which happens for trait impls.

Note that bar doesn't need any async state of its own here either. No data is kept over the single await point that isn't captured by foo. bar can't simply become foo, but we can mostly rely on the state of foo. The manual implementation would be something like:

enum BarFut {
    Unresumed { input: u32 },
    Inlined { foo: FooFut },
}
impl Future for BarFut {
    type Output = i32;

    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output> {
        // Ignoring pin projection here; assume FooFut (and thus BarFut) is Unpin
        let this = self.get_mut();
        loop {
            match this {
                BarFut::Unresumed { input } => {
                    let blah = *input > 10; // Preamble
                    *this = BarFut::Inlined { foo: foo(blah) };
                },
                BarFut::Inlined { foo } => {
                    break Pin::new(foo)
                        .poll(cx)
                        .map(|result| result * 2); // Postamble
                },
            }
        }
    }
}

That's a lot better than what's currently generated. If only we were allowed to execute the code up to the first await point, then we could get rid of the Unresumed state. But "futures don't do anything unless polled" is guaranteed, so we can't change that.
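
That guarantee is observable, which is why the compiler isn't allowed to run the preamble when the future is created. A small self-contained example of my own:

use std::future::ready;

async fn bar(input: u32) -> i32 {
    println!("preamble runs"); // must not run when bar() is merely called
    let blah = ready(input > 10).await;
    if blah { 1 } else { 0 }
}

fn main() {
    let fut = bar(42); // nothing is printed here...
    drop(fut);         // ...and the preamble never runs, because we never polled
}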

There are more optimizations you could do with inlining if you were able to query properties of the futures you're polling. I don't think this is possible, at least not with the current architecture in rustc. Every async block is transformed individually and no data is kept about it afterwards.

For example, if you could query if a future always returns ready at the first poll, you wouldn't have to create a state for the await point in the future of the caller. If that were possible and you could apply these optimizations recursively, you could collapse a lot of futures into much simpler state machines.
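
For instance, given that foo from earlier always completes on its first poll, bar could in principle collapse all the way down to something like this (my own sketch of the end result, not something the compiler does today):

use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll};

struct BarFutCollapsed;

impl Future for BarFutCollapsed {
    type Output = i32;

    // foo().await + foo().await, with both awaits known to be immediately Ready,
    // needs no saved state at all.
    fn poll(self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<i32> {
        Poll::Ready(5 + 5)
    }
}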

I haven't tested out inlining yet, but I think this would significantly help binary size and performance.

Collapsing states

The state machine gets an extra state for each await point in the async block. But there's code where multiple states could be collapsed into 1.

Take this example:

pub async fn process_command() {
    match get_command() {
        CommandId::A => send_response(123).await,
        CommandId::B => send_response(456).await,
    }
}

It's very natural to write it that way. But what happens is that we're getting two identical states:

/* coroutine_layout = CoroutineLayout {
    field_tys: {
        _s0: CoroutineSavedTy { // Identical to _s1
            ty: Coroutine(
                DefId(0:11 ~ mir_test[b831]::send_response::{closure#0}),
                [
                    (),
                    std::future::ResumeTy,
                    (),
                    (),
                    (u32,),
                ],
            ),
            source_info: SourceInfo {
                span: src/main.rs:13:25: 13:49 (#14),
                scope: scope[0],
            },
            ignore_for_traits: false,
        },
        _s1: CoroutineSavedTy { // Identical to _s0
            ty: Coroutine(
                DefId(0:11 ~ mir_test[b831]::send_response::{closure#0}),
                [
                    (),
                    std::future::ResumeTy,
                    (),
                    (),
                    (u32,),
                ],
            ),
            source_info: SourceInfo {
                span: src/main.rs:14:25: 14:49 (#16),
                scope: scope[0],
            },
            ignore_for_traits: false,
        },
    },
    variant_fields: {
        Unresumed(0): [],
        Returned (1): [],
        Panicked (2): [],
        Suspend0 (3): [_s0], // 2 states
        Suspend1 (4): [_s1],
    },
    storage_conflicts: BitMatrix(2x2) {
        (_s0, _s0),
        (_s1, _s1),
    },
} */

The MIR for this function is 456 lines long and many basic blocks are essentially duplicates.

We can refactor the code manually to:

pub async fn process_command() {
    let response = match get_command() {
        CommandId::A => 123,
        CommandId::B => 456,
    };
    send_response(response).await;
}

Here we don't get the duplicate states:

/* coroutine_layout = CoroutineLayout {
    field_tys: {
        _s0: CoroutineSavedTy {
            ty: Coroutine(
                DefId(0:11 ~ mir_test[b831]::send_response::{closure#0}),
                [
                    (),
                    std::future::ResumeTy,
                    (),
                    (),
                    (u32,),
                ],
            ),
            source_info: SourceInfo {
                span: src/main.rs:16:5: 16:34 (#14),
                scope: scope[1],
            },
            ignore_for_traits: false,
        },
    },
    variant_fields: {
        Unresumed(0): [],
        Returned (1): [],
        Panicked (2): [],
        Suspend0 (3): [_s0],
    },
    storage_conflicts: BitMatrix(1x1) {
        (_s0, _s0),
    },
} */

The total MIR length is now 302 lines and nothing is duplicated.

So it seems like a good optimization pass to search for identical code paths and states and collapse them into one. This optimization probably stacks well with the inlining pass.
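
In hand-written terms, the pass would turn the first layout into the second (my own sketch; SendFut stands in for the anonymous future of send_response):

struct SendFut; // stand-in for send_response's anonymous future

// Before: one suspend state per await, even though both hold the same future
// type and resume into identical code.
enum Before {
    Unresumed,
    Returned,
    Panicked,
    Suspend0 { s0: SendFut },
    Suspend1 { s1: SendFut },
}

// After: the identical states are folded into one.
enum After {
    Unresumed,
    Returned,
    Panicked,
    Suspend0 { s0: SendFut },
}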

Some testing results

  • Replace the Returned state's panic with Poll::Pending: 2-5% binary size savings on embedded.
  • No state machine when there is no await: 0.2% binary size savings on embedded.
  • Together: ~3% perf increase on x86 in a synthetic benchmark with the smol executor.

Future inlining should have an even greater effect.

Ultimately it's hard to know the actual improvements until they can be benchmarked in real systems.

Summary

Hopefully this article shed some light on some of the async Rust issues!

I would love to work on these items in the compiler:

  • Returned state no longer panics in release mode
  • Async blocks without awaits should not get state machines, but just return Ready every time
  • Future inlining for futures with a single await
  • Collapse identical states

Links to my hacks:

Supporting this Project Goal

I want to work on this in the compiler and as such have submitted it as a Project Goal: https://rust-lang.github.io/rust-project-goals/2026/async-statemachine-optimisation.html

But I need your help because I can't do much without funding.

If you're with a company or organization that would benefit from this work and would be willing to (partially) fund it, please contact me at [email protected]. The scope is flexible and so is the amount of funding required. However, I have estimated that €30k could get all or at least a lot of this work done.
