Futurelock: A subtle risk in async Rust

Original link: https://rfd.shared.oxide.computer/rfd/0609

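The core problem is not bounded channels themselves but *how* Omicron uses them, particularly in combination with `send(msg).await`. A bounded channel correctly enforces FIFO order, but `send().await` creates an *unbounded* wait queue, which can lead to futurelock. The current practice of pairing a small-capacity channel (such as capacity=1) with blocking sends usually stems from a single-actor, many-clients model. A better approach is a larger-capacity channel combined with the non-blocking `try_send()`, propagating failure when the channel is full. Choosing the capacity matters: it should be large enough to handle the expected concurrency, but not so large that it overwhelms the actor. Finding the "right" size usually takes experience, much like setting timeouts. `send_timeout()` does not solve the problem, because the sender still blocks and will not be polled during a futurelock. In essence, avoid creating an unbounded queue *around* a bounded channel.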

Original text

Bounded channels are not really the issue here. Even in omicron#9259, the capacity=1 channel was basically behaving as documented and as one would expect. It woke up a sender when capacity was available, and the other senders were blocked to maintain the documented FIFO property. However, some of the patterns that we use with bounded channels are problematic on their own and, if changed, could prevent the channel from getting caught up in a futurelock.

In Omicron, we commonly use bounded channels with send(msg).await. The bound is intended to cap memory usage and provide backpressure, but using the blocking send creates a second unbounded queue: the wait queue for the channel. Instead, we could consider using a larger capacity channel plus try_send() and propagate failure from try_send().
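The non-blocking alternative can be sketched as follows. This is a minimal illustration, not code from Omicron: it uses the standard library's `std::sync::mpsc::sync_channel` as a stand-in for an async bounded channel so that it runs without extra crates, and the names `SubmitError`, `submit`, and `fill_then_overflow` are hypothetical. Tokio's `Sender::try_send` has a similar shape, with `Full` and `Closed` error variants.

```rust
use std::sync::mpsc::{sync_channel, SyncSender, TrySendError};

/// Hypothetical error type for this sketch; not taken from Omicron.
#[derive(Debug, PartialEq)]
pub enum SubmitError {
    /// The channel is at capacity. The caller decides whether to retry or fail.
    Busy,
    /// The receiving side has gone away.
    Closed,
}

/// Attempts to enqueue `msg` without waiting. A full channel becomes an
/// immediate error instead of a place in a wait queue of unknown length.
pub fn submit<T>(tx: &SyncSender<T>, msg: T) -> Result<(), SubmitError> {
    tx.try_send(msg).map_err(|err| match err {
        TrySendError::Full(_) => SubmitError::Busy,
        TrySendError::Disconnected(_) => SubmitError::Closed,
    })
}

/// Fills a channel of the given capacity, then attempts one more send.
/// Returns how many messages were accepted and the result of the extra send.
pub fn fill_then_overflow(capacity: usize) -> (usize, Result<(), SubmitError>) {
    // `_rx` stays alive for the whole function so the channel is not closed.
    let (tx, _rx) = sync_channel::<u32>(capacity);
    let mut accepted = 0;
    for i in 0..capacity as u32 {
        if submit(&tx, i).is_ok() {
            accepted += 1;
        }
    }
    let overflow = submit(&tx, 999);
    (accepted, overflow)
}
```

With this shape, the only queue is the channel's own buffer. A caller that receives `SubmitError::Busy` gets control back right away and can report the failure upward.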

As an example, when we use the actor pattern, we typically observe that there’s only one actor and potentially many clients, so there’s not much point in buffering messages in the channel. So we use capacity = 1 and let clients block in send().await. But we could instead have capacity = 16 and have clients use try_send() and propagate failure if they’re unable to send the message. The value 16 here is pretty arbitrary. You want it to be large enough to account for an expected amount of client concurrency, but not larger. If the value is too small, you’ll wind up with spurious failures when the client could have just waited a bit longer. If the value is too large, you can wind up queueing so much work that the actor is always behind (and clients are potentially even timing out at a higher level). One might observe:

Channel limits, channel limits: always wrong!

Some too short and some too long!

But as with timeouts, it’s often possible to find values that work in practice.
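The capacity trade-off in the actor example can be made concrete with a small simulation. This is a sketch under stated assumptions: it uses `std::sync::mpsc::sync_channel` and a plain thread in place of an async channel and task, and `Request`, `Client`, `run_burst`, and `CAPACITY` are hypothetical names. The actor is deliberately started only after the burst, to model an actor that is busy while clients arrive.

```rust
use std::sync::mpsc::{sync_channel, SyncSender, TrySendError};
use std::thread;

/// The arbitrary capacity used in the text.
const CAPACITY: usize = 16;

/// Hypothetical message type for this sketch.
struct Request {
    value: u64,
}

/// Hypothetical client handle that never waits for channel capacity.
struct Client {
    tx: SyncSender<Request>,
}

impl Client {
    /// Propagates `Full` to the caller instead of waiting for a free slot.
    fn request(&self, value: u64) -> Result<(), TrySendError<Request>> {
        self.tx.try_send(Request { value })
    }
}

/// Simulates `clients` requests arriving while the actor is not yet draining
/// the channel. Returns (accepted, rejected, sum of values the actor handled).
fn run_burst(clients: u64) -> (usize, usize, u64) {
    let (tx, rx) = sync_channel::<Request>(CAPACITY);
    let client = Client { tx };
    let (mut accepted, mut rejected) = (0, 0);
    for value in 0..clients {
        match client.request(value) {
            Ok(()) => accepted += 1,
            Err(TrySendError::Full(_)) => rejected += 1,
            Err(TrySendError::Disconnected(_)) => unreachable!("receiver is alive"),
        }
    }
    // Dropping the only sender closes the channel, which ends the actor loop
    // once the buffered requests have been handled.
    drop(client);
    let actor = thread::spawn(move || rx.iter().map(|req| req.value).sum::<u64>());
    let handled_sum = actor.join().expect("actor thread panicked");
    (accepted, rejected, handled_sum)
}
```

Calling `run_burst(20)` accepts the first 16 requests and rejects the remaining 4, which is the "spurious failure" side of the trade-off. Raising `CAPACITY` would accept more of the burst at the cost of a longer backlog for the actor.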

Using send_timeout() is not a mitigation because this still results in the sender blocking. It needs to be polled after the timeout expires in order to give up. But with futurelock, it will never be polled.
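The reason a timeout cannot help on its own is that a Rust future does nothing unless something polls it. The toy future below is a simplified model, not Tokio's `send_timeout()` implementation: `GiveUpAfter`, `NoopWake`, and `poll_expired_timeout` are hypothetical names, and the future is polled by hand rather than by a runtime.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};
use std::time::Instant;

/// Toy stand-in for a timed operation: it can only notice that its deadline
/// has passed when something polls it.
struct GiveUpAfter {
    deadline: Instant,
    polls: u32,
}

impl Future for GiveUpAfter {
    type Output = &'static str;

    fn poll(mut self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<Self::Output> {
        self.polls += 1;
        if Instant::now() >= self.deadline {
            Poll::Ready("timed out")
        } else {
            Poll::Pending
        }
    }
}

/// A waker that does nothing, enough to poll a future by hand.
struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Returns (polls before the manual poll, whether that poll gave up, polls after).
fn poll_expired_timeout() -> (u32, bool, u32) {
    // The deadline is already in the past by the time the future is polled.
    let mut fut = GiveUpAfter { deadline: Instant::now(), polls: 0 };
    // Nothing has polled the future yet, so it has not given up, no matter
    // how long ago the deadline passed.
    let polls_before = fut.polls;

    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    // Only an explicit poll lets the future observe the expired deadline.
    let gave_up = Pin::new(&mut fut).poll(&mut cx).is_ready();
    (polls_before, gave_up, fut.polls)
}
```

In a futurelock, the task that owns the future never performs the poll in the last step, so the expired deadline is never observed.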
