Async Python Is Secretly Deterministic

Original link: https://www.dbos.dev/blog/async-python-is-secretly-deterministic

Adding async support to a Python durable execution library is challenging: durable workflows *must* be deterministic so they can recover reliably through replay. However, `asyncio`'s concurrency can introduce unpredictable execution ordering, which would break deterministic replay.

The key is understanding the `asyncio` event loop. Although it appears concurrent, it is actually single-threaded, processing tasks from a queue in first-in, first-out (FIFO) order. While the *internal* execution of each task is unpredictable, the *scheduling* of tasks created via `asyncio.gather` is deterministic.

To exploit this, the authors implemented a `@Step()` decorator. Before any `await` call, the decorator assigns each workflow step a unique, monotonically increasing ID. This ensures steps are processed in a predictable order even when launched concurrently.

This approach allows both concurrent execution *and* deterministic replay, which is essential for durable workflows. The takeaway: a deeper understanding of `asyncio`'s single-threaded nature simplifies reasoning about concurrency and enables building reliable concurrent systems.

## The Determinism of Async Python: Summary

A recent Hacker News discussion highlighted a surprising property of Python's `asyncio` event loop: it is effectively deterministic. For nearly a decade, the standard library has consistently scheduled async functions in the order they were *created*, giving developers stable behavior. However, this is not a guaranteed feature; other event loops, such as `trio`, deliberately randomize scheduling.

While determinism aids debugging by allowing repeatable runs, it breaks down once I/O operations are involved or tasks spawn other tasks. It is valuable for predictable execution *within* a function, but it does not guarantee whole-program determinism.

The community debated whether relying on this "implementation detail" is wise, given possible future changes or alternative runtimes (e.g., faster Rust-based implementations). Some consider it a fragile dependency; others find it useful for building reliable concurrent systems, especially in frameworks like Temporal, which leverages it for durable execution and easier debugging. Ultimately, developers should be aware of this behavior and its limitations.

Original Article

When adding async support to our Python durable execution library, we ran into a fundamental challenge: durable workflows must be deterministic to enable replay-based recovery.

Making async Python workflows deterministic is difficult because they often run many steps concurrently. For example, a common pattern is to start many concurrent steps and use asyncio.gather to collect the results.
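The original post includes a code snippet here; a minimal sketch of the pattern might look like the following, where `fetch_data` is a hypothetical I/O-bound step:

```python
import asyncio

# Hypothetical stand-in for any I/O-bound workflow step.
async def fetch_data(i: int) -> int:
    await asyncio.sleep(0.01)  # simulate network or disk I/O
    return i * 2

async def workflow() -> list[int]:
    # Start all steps concurrently; gather returns results in input order.
    coros = [fetch_data(i) for i in range(5)]
    return await asyncio.gather(*coros)

results = asyncio.run(workflow())
print(results)  # [0, 2, 4, 6, 8]
```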

This is great for performance (assuming tasks are I/O-bound) as the workflow doesn’t have to wait for one step to complete before starting the next. But it’s not easy to order the workflow’s steps because those steps all run at the same time, with their executions overlapping, and they can complete in any order.

The problem is that concurrency introduces non-obvious step execution ordering. When multiple tasks run at the same time, the exact interleaving of their execution can vary. But during recovery, the workflow must be able to replay those steps deterministically, recovering completed steps from checkpoints and re-executing incomplete steps. This requires a well-defined step order that’s consistent across workflow executions.

So how do we get the best of both worlds? We want workflows that can execute steps concurrently, but still produce a deterministic execution order that can be replayed correctly during recovery. To make that possible, we need to better understand how the async Python event loop really works.

How Async Python Works

At the core of async Python is an event loop. Essentially, this is a single thread running a scheduler that executes a queue of tasks. When you call an async function, it doesn’t actually run; instead it creates a “coroutine,” a frozen function call that does not execute. To actually run an async function, you have to either await it directly (which immediately executes it, precluding concurrency) or create an async task for it (using asyncio.create_task or asyncio.gather), which schedules it on the event loop’s queue. The most common way to run many async functions concurrently is asyncio.gather, which takes in a list of coroutines, schedules each as a task, then waits for them all to complete.
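The coroutine/task distinction described above can be illustrated with a small sketch (`greet` and `main` are hypothetical names):

```python
import asyncio

async def greet() -> str:
    return "hello"

async def main() -> str:
    coro = greet()  # calling an async function only creates a coroutine object
    assert asyncio.iscoroutine(coro)  # nothing has executed yet
    task = asyncio.create_task(coro)  # now it's scheduled on the event loop's queue
    return await task  # yield control until the task completes

result = asyncio.run(main())
print(result)  # hello
```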

Even after you schedule an async function by creating a task for it, it still doesn’t execute immediately. That’s because the event loop is single-threaded: it can only run one task at a time. For a new task to run, the current task has to yield control back to the event loop by calling await on something that isn’t ready yet. As tasks yield control, the event loop scheduler works its way through the queue, running each task sequentially until it itself yields control. When an awaited operation completes, the task awaiting it is placed back in the queue to resume where it left off.

Critically, the event loop schedules newly created tasks in FIFO order. Say a list of coroutines is passed into asyncio.gather, as in the code snippet above. asyncio.gather wraps each coroutine in a task, scheduling them for execution, then yields control back to the event loop. The event loop then dequeues the task created from the first coroutine passed into asyncio.gather and runs it until it yields control. Then, the event loop dequeues the second task, then the third, and so on. The order of execution after that is completely unpredictable and depends on what the tasks are actually doing, but tasks start in a deterministic order.
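A small sketch (with a hypothetical `step` function) makes the FIFO start order visible: even though the tasks below *finish* in roughly reverse order, they always *start* in the order they were passed to asyncio.gather:

```python
import asyncio

start_order: list[int] = []

async def step(i: int) -> int:
    # Code before the first await runs when the task is first dequeued,
    # i.e. in FIFO scheduling order.
    start_order.append(i)
    await asyncio.sleep(0.01 * (5 - i))  # later tasks finish sooner
    return i

async def main() -> list[int]:
    return await asyncio.gather(*(step(i) for i in range(5)))

results = asyncio.run(main())
print(start_order)  # tasks start in FIFO order: [0, 1, 2, 3, 4]
print(results)      # gather still returns results in input order
```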

This makes it possible to deterministically order steps using code placed before the step’s first await. We can do this in the @Step() decorator, which wraps step execution. Before doing anything else, and in particular before anything that might require an await, @Step() increments and assigns a step ID from the workflow context. This way, step IDs are deterministically assigned in the exact order steps are passed into asyncio.gather: the step processed by the first task is step one, the step processed by the second task is step two, and so on.
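As a sketch of this idea (not DBOS's actual implementation; `WorkflowContext` and `work` are hypothetical names), a decorator can grab the next step ID synchronously, before any await, so IDs follow the deterministic FIFO start order:

```python
import asyncio
import functools

class WorkflowContext:
    """Hypothetical workflow context holding a monotonic step counter."""
    def __init__(self) -> None:
        self.next_step_id = 0

ctx = WorkflowContext()

def Step():
    def decorator(func):
        @functools.wraps(func)
        async def wrapper(*args, **kwargs):
            # Runs synchronously when the task first gets control, before
            # any await, so IDs are assigned in FIFO task start order.
            step_id = ctx.next_step_id
            ctx.next_step_id += 1
            result = await func(*args, **kwargs)
            return step_id, result
        return wrapper
    return decorator

@Step()
async def work(name: str) -> str:
    await asyncio.sleep(0.01)
    return name

async def main():
    return await asyncio.gather(work("a"), work("b"), work("c"))

results = asyncio.run(main())
print(results)  # IDs follow gather order: [(0, 'a'), (1, 'b'), (2, 'c')]
```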

To sum it up, when building Python libraries, it’s really important to understand the subtleties of asyncio and the event loop. While it might seem unintuitive at first, the single-threaded execution model is actually easier to reason about than parallel threads because tasks execute predictably and can only interleave when control is explicitly yielded via await. This makes it possible to write simple code that’s both concurrent and safe.

Learn More

If you like making systems reliable, we’d love to hear from you. At DBOS, our goal is to make durable workflows as easy to work with as possible. Check it out:
