PostgreSQL规划器半连接的陷阱：CTE、LIMIT和RETURNING

PostgreSQL规划器半连接的陷阱：CTE、LIMIT和RETURNING
A PostgreSQL planner semi-join gotcha with CTE, LIMIT, and RETURNING

原始链接: https://www.shayon.dev/post/2025/119/a-postgresql-planner-gotcha-with-ctes-delete-and-limit/

一个看似简单的PostgreSQL查询，使用CTE（公共表表达式）结合`DELETE ... RETURNING`和`LIMIT 1`来处理队列中的单个任务，却表现出异常行为：有时尽管使用了`LIMIT`，仍然返回多行结果。根本原因在于查询规划器选择了嵌套循环半连接（Nested Loop Semi Join）优化。它没有将CTE与`LIMIT 1`一起物化，而是针对外部扫描标识的每个候选行都执行子查询（包括`LIMIT`），导致删除了多行。这个间歇性问题取决于规划器基于统计信息和成本做出的选择。为了解决这个问题，查询被重构为直接在`DELETE`的`WHERE`子句中使用子查询：`DELETE FROM task_queue WHERE id = (SELECT id ... LIMIT 1) RETURNING item_id;` 这迫使规划器优先评估带有`LIMIT 1`的子查询。结论是：在使用CTE结合`DELETE`（或`UPDATE`）和`LIMIT`时，尤其是在依赖`RETURNING`保证原子性时，要格外小心。始终使用`EXPLAIN ANALYZE`来验证查询计划，并确保PostgreSQL按照预期执行查询，因为规划器的优化策略有时会出人意料。

这个Hacker News帖子讨论了PostgreSQL中与CTE（公共表表达式）、`LIMIT`和`RETURNING`相关的一个“陷阱”，即由于查询规划器优化子查询的方式，可能会删除意外数量的行。几位评论者对根本原因和可能的解决方法存在分歧。一些人认为问题源于子查询被执行多次，因为它是不相关的并且缺乏定义的顺序。使用`=`代替`IN`作为一种解决方法存在争议，有人担心这是偶然的，并非保证有效的修复方法。将CTE物化（`AS MATERIALIZED`）被提出作为一种更可靠的解决方案，以强制执行单次评估。讨论还涉及更广泛的SQL主题：SQL标准的局限性，查询计划对结果的影响，以及`SKIP LOCKED`的使用。此外，该帖子指出，问题源于子查询、查询规划器和无副作用函数的交互。几位参与者强调了在处理特定数据库实现时SQL标准的局限性。

优化Cloudflare D1数据库查询之旅 2025-04-07

PostgreSQL 内部结构：有关更新语句的知识 2023-12-31

（评论） 2024-02-11

Postgres 作为队列 2024-02-11

Postgres 中常见的数据库架构更改错误 2024-04-30

原文

I recently discovered an unexpected behavior in PostgreSQL involving a pattern of using a Common Table Expression (CTE) with DELETE ... RETURNING and LIMIT to process a batch of items from a queue-like table. What seemed straightforward turned out to have a surprising interaction with the query planner.

The scenario

Let’s say you have a task_queue table and want to pull exactly one task for a specific queue_group_id. A common approach is using a CTE:

-- The seemingly straightforward query
WITH deleted_tasks AS (
  DELETE FROM task_queue
  WHERE id IN (
    SELECT id FROM task_queue
    WHERE queue_group_id = 15 -- Specific group
    LIMIT 1                 -- <<< The crucial limit
    FOR UPDATE SKIP LOCKED  -- Avoid lock contention
  )
  RETURNING item_id         -- Get the ID of the item associated with the deleted task
)
SELECT item_id FROM deleted_tasks;

The intent here is clear: find one available task row (LIMIT 1), delete it, and return its associated item_id. You’d expect the RETURNING clause to give you back, at most, one item_id.

The surprise

In this case, we started seeing logs indicating that this exact query structure was sometimes returning multiple item_ids – sometimes 4, sometimes 5 – despite the explicit LIMIT 1 deep inside the query. How could that happen? If we’re telling Postgres to find and delete only one row based on the inner SELECT, how are we getting multiple rows back?

Deeper dive with EXPLAIN ANALYZE

When SQL behaves unexpectedly, EXPLAIN ANALYZE is essential. Running it on the problematic query revealed something interesting about the planner’s execution strategy:

-- Example EXPLAIN ANALYZE output showing the issue
Nested Loop Semi Join  (cost=0.00..2.06 rows=1 width=38) (actual time=0.045..0.065 rows=5 loops=1)
  Join Filter: (task_queue.id = "sub".id)
  ->  Seq Scan on task_queue  (cost=0.00..1.01 rows=1 width=14) (actual time=0.013..0.015 rows=5 loops=1)
        -- Assuming some filter or condition leads to 5 candidate rows initially
  ->  Subquery Scan on "sub"  (cost=0.00..1.03 rows=1 width=40) (actual time=0.007..0.008 rows=1 loops=5) -- <<< Subquery runs 5 times!
        ->  Limit  (cost=0.00..1.02 rows=1 width=14) (actual time=0.006..0.006 rows=1 loops=5)            -- <<< Limit runs 5 times!
              ->  Index Scan using task_queue_pkey on task_queue tasks_sub (cost=0.00..1.01 rows=1 width=14) (actual time=0.005..0.005 rows=1 loops=5)
                    Index Cond: (queue_group_id = 15)
                    -- The important part: This Limit node runs 5 times!
Planning Time: 0.123 ms
Execution Time: 0.088 ms

Look closely at the lines marked <<<. The Subquery Scan and the Limit node inside it were executed 5 times (loops=5). Instead of running the subquery with LIMIT 1 once, materializing its single result, and then using that id for the DELETE, the planner chose a different path.

Why the semi Join optimization

The planner transformed our query into a Nested Loop Semi Join. In simple terms, instead of treating the CTE/subquery as something to compute fully upfront (materialization), it decided to correlate the outer part of the DELETE (finding candidate rows in task_queue) with the inner subquery.

Here’s the catch: in this semi-join plan, the subquery (including the LIMIT 1) was executed for each candidate row found by the outer scan. If the outer scan initially identified 5 potential rows that could match the criteria (before locking and final checks), it ran the LIMIT 1 subquery 5 separate times. Each run could potentially find and lock a different row satisfying queue_group_id = 15, leading to multiple rows being deleted and returned.

❗ The query planner chose an optimization strategy (semi-join) that effectively applied the LIMIT per-candidate-row from the outer scan, not globally as we intended.

PostgreSQL CTEs are not always optimization fences. While they can sometimes force materialization, the planner often has the freedom to inline them or transform the query in other ways, as happened here.

Intermittency explained

This also explains why the issue was intermittent. The planner’s choice depends on table statistics, data distribution, costs, and internal heuristics. On some runs, it might choose a plan that materializes the subquery first (behaving as expected). On other runs, perhaps due to updated statistics or slight changes in data, it might switch to the semi-join plan, causing the unexpected behavior. This makes such issues notoriously hard to reproduce consistently.

Solution

To force the behavior we wanted (ensure the LIMIT applies globally before the DELETE), we restructured the query to avoid the CTE for the DELETE condition, using a simple subselect directly in the WHERE clause:

-- Revised query without the CTE for the DELETE condition
DELETE FROM task_queue
WHERE id = ( -- Use '=' for a single expected ID
  SELECT id FROM task_queue
  WHERE queue_group_id = 15
  LIMIT 1
  FOR UPDATE SKIP LOCKED
)
RETURNING item_id;

(Note: If you strictly need to handle the case where zero rows might be found by the subquery without erroring, using id IN (...) might still be necessary, but test the plan!)

This structure appears less ambiguous to the planner and typically encourages it to evaluate the subquery (with its LIMIT 1) first, find the single id, and then perform the delete.

Takeaway

While CTEs are a powerful tool for organizing complex queries, be cautious when using them with DELETE (or UPDATE) and LIMIT, especially when relying on the RETURNING clause for atomicity. The planner’s optimization choices might surprise you and lead to behavior counter to your intent. Always double-check your query plan with EXPLAIN ANALYZE for critical operations like this to ensure PostgreSQL is executing the query the way you expect.