A race condition in Aurora RDS

Original link: https://hightouch.com/blog/uncovering-a-race-condition-in-aurora-rds

## Hightouch and the AWS Outages: The Story of a Race Condition

Hightouch, a company focused on syncing user behavioral data, was hit by two AWS incidents in October that affected its Events product. The first outage (us-east-1, October 20) overloaded their system with a backlog of events, prompting a plan to add database capacity via Amazon Aurora RDS. The subsequent infrastructure upgrade (October 23) then failed repeatedly because of a race condition inside Aurora. The failover, which should have been quick, unexpectedly reversed itself and left the database inaccessible. The investigation, aided by detailed logging and monitoring, revealed that Aurora had briefly allowed writes to both the old and the new primary instance, causing crashes. AWS confirmed that the bug stemmed from a signaling issue during writer demotion and was unrelated to Hightouch’s configuration. The workaround is to pause all writes during an intentional failover. After applying this mitigation, Hightouch completed the upgrade successfully, underscoring the importance of thorough migration preparation, strong observability, and the recognition that staging environments may not fully replicate production conditions. They have since updated their runbooks and monitoring to prevent future outages.

## Summary of the Aurora RDS Failover Issue

A recent post details a race condition in Aurora RDS (PostgreSQL) in which a manually triggered failover can fail if write traffic is not stopped during the process. The author ran into the issue, and AWS support confirmed it is not unique to their workload. The core problem appears to be a conflict between the failover mechanism and in-flight write transactions, possibly triggered by a timeout. While the storage layer prevents data corruption, the failover itself cannot complete. Many commenters were surprised that the issue is not more widespread or treated as a higher priority by AWS, and questioned why others have not run into it. The discussion highlighted several possible reasons for the lack of broader awareness: available workarounds (restarting, pausing writes), the difficulty of reproducing the problem, and the challenge of escalating complex issues to a cloud vendor. Several users shared experiences with similarly hard-to-diagnose problems at other cloud providers and the high cost of the dedicated support needed to resolve them. There was also debate about the performance and cost trade-offs between Aurora and a standard RDS PostgreSQL setup.

Original article

Much of the developer world is familiar with the AWS outage in us-east-1 that occurred on October 20th due to a race condition bug inside a DNS management service. The backlog of events we needed to process from that outage on the 20th stretched our system to the limits, and so we decided to increase our headroom for event handling throughput. When we attempted that infrastructure upgrade on October 23rd, we ran into yet another race condition bug in Aurora RDS. This is the story of how we figured out it was an AWS bug (later confirmed by AWS) and what we learned.

Background

The Hightouch Events product enables organizations to gather and centralize user behavioral data such as page views, clicks, and purchases. Customers can set up syncs to load events into a cloud data warehouse for analytics or stream them directly to marketing, operational, and analytics tools to support real-time personalization use cases.

Here is the portion of Hightouch’s architecture dedicated to our events system:

Hightouch events system architecture

Our system scales on three levers: Kubernetes clusters that contain event collectors and batch workers, Kafka for event processing, and Postgres as our virtual queue metadata store.

When our pagers went off during the AWS outage on the 20th, we observed:

  • Services were unable to connect to Kafka brokers managed by AWS MSK.
  • Services struggled to autoscale because we couldn’t provision new EC2 nodes.
  • Customer functions for real-time data transformation were unavailable due to AWS STS errors, which caused our retry queues to balloon in size.

Kafka’s durability meant that no events were dropped once they were accepted by the collectors, but there was a massive backlog to process. Syncs with consistently high traffic, or with enrichments that needed to call slower third-party services, took longer to catch up and were testing the limits of our (small) Postgres instance’s ability to act as a queue for the batch metadata.

As an aside, at Hightouch, we start with Postgres where we can. Postgres queues serve our non-events architecture well at ~1M syncs/day and for events scaled to 500K events per second at ~1s end-to-end latency on a small Aurora instance.
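Our exact schema isn’t shown here, but the general pattern is the common `SELECT ... FOR UPDATE SKIP LOCKED` work queue. Below is a minimal TypeScript sketch using the `pg` client; the table and column names are hypothetical.

```typescript
// Minimal sketch of a Postgres-backed work queue using FOR UPDATE SKIP LOCKED.
// Table and column names are hypothetical, not our actual schema.
import { Pool } from "pg";

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

// Claim one pending batch, process it, and mark it done, all in one transaction.
// SKIP LOCKED lets many workers poll concurrently without blocking each other.
async function claimAndProcessBatch(process: (payload: unknown) => Promise<void>): Promise<boolean> {
  const client = await pool.connect();
  try {
    await client.query("BEGIN");
    const { rows } = await client.query(
      `SELECT id, payload
         FROM batch_queue
        WHERE status = 'pending'
        ORDER BY enqueued_at
        FOR UPDATE SKIP LOCKED
        LIMIT 1`
    );
    if (rows.length === 0) {
      await client.query("ROLLBACK");
      return false; // nothing to do right now
    }
    await process(rows[0].payload);
    await client.query("UPDATE batch_queue SET status = 'done' WHERE id = $1", [rows[0].id]);
    await client.query("COMMIT");
    return true;
  } catch (err) {
    await client.query("ROLLBACK");
    throw err;
  } finally {
    client.release();
  }
}
```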

After observing the events on the 20th, we wanted to upsize the DB to give us more headroom. Given that Aurora supports fast failovers for scaling up instances, we decided to proceed with an upgrade on Oct 23rd without a scheduled maintenance window.

AWS Aurora RDS

The central datastore for real-time streaming and warehouse delivery of customer events uses Amazon Aurora PostgreSQL.

Aurora's architecture differs from traditional PostgreSQL in a crucial way: it separates compute from storage. An Aurora cluster consists of:

  • One primary writer instance that handles all write operations
  • Multiple read replica instances that handle read-only queries
  • A shared storage layer that all instances access, automatically replicated across multiple availability zones

This architecture enables fast failovers and efficient read scaling, but as we'd discover, it also introduces unique failure modes.
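In practice this split shows up in how clients connect: writes go through the cluster (writer) endpoint, while read-only queries can use the reader endpoint. Here is a minimal sketch with the `pg` client; the hostnames and table are placeholders, not our actual setup.

```typescript
import { Pool } from "pg";

// The cluster endpoint always resolves to the current writer; the reader
// endpoint load-balances across replicas. Hostnames below are placeholders.
const writerPool = new Pool({
  host: "my-cluster.cluster-abc123.us-east-1.rds.amazonaws.com", // cluster (writer) endpoint
  database: "events",
  max: 20,
});

const readerPool = new Pool({
  host: "my-cluster.cluster-ro-abc123.us-east-1.rds.amazonaws.com", // reader endpoint
  database: "events",
  max: 20,
});

// Writes and transactional reads go through the writer pool...
export const recordBatch = (id: string) =>
  writerPool.query("UPDATE batches SET status = 'recorded' WHERE id = $1", [id]);

// ...while read-only lookups can use the reader pool.
export const getBatch = (id: string) =>
  readerPool.query("SELECT * FROM batches WHERE id = $1", [id]);
```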

A failover is the process of promoting a read replica to become the new primary writer. It is typically done automatically when the primary fails, or triggered manually for maintenance operations like ours; a programmatic sketch follows the list below. When you trigger a failover in the AWS console:

  1. Aurora designates a read replica as the new primary
  2. The storage layer grants write privileges to the new primary
  3. The cluster endpoint points to the new writer
  4. The old primary becomes a read replica (if it's still healthy)
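The same failover can also be triggered programmatically. A rough sketch with the AWS SDK for JavaScript v3 (`@aws-sdk/client-rds`), using hypothetical identifiers:

```typescript
import { RDSClient, FailoverDBClusterCommand } from "@aws-sdk/client-rds";

const rds = new RDSClient({ region: "us-east-1" });

// Promote a specific reader to be the new writer. Identifiers are hypothetical.
async function failoverCluster(): Promise<void> {
  await rds.send(
    new FailoverDBClusterCommand({
      DBClusterIdentifier: "events-cluster",
      // Optional: the instance to promote; otherwise Aurora chooses by promotion tier.
      TargetDBInstanceIdentifier: "events-instance-2",
    })
  );
}
```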

The diagram below explains how Hightouch Events uses Aurora.

How Hightouch Events uses Aurora

The Plan

This was our upgrade plan:

  1. Add another read replica (instance #3) to maintain read capacity during the upgrade.
  2. Upgrade the existing reader (instance #2) to the target size and give it the highest failover priority.
  3. Trigger a failover to promote instance #2 as the new writer (expected downtime less than 15s, handled gracefully by our backend).
  4. Upgrade the old writer (instance #1) to match the size and make it a reader.
  5. Remove the temporary extra reader (instance #3).

The AWS docs supported this approach and we had already tested the process successfully in a staging environment while performing a load test, so we were confident in the correctness of the procedure.
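For concreteness, steps 1 and 2 map onto plain RDS API calls. A rough sketch with the AWS SDK for JavaScript v3, using hypothetical identifiers and instance sizes:

```typescript
import {
  RDSClient,
  CreateDBInstanceCommand,
  ModifyDBInstanceCommand,
} from "@aws-sdk/client-rds";

const rds = new RDSClient({ region: "us-east-1" });

async function prepareUpgrade(): Promise<void> {
  // Step 1: add a temporary reader (instance #3) to preserve read capacity.
  await rds.send(
    new CreateDBInstanceCommand({
      DBInstanceIdentifier: "events-instance-3",
      DBClusterIdentifier: "events-cluster",
      Engine: "aurora-postgresql",
      DBInstanceClass: "db.r6g.xlarge", // hypothetical size
      PromotionTier: 15, // lowest priority; we never want this one promoted
    })
  );

  // Step 2: upsize the existing reader (instance #2) and give it the highest
  // failover priority so the upcoming failover promotes it.
  await rds.send(
    new ModifyDBInstanceCommand({
      DBInstanceIdentifier: "events-instance-2",
      DBInstanceClass: "db.r6g.2xlarge", // hypothetical target size
      PromotionTier: 0, // highest priority
      ApplyImmediately: true,
    })
  );
}
```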

The Upgrade Attempt

At 16:39 EDT on October 23, 2025, we triggered the failover to the newly-upgraded instance #2. The AWS Console showed the typical progression: parameter adjustments, instance restarts, the usual status updates.

Then the page refreshed. Instance #1, the original writer, was still the primary. The failover had reversed itself.

According to AWS, everything was fine: the cluster appeared healthy across the board. But our backend services couldn't execute write queries. Restarting the services cleared the errors and restored normal operation, but the upgrade had failed.

We tried again at 16:43. Same result: brief promotion followed by immediate reversal.

Two failed failovers in five minutes. Nothing else had changed - no code updates, no unusual queries, no traffic spikes. We had successfully tested this exact procedure in a staging environment under load earlier in the day. We checked our process to see if we had made any mistakes. We searched online to see if anyone else had encountered this issue but found nothing. Nothing obvious could explain why Aurora was refusing to complete the failover in this cluster. We were perplexed.

The Investigation

We first checked database metrics for anything unusual. There was a spike in connection count, network traffic, and commit throughput to the read replica (instance #2) during the failover.

The higher commit throughput could have been due to replication or the execution of write queries. The other two metrics simply indicated a higher query volume.

We checked the read query traffic from the app (graph below) and found no change during this period. This told us the extra traffic to instance #2 came from our backend services, which are supposed to connect to the writer instance.

Query traffic from the Hightouch app

When we looked at the backend application logs, we found this error in some pods: "DatabaseError: cannot execute UPDATE in a read-only transaction".

Backend application logs

Our services do not connect directly to the writer instance, but rather to the cluster endpoint, which points to the writer. This could mean one of three things (a quick check that helps narrow them down is sketched after the list):

  1. The pods did not get the signal that the writer had changed - i.e. the cluster did not terminate the connection.
  2. The cluster endpoint incorrectly pointed to a reader instance.
  3. The pod was connected to the writer, but the write operation was rejected at runtime.
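One quick check is to ask, over the same connection path, whether the instance we reach believes it is a writer: a reader reports being in recovery and read-only, while possibility #3 would show a writer that nonetheless rejects the write. A minimal sketch with the `pg` client, using a placeholder cluster endpoint:

```typescript
import { Pool } from "pg";

const clusterPool = new Pool({
  host: "my-cluster.cluster-abc123.us-east-1.rds.amazonaws.com", // cluster (writer) endpoint
  database: "events",
});

// On Aurora PostgreSQL, reader instances typically report pg_is_in_recovery() = true
// and run sessions in read-only mode, so this tells us whether we are actually
// talking to the writer.
async function describeConnectedInstance() {
  const { rows } = await clusterPool.query(
    `SELECT pg_is_in_recovery() AS in_recovery,
            current_setting('transaction_read_only') AS read_only`
  );
  console.log(rows[0]); // e.g. { in_recovery: false, read_only: 'off' } on the writer
  return rows[0];
}
```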

We did not find any evidence supporting or disproving #1 in the application logs. We had a strong suspicion it was either #2 or #3. We downloaded the database logs to take a closer look and found something interesting. In both the promoted reader and the original writer, we found the same sequence of logs:

2025-10-23 20:38:58 UTC::@:[569]:LOG:  starting PostgreSQL...
...
...
...
LOG:  database system is ready to accept connections
LOG:  server process (PID 799) was terminated by signal 9: Killed
DETAIL:  Failed process was running: <write query from backend application>
LOG:  terminating any other active server processes
FATAL:  Can't handle storage runtime process crash
LOG:  database system is shut down

This led us to a hypothesis:

During the failover window, Aurora briefly allowed both instances to process writes. The distributed storage layer rejected the concurrent write operations, causing both instances to crash.

We expect Aurora’s failover orchestration to do something like this:

  1. Stop accepting new writes. Clients can expect connection errors until the failover completes.
  2. Finish processing in-flight write requests.
  3. Demote the writer and simultaneously promote the reader.
  4. Accept new write requests on the new writer.

There was clearly a race condition between steps 3 & 4.

Testing the Hypothesis

To validate the theory, we performed a controlled failover attempt. This time:

  1. We scaled down all services that write to the database
  2. We triggered the failover again
  3. We monitored for storage runtime crashes

By eliminating concurrent writes, the failover completed successfully. This strongly reinforced the race-condition hypothesis.

AWS Confirms the Root Cause

We escalated the findings and log patterns to AWS. After an internal review, AWS confirmed that:

The root cause was an internal signaling issue in the demotion process of the old writer, which resulted in the writer being unchanged after the failover.

They also confirmed that there was nothing unique about our configuration or usage that would trigger the bug. The conditions that caused it were not under our control.

AWS has indicated a fix is on their roadmap, but as of now, the recommended mitigation aligns with our solution: use Aurora’s Failover feature on an as-needed basis and ensure that no writes are executed against the DB during the failover.
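Our actual write-pausing mechanism isn’t shown here; it can be as simple as a gate that the write path checks during a planned failover (or scaling writer deployments to zero). A minimal sketch with hypothetical names:

```typescript
// Minimal sketch of an application-level write gate for planned failovers.
// Names are hypothetical; the real mechanism could also be scaling writer
// deployments to zero or flipping a feature flag.
let writesPaused = false;

export function pauseWrites(): void {
  writesPaused = true;
}

export function resumeWrites(): void {
  writesPaused = false;
}

export async function withWriteGate<T>(write: () => Promise<T>): Promise<T> {
  // Wait briefly instead of failing, so in-flight work survives the failover window.
  const deadline = Date.now() + 60_000;
  while (writesPaused) {
    if (Date.now() > deadline) throw new Error("writes paused too long during failover");
    await new Promise((resolve) => setTimeout(resolve, 250));
  }
  return write();
}
```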

Final State

With the race condition understood and mitigated, we:

  • Successfully upsized the cluster in us-east-1
  • Updated our internal playbooks to pause writers before an intentional failover
  • Added monitoring to detect any unexpected writer role advertisement flips (see the sketch below)
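Such a monitor can be as simple as periodically asking the cluster (writer) endpoint whether it is advertising a reader. A sketch, with a placeholder endpoint and a hypothetical `emitMetric` helper:

```typescript
import { Pool } from "pg";

// Hypothetical metrics helper; stands in for whatever metrics client is in use.
function emitMetric(name: string, value: number): void {
  console.log(`${name}=${value}`);
}

const clusterPool = new Pool({
  host: "my-cluster.cluster-abc123.us-east-1.rds.amazonaws.com", // cluster (writer) endpoint
  database: "events",
});

// The cluster endpoint should never advertise an in-recovery (reader) instance;
// if it does, the writer role has flipped unexpectedly.
async function checkWriterAdvertisement(): Promise<void> {
  const { rows } = await clusterPool.query("SELECT pg_is_in_recovery() AS in_recovery");
  emitMetric("aurora.cluster_endpoint.in_recovery", rows[0].in_recovery ? 1 : 0);
}

setInterval(() => {
  checkWriterAdvertisement().catch(() => emitMetric("aurora.cluster_endpoint.check_error", 1));
}, 15_000);
```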

Takeaways

The following principles were reinforced during this experience:

  1. Prepare for the worst in any migration, even for services you trust: you could end up in your desired end state, your beginning state, or an in-between state. Ensuring you’re ready to redirect traffic and handle brief outages in dependencies will minimize downtime.
  2. The importance of good observability cannot be emphasized enough. The “brief writer advertisement” was only detectable because we were monitoring queries to each instance in Datadog and had access to database logs in RDS.
  3. For large-scale distributed systems, isolating the impact any single component can have on the rest of the system helps both uptime and maintenance. It helps a lot if the design can tolerate events like this without the whole system shutting down.
  4. Test setups are not always representative of production environments. Even though we practiced the upgrade process during a load test in a staging region, we could not reproduce the exact conditions that caused the race condition in Aurora. AWS confirmed that there was nothing specific about our traffic pattern that would trigger it.

If challenges like this investigation sound interesting, we encourage you to check out our careers page.
