Benchmarking Postgres 17 vs. 18

Original link: https://planetscale.com/blog/benchmarking-postgres-17-vs-18

## Postgres 18 I/O Performance Benchmark: Summary

A recent benchmark explored Postgres 18's performance improvements, in particular its new `io_method` configuration option: `sync`, `worker` (the new default), and `io_uring`. Tests were run on a variety of EC2 instances with different storage types (gp3, io2, NVMe), driving a ~300 GB read-only workload with `sysbench`.

The results held a surprise: on network-attached storage (gp3, io2), Postgres 18's `sync` and `worker` modes outperformed `io_uring`, likely because of the latency of those storage types. On local NVMe drives, however, performance was much closer, with `io_uring` pulling slightly ahead at high concurrency.

Local NVMe delivered the best performance in every configuration, underscoring the importance of low-latency, high-IOPS storage. Increasing EBS provisioning (IOPS/throughput) correlated with improved QPS, but could not match NVMe.

The tests suggest `worker` is a sensible default for `io_method`: it provides asynchronous benefits without depending entirely on the `io_uring` kernel interface. While `io_uring` was not a universal winner, it showed promise in high-concurrency scenarios. Ultimately, the best I/O configuration depends on the specific workload and the underlying infrastructure.

## Postgres 17 vs. 18 Benchmark: Discussion Summary

A recent benchmark comparing PostgreSQL 17 and 18 revealed nuanced performance differences. While Postgres 18 improves in some areas (the sync, worker, and io_uring modes), the gains are not clear-cut. A key point is that the benchmark does not fully exercise PostgreSQL 18's AIO capabilities, because AIO for index scans has not been implemented yet and is expected in version 19 or 20.

The discussion emphasized that performance depends heavily on storage. Local NVMe drives showed similar performance across both versions and clearly outperformed network-attached storage. Concerns were raised about the durability of local storage in cloud environments, with redundancy solutions such as RAID or replication suggested.

Many commenters highlighted the trade-off between cloud convenience and the control and performance of bare-metal servers. The benchmark also sparked debate about statistical significance and the need for error margins in the tests. Ultimately, the consensus leaned toward prioritizing fast local storage rather than relying on a PostgreSQL version upgrade alone for performance gains.

Original article

By Ben Dicken

Postgres 18 was released a few weeks ago, and there's plenty of hype around the improvements it's bringing. Most notably, Postgres 18 introduces the io_method configuration option, allowing users more control over how disk I/O is handled.

Setting this to sync results in the same behavior as Postgres 17 and earlier versions: all I/O happens via synchronous requests.

Postgres 18 introduces two alternatives: worker and io_uring. worker (the new default) causes Postgres to use dedicated background worker processes to handle all I/O operations. io_uring is the change many are excited about for performance reasons, as it uses the Linux io_uring interface to allow all disk reads to happen asynchronously. The hope is that this can lead to significantly better I/O performance.
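
For illustration, switching between these modes looks roughly like the sketch below. The psql invocation, data directory path, and restart step are assumptions for a typical setup, not commands from the post; io_method only takes effect at server start.

# Sketch (not from the post): pick an I/O method on Postgres 18, then restart.
psql -c "ALTER SYSTEM SET io_method = 'io_uring';"   # or 'worker' (the default) or 'sync'
pg_ctl restart -D /path/to/data/directory            # io_method only changes at server start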

We conducted detailed benchmarks to compare the performance on Postgres 17 and 18. Let's see if these improvements are all they're hyped up to be.

sysbench was used to execute the benchmarks. The io_uring improvements only apply to reads, so the focus here will be on the oltp_read_only benchmark. This includes both point selects and queries that do range scans and aggregations, controlled by the --range_size argument. Though it would be interesting to benchmark write and read/write performance as well, sticking to read-only helps focus this discussion. We set the data size with TABLES=100 and SCALE=13000000, which produces a ~300 GB database (100 tables with 13 million rows each).
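
For reference, a dataset of this shape could be loaded with sysbench's prepare step along the lines below. The flags, host, database name, and credentials are assumptions for illustration, not the authors' exact invocation.

# Sketch: build ~300 GB of data (100 tables x 13M rows) with sysbench oltp_read_only.
# Host, database, user, and password values are placeholders.
sysbench oltp_read_only \
  --db-driver=pgsql \
  --pgsql-host=127.0.0.1 --pgsql-db=sbtest \
  --pgsql-user=postgres --pgsql-password=secret \
  --tables=100 --table-size=13000000 \
  --threads=8 \
  prepare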

The benchmarks were conducted on four different EC2 instance configurations:

| Instance | vCPUs | RAM | Disk | Disk type | IOPS | Throughput |
| --- | --- | --- | --- | --- | --- | --- |
| r7i.2xlarge | 8 | 64 GB | 700 GB | gp3 | 3,000 | 125 MB/s |
| r7i.2xlarge | 8 | 64 GB | 700 GB | gp3 | 10,000 | 500 MB/s |
| r7i.2xlarge | 8 | 64 GB | 700 GB | io2 | 16,000 | - |
| i7i.2xlarge | 8 | 64 GB | 1,875 GB | NVMe | 300,000 | - |

All these instances run on the same (or extremely similar) Intel CPUs. We include the i7i instance to show what Postgres is capable of with fast, local NVMe drives. This is what we use for PlanetScale Metal, where we have seen amazing performance results with Postgres 17. Many other cloud providers only offer a form of network-attached storage, whereas we provide both options.

Each server is warmed with 10 minutes of query load prior to benchmarking.

On each of these configurations, we ran the sysbench oltp_read_only benchmark for 5 minutes with each of the following settings:

  • Single connection and --range_size = 100
  • 10 connections and --range_size = 100
  • 50 connections and --range_size = 100
  • Single connection and --range_size = 10,000
  • 10 connections and --range_size = 10,000
  • 50 connections and --range_size = 10,000

Across the four instance types, this leads to a total of 24 unique 5-minute benchmark runs. These 24 configurations were run four times each! Once on Postgres 17, and once each on Postgres 18 with io_method=worker, io_method=io_uring, and io_method=sync. This makes a total of 96 benchmark runs. Go ahead, tell me I'm crazy!
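
As an illustration of what one of these runs looks like, here is a sketch of a single 5-minute, 50-connection run with 100-row ranges. Flags and connection details are assumptions rather than the exact invocation used.

# Sketch of one of the run configurations: 50 connections, range_size=100, 5 minutes,
# reporting every 10 seconds. Connection details are placeholders.
sysbench oltp_read_only \
  --db-driver=pgsql \
  --pgsql-host=127.0.0.1 --pgsql-db=sbtest \
  --pgsql-user=postgres --pgsql-password=secret \
  --tables=100 --table-size=13000000 \
  --threads=50 --range_size=100 \
  --time=300 --report-interval=10 \
  run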

This is an extremely I/O-intensive workload. The data size (300 GB) far exceeds RAM size (64 GB), so there will be significant disk accesses for the queries being executed here.

Though a single connection is an unrealistic production workload, it offers a baseline for how the different I/O settings affect straight-line performance. Let's assess the QPS we can achieve here.

Below is the average QPS for all single-connection runs where the --range_size value is set to the default of 100. This means the full read workload is a combination of point-select queries and queries that scan / aggregate 100-row ranges.
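
To make that mix concrete, these are roughly the query shapes sysbench's oltp_read_only script issues. The sbtest database/table names follow the tool's conventions, the literal IDs are placeholders, and the psql wrapper is purely for illustration.

# Roughly the oltp_read_only query mix (range width = --range_size, here 100 rows).
psql -d sbtest <<'SQL'
SELECT c FROM sbtest1 WHERE id = 500123;                                        -- point select
SELECT c FROM sbtest1 WHERE id BETWEEN 500123 AND 500222;                       -- simple range scan
SELECT SUM(k) FROM sbtest1 WHERE id BETWEEN 500123 AND 500222;                  -- range aggregation
SELECT c FROM sbtest1 WHERE id BETWEEN 500123 AND 500222 ORDER BY c;            -- ordered range
SELECT DISTINCT c FROM sbtest1 WHERE id BETWEEN 500123 AND 500222 ORDER BY c;   -- distinct range
SQL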

QPS benchmark results on 1 connection and scan=100

A few things are clear:

  1. On network-attached storage (gp3, io2), Postgres 18 in sync and worker modes performs noticeably better than Postgres 17 and than 18 with io_uring. I'll admit, this surprised me! My expectation was that io_uring would perform as well as, if not better than, all of these options.
  2. The latency of gp3 and even io2 is clearly a factor in this difference. On an instance with a low-latency local NVMe drive, all options are much more evenly matched.
  3. The latency / IOPS of gp3 and even the very expensive io2 drive are a limiting factor. Local disks outperform in all configurations.
  4. For straight-line performance with short-lived queries, Postgres 18 is much faster. Welcome improvements!

Here is the same test except with --range_size=10000. This means the workload has much larger scans / aggregates, which means more sequential I/O and lower QPS:

QPS benchmark results on 1 connection and scan=10000

Local disks still clearly outperform, but the difference between the other three options is less stark. This is due to a combination of (a) more sequential I/O and, more importantly, (b) more CPU work (aggregating 10k rows is more CPU-intensive than aggregating 100 rows). Additionally, the delta between Postgres 17 and 18 is much smaller.

Below is an interactive visual comparing the Postgres 17 results for all instance types with the best performer on Postgres 18, workers. Click on the thumbnails to add or remove lines from the graph and compare various combinations. This is based on a 10-second sample rate.

In real-world scenarios, we have many simultaneous connections and many reads happening at once. Let's look at how each of these servers handles the same benchmark but with much higher load across 50 connections.

Of course, oltp_read_only does not capture a realistic OLTP workload, especially since it does not include writes, but we use it as a proxy for workloads with high read demands. Below, we show the average QPS for all of the 50-connection oltp_read_only with --range_size=100.

QPS benchmark results on 50 connections and scan=100

Now with a high level of parallelism and increased I/O demand, several additional things are clear:

  1. IOPS and throughput are clear bottlenecks for each of the EBS-backed instances. The different versions / I/O settings don't make a huge difference in such cases.
  2. As we increase the EBS capabilities, the QPS grows in lockstep, and the local-NVMe instance outperforms them all.
  3. Postgres 18 with sync and worker has the best performance on all the EBS-backed instances, by a small margin.

Again, the same benchmark but with --range_size=10000.

QPS benchmark results on 50 connections and scan=10000

The gp3-10k and io2-16k instances get much closer to local-disk performance. However, this is because we have made the benchmark much more CPU-bound than I/O-bound, so the low latency of local disks gives less of an advantage (though they are still the best!). But importantly, we finally have a scenario where io_uring wins: on the NVMe instance, it slightly outperforms the other options.

Below we again compare these results for Postgres 17 and Postgres 18, workers. Click on the thumbnails to add or remove lines from the graph and compare various combinations.

These same benchmarks were also executed with 10 concurrent connections. The results are pretty similar so they will not all be shown, but I do want to point out this graph where --range_size=100:

QPS benchmark results on 10 connections and scan=100

Look carefully at the first bar group (for gp3-3k). The io_uring setting performed significantly worse than the rest. But if you look at that same part of the graph when there were 50 connections, io_uring performs only slightly worse than the rest. To me, this indicates that io_uring performs well when there's lots of I/O concurrency, but in low-concurrency scenarios it isn't as beneficial.

Cost should always be a consideration when comparing infrastructure setups. Here are the on-demand costs of each server configuration in AWS:

  • r7i with gp3 3k IOPS and 125 MB/s: $442.32/mo
  • r7i with gp3 10k IOPS and 500 MB/s: $492.32/mo
  • r7i with io2 16k IOPS: $1,513.82/mo
  • i7i with local NVMe (no EBS): $551.15/mo

And keep in mind, the first three only have 700 GB of storage, whereas the i7i has a 1.8 TB volume! The server with a local NVMe disk is the clear price-performance winner.

Given my excitement over the new io_uring capabilities of Postgres 18, I was expecting it to win in many more scenarios. So what's going on here?

For one, this is a very specific type of workload. It is read-only, and does a combination of point selects, range scans, and range aggregations. io_uring surely has other workloads where it would shine. It's also possible that with different postgresql.conf tunings, we'd see improvements from io_uring.

While writing this, I stumbled across Tomas Vondra's excellent blog discussing the new io_method options, how to tune them, and the pros and cons of each. He makes several good points regarding why workers would outperform io_uring, and I recommend you read it. In short:

  • Index scans don't (yet) use AIO.
  • Though the I/O happens in the background with io_uring, the checksums / memcpy can still be a bottleneck.
  • workers allows better parallelism for I/O from the perspective of a single process.

So there are legitimate cases where io_uring won't always perform better! I'd love to see further benchmarks from others, spanning other workload types and configurations. You can find many of the configs used for these tests in the appendix.

Though narrow, this was a fun experiment to compare performance of Postgres versions and I/O settings. My key takeaways are:

  • Postgres 18 brings nice I/O improvements and configuration flexibility. Great job to the maintainer team!
  • Local disks are the clear winner. When you have low-latency I/O and immense IOPS, the rest matters less. This is why PlanetScale Metal makes for best-in-class database performance.
  • Using io_method=worker was a good choice as the new default. It comes with a lot of the "asynchronous" benefits of io_uring without relying on that specific kernel interface, and can be tuned by setting io_workers=X (see the sketch after this list).
  • There's no one-size-fits-all best I/O configuration.
  • Though network-attached storage does benefit, the new worker I/O configuration doesn't help those scenarios as much as one might hope.
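
For completeness, this is roughly what that io_workers tuning looks like. The value below is a hypothetical example, not a recommendation from the benchmark; these tests left io_workers at its default of 3 (see the appendix).

# Sketch: use the worker I/O method and size its pool (value is hypothetical).
psql -c "ALTER SYSTEM SET io_method = 'worker';"
psql -c "ALTER SYSTEM SET io_workers = 8;"   # default is 3
# io_method only takes effect after a server restart.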

What else do you want to see benchmarked? Reach out to let us know.

Here is a selection of the critical custom-tuned Postgres configs used for this benchmark:

shared_buffers = 16GB          # 25% of RAM
effective_cache_size = 48GB    # 75% of RAM
work_mem = 64MB
maintenance_work_mem = 2GB

wal_level = replica
max_wal_size = 16GB
min_wal_size = 2GB
wal_buffers = 16MB
checkpoint_completion_target = 0.9

random_page_cost = 1.1
effective_io_concurrency = 200
default_statistics_target = 100

max_worker_processes = 8
max_parallel_workers_per_gather = 4
max_parallel_workers = 8
max_parallel_maintenance_workers = 4

bgwriter_delay = 200ms
bgwriter_lru_maxpages = 100
bgwriter_lru_multiplier = 2.0

autovacuum = on
autovacuum_max_workers = 4
autovacuum_naptime = 10s
autovacuum_vacuum_scale_factor = 0.05
autovacuum_analyze_scale_factor = 0.025

logging_collector = on
...more log configs...

shared_preload_libraries = 'pg_stat_statements'
track_activity_query_size = 2048
track_io_timing = on

jit = on

# io_workers left at default = 3