Postgres 荣登分析型数据库速度榜首
Postgres Just Cracked the Top Fastest Databases for Analytics

原始链接: https://www.mooncake.dev/blog/clickbench-v0.1

pg_mooncake是一个PostgreSQL扩展,它在熟悉的Postgres环境中实现了近乎顶级的分析性能。它通过实现针对分析工作负载优化的列存储格式来实现这一点,而这些工作负载通常受益于ClickHouse等专业数据库。与传统的基于行的存储不同,列存储允许顺序读取相关的列,从而提高速度和压缩率。 此外,pg_mooncake利用DuckDB进行矢量化执行,批量处理数据,通过SIMD获得显著的性能提升。至关重要的是,所有表元数据和管理都直接在PostgreSQL中处理,从而实现更快的查询规划和高级功能,例如文件跳过。 结果是Postgres分析能力的显著飞跃,使其能够在ClickBench等基准测试中与专业数据库竞争,同时保留Postgres生态系统的更广泛优势。该项目是开源的(MIT许可),代表着简化数据堆栈的一步。它可在Neon Postgres上使用,并即将登陆Supabase。

Hacker News 的讨论围绕着 Postgres 通过 Mooncake 扩展获得的新的分析能力展开。一位用户 bhouston 分享了使用标准 Postgres 进行大规模分析的困境,提到在一个 20 亿条记录的表上查询速度很慢,因此选择了 BigQuery,因为它速度快、灵活且经济高效。 其他人反驳道,带有 Timescale 或 Citus 等列式存储扩展的 Postgres 仍然有效,尤其是在较小的数据集上,而像 ClickHouse 这样的专用数据库在 BigQuery 的成本随着数据和查询量增长时可能更胜一筹。 pg_mooncake 的创建者强调了其开源特性,利用 DuckDB 进行矢量化查询执行,将数据以 Parquet 格式存储在 S3 上,并利用 Postgres 进行元数据管理。 讨论还涉及到平衡 Postgres 兼容性和性能的挑战,以及 ClickHouse 和 Snowflake 等替代方案,以及开源扩展背后的商业模式。用户质疑 Mooncake 的性能提升是否足以弥补其潜在的复杂性以及对 OLTP 工作负载的潜在影响。

原文

TLDR: We spent a few months optimizing PostgreSQL and made it to the Top 10 on ClickBench, a benchmark typically dominated by specialized analytics databases.

What’s more, all compute is within Postgres, and all tables are managed directly by PostgreSQL—it’s not a simple wrapper. This is the story of pg_mooncake.

Clickbench

ClickBench is the definitive benchmark for real-time analytics databases, originally designed to showcase the performance of ClickHouse. It evaluates databases on their ability to handle real-world analytics workloads, including high-volume table scans and complex aggregations.

Historically, ClickHouse and other purpose-built analytics databases have led this benchmark, while general-purpose databases like Postgres/MySQL have lagged behind by 100x. But we wanted to challenge that perception—and Postgres delivered.

How to Build Analytics in Postgres?

When most people think of PostgreSQL, they think of a rock-solid OLTP database, not a real-time analytics powerhouse. However, PostgreSQL’s extensibility makes it uniquely capable of punching above its weight class. Here’s how we approached the challenge:

1. Build a PG Extension

We leveraged PG's extensibility to build pg_mooncake as a native PG extension.

2. Storage Format: Columnstore

For analytics workloads, a columnstore format is essential. ClickBench workloads typically involve wide tables, but queries only access a small subset of columns.

  • In a row store (like PostgreSQL heap table), reading a single column means jumping through rows.
  • In a columnstore, reads are sequential, which is faster (and it also enables better compression and execution on compressed data).

3. Vectorized Execution

To enhance query execution speed, we embedded DuckDB as the execution engine for columnstore queries. This means across the execution pipeline, data is processed in batches instead of row by row, enabling SIMD, which is a lot more efficient for scans, groupbys, and aggregations.

4. Table Metadata & Management Directly in PostgreSQL

Efficient metadata handling is critical for real-time analytics, since fixed latency matters. Instead of fetching metadata or statistics from storage formats like Parquet, we store them directly in PG.

  • This enables faster query planning.
  • It also allows for advanced features like file skipping, significantly improving performance.

More details on the architecture.

What Does It Mean?

PostgreSQL is no longer just an OLTP workhorse. With careful tuning and engineering, it’s capable of delivering analytics performance on par with specialized databases while retaining the flexibility and ecosystem advantages of PostgreSQL.

After building advanced data systems for a decade, part of my core belief is: we can make the data stack a lot simpler.

pg_mooncake is MIT licensed, so if you don’t believe it, give it a try.

We launched v0.1 last week. And is now available on Neon Postgres and coming to Supabase.

🥮

联系我们 contact @ memedata.com