PostgreSQL 和 UUID 作为主键

PostgreSQL 和 UUID 作为主键
PostgreSQL and UUID as Primary Key

原始链接: https://maciejwalkowiak.com/blog/postgres-uuid-primary-key/

本文讨论使用通用唯一标识符 (UUID) 作为数据库中的主键，特别关注它们在 PostgreSQL 中的高效实现。虽然 UUID 提供了唯一的标识符并且易于跨系统分发，但它们的大尺寸引发了对其适用性的质疑。作者强调，本文并不是要争论 UUID 是否是键的理想格式，而是探讨在 PostgreSQL 中处理它们的有效方法。在 PostgreSQL 中，虽然 UUID 可以存储为字符串，但效率较低，因为 PostgreSQL 提供了一种名为“uuid”的特殊数据类型。将 UUID 存储为“文本”而不是其指定的数据类型会增加不必要的开销，特别是对于涉及许多行的较大数据集。通过创建两个相同的表进行实验 - 一个将 ID 存储为文本主键，另一个将 UUID 数据类型用于相同目的。在使用批量插入向两个表填充一千万行后，作者发现，与使用“text”数据类型的表相比，使用“text”数据类型的表的总大小显着更高（大约 54%），索引大小也更大（惊人地增加了 85%）。使用“uuid”数据类型的表。这些发现导致 PostgreSQL 存储和索引的页面使用量增加，最终影响执行查询时的性能，特别是考虑到索引大小经常超过可用 RAM。由于 UUID 不太适合传统的 B 树索引，因此最有效的方法是根据时间生成 UUID (UUID v7)，使它们成为基于时间的 B 树索引的合适候选者。使用 java-uuid-generator 等库有助于生成 UUID v7，以提高执行 INSERT 操作时的性能。但是，考虑到与 UUID 相关的缺点，建议在选择 UUID 作为主键之前探索替代方案，例如 TSID。

设计数据库模式时，请考虑以下准则： 1. 使用 biginteger 作为 B 树友好的主键。 2. 选择字符串编码的通用唯一标识符 (UUID) 作为替代记录定位器选项。 3. 优先考虑简单的选项，例如 PNR 式定位器，特别是对于处理 ID 的非技术用户。 4. 避免在特定服务或应用程序的架构中混合主键类型。 5. 仅将 UUIDv7 用于固有时间编码的数据，因为它可能会泄漏计时信息。 6. 避开 hashid，因为它们缺乏加密特性并且不利于与人类的交互。 7. 关注点分离和机械同情是创建有效数据库设计的关键因素。 8. 警惕 Stripe 所谓的“随机 ID”，因为它们包含元数据，包括时间戳、分片和引用键，限制了它们作为纯粹随机生成的标识符的适用性。 9. 鼓励使用数字主键与 UUID 相结合，从而改进内部排序和一致的外部表示。这些建议旨在提供稳定且有意义的数据库结构，能够适应不断变化的环境和技术，而不会影响客户记录或链接。

UUIDs are often used as database table primary keys. They are easy to generate, easy to share between distributed systems and guarantee uniqueness.

Considering the size of UUID it is questionable if it is a right choice, but often it is not up to us to decide.

This article does not focus on "if UUID is the right format for a key", but how to use UUID as a primary key with PostgreSQL efficiently.

Postgres Data Types for UUID

UUID can be seen as a string and it may be tempting to store them as such. Postgres has a flexible data type for storing strings: text and it is often used as a primary key to store UUID values.

Is it a right data type? Definitely not.

Postgres has a dedicated data type for UUIDs: uuid. UUID is a 128 bit data type, so storing single value takes 16 bytes. text data type has 1 or 4 bytes overhead plus storing the actual string.

These differences are not that important in small tables, but become an issue once you start storing hundreds of thousands or millions of rows.

I run an experiment to see what is the difference in practice. There are two tables that have just one column - an id as a primary key. First table uses text, second uuid:

I did not specify the type for primary key index, so Postgres uses the default one - B-tree.

Then I inserted 10 000 000 rows to each table using batchUpdate from Spring's JdbcTemplate:

I run a query to find the table size and the index size:

Table that uses text is 54% larger and the index size 85% larger. This is also reflected in number of pages Postgres uses to store these tables and indexes:

Larger size of tables, indexes and bigger number of tables means that Postgres must perform work to insert new rows and fetch rows - especially once index sizes are larger than available RAM memory, and Postgres must load indexes from disk.

UUID and B-Tree index

Random UUIDs are not a good fit for a B-tree indexes - and B-tree index is the only available index type for a primary key.

B-tree indexes work the best with ordered values - like auto-incremented or time sorted columns.

UUID - even though always looks similar - comes in multiple variants. Java's UUID.randomUUID() - returns UUID v4 - which is a pseudo-random value. For us the more interesting one is UUID v7 - which produces time-sorted values. It means that each time new UUID v7 is generated, a greater value it has. And that makes it a good fit for B-Tree index.

To use UUID v7 in Java we need a 3rd party library like java-uuid-generator:

Then we can generate UUID v7 with:

This theoretically should improve the performance of executing INSERT statements.

How UUID v7 affects INSERT performance

I created another table, exactly the same as bank_transfer_uuid but it will store only UUID v7 generated using the library mentioned above:

Then I run 10 rounds of inserting 10000 rows to each table and measured how long it takes:

The results look a little random especially when comparing times for a table with regular text column and uuid v4:

BUT we can clearly see, that inserting UUID v7 is ~2x faster and inserting regular UUID v4.

Summary

As mentioned at the beginning - due to UUID length - even with all these optimizations, it is not the best type for a primary key. If you have an option to choose, take a look at TSID maintained by Vlad Mihalcea.

But if you must or for some reason want to use UUIDs, take into account the optimizations I mentioned. Also keep in mind that such optimizations make a difference for large datasets. If you're storing hundreds or even few thousands of rows, and have a low traffic, you will likely not see any difference in the application performance. But if there's a chance you will have large dataset or big traffic - it is better to do it right from the beginning as changing primary keys can be quite a challenge.

At the end a disclarimer - I am not pretending to be a Postgres expert - I am rather sharing what I've learned.

As usual, I hope you found it useful! Feel free to drop a comment if you found any mistake or have a question. Also, feel free to reach out to me on twitter.com/maciejwalkowiak.