Build your own database

Original link: https://www.nan.fyi/database

## Building a Key-Value Database from Scratch

This guide walks through building a key-value database from scratch, starting with a simple file-based approach. Initially, data is stored as key-value pairs in a text file: setting a value appends the pair to the end of the file, and retrieving one means iterating through the file looking for a matching key. Updates replace values in place and deletes remove records, a process that is inefficient because of the data that may have to be moved. To improve on this, an append-only approach is adopted that treats updates as new inserts. This avoids moving data but introduces duplicate keys, so searches must find the *last* occurrence of a key; deletes are handled with "tombstone" records. The file then grows without bound and searches slow down. The fixes are *compaction* (periodically shrinking the file by removing stale data) and *indexes*. A basic index stores key-offset pairs to speed up lookups, but it consumes memory and slows down writes; more advanced techniques such as *sorted string tables* (SSTs) and *sparse indexes* optimize further. Ultimately, the process yields a database that uses an in-memory component for speed and an on-disk log for durability, forming a *log-structured merge tree* (LSM tree), the foundation of databases like LevelDB and DynamoDB.

## Build Your Own Database - Summary

This Hacker News discussion centers on a new resource (nan.fyi) that guides readers through building their own database, inspired by Designing Data-Intensive Applications. The article sparked discussion about the complexity involved and the value of understanding database fundamentals. Several commenters shared their experiences building key-value databases, noting how widely the scope can vary, from simple JSON serialization to fully replicated systems. A recurring theme was the educational value of the exercise: even "reinventing the wheel" leads to a greater appreciation of existing solutions such as SQL. The discussion also touched on design choices in the resource, such as the use of *lorem ipsum* in the examples (considered distracting) and the accuracy of the explanations of LSM trees and DynamoDB scaling. There was also a request for an RSS feed for the site. Ultimately, the thread highlights the appeal of learning from first principles and how surprisingly challenging database implementation can be, even for experienced developers. Some even jokingly proposed extreme variants, such as a write-only database!

Original article

A step-by-step guide to building a key-value database from scratch.

If you were to build your own database today, not knowing that databases exist already, how would you do it? In this post, we'll explore how to build a key-value database from the ground up.

A key-value database works more or less like objects in JavaScript—you can store values using a key and retrieve them later using that same key:

$ db set 'hello' 'world'

$ db get 'hello'

world

Let's find out how they work!


The Humble File

Databases were made to solve one problem:

Problem

How do we store data persistently and then efficiently look it up later?

The typical way to store any kind of data persistently in a computer is to use a file. When we want to store data, we add the key-value pair to the file:

  1. $ db set 1 "Lorem ipsum"
  2. $ db set 18 "dolor sit"

db.txt

  • 001: Lorem ipsum
  • 018: dolor sit

When we want to look for a specific key, we iterate through the pairs to see if there's a matching key:

  1. $ db set 1 "Lorem ipsum"
  2. $ db set 18 "dolor sit"

db.txt

  • 001: Lorem ipsum
  • 018: dolor sit
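
In code, these two operations are pretty small. Here's a minimal Node.js sketch of set and get against a db.txt file, using the same key:value line format as the listings above (the helper names here are just for illustration):

import fs from "fs";

const DB_FILE = "db.txt";

// set: append a "key:value" line to the end of the file
function set(key, value) {
  fs.appendFileSync(DB_FILE, `${key}:${value}\n`);
}

// get: scan the file line by line and return the first matching key's value
function get(key) {
  if (!fs.existsSync(DB_FILE)) return undefined;
  const lines = fs.readFileSync(DB_FILE, "utf8").split("\n");
  for (const line of lines) {
    const separator = line.indexOf(":");
    if (line.slice(0, separator) === String(key)) {
      return line.slice(separator + 1);
    }
  }
  return undefined;
}

set(1, "Lorem ipsum");
set(18, "dolor sit");
console.log(get(18)); // "dolor sit"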

For updates, we'll find the key and replace the value in-place:

  1. $ db set 1 "Lorem ipsum"
  2. $ db set 18 "dolor sit"

db.txt

  • 001: Lorem ipsum
  • 018: dolor sit

And for deletes, we'll delete the record from the file:

  1. $ db set 1 "Lorem ipsum"
  2. $ db set 18 "dolor sit"

db.txt

  • 001: Lorem ipsum
  • 018: dolor sit

Easy! We're done, right?


Mutable Updates

This approach, simple as it is, doesn't actually work very well in practice. The problem lies with the way we're doing updates and deletes—they're wholly inefficient.

To a computer, our file looks something like this—nothing more than a long sequence of bytes:

001:Lorem␣ipsum\n018:dolor␣sit\n005:adipiscing␣elit.\n014:Vestibulum␣varius\n002:vel␣mauris\n007:consectetur␣adipiscing␣elit.\n010:Vestibulum␣varius\n016:vel␣mauris\n003:consectetur␣adipiscing␣elit.

When we go to update or delete a record, we're currently updating that record in-place, which means we potentially have to move all of the data that comes after that record:

001:Lorem␣ipsum\n018:dolor␣sit\n005:adipiscing␣elit.␣vel␣mauris\n014:Vestibulum␣varius\n002:vel␣mauris\n007:consectetur␣adipiscing␣elit.\n010:Vestibulum␣varius\n016:vel␣mauris\n003:consectetur␣adipiscing␣elit.

In this case, updating the record 005 to "adipiscing␣elit.␣vel␣mauris" means moving all of the records that come after it by 11 bytes (the length of the added string "␣vel␣mauris"). This can quickly get really costly, especially as our database grows in size!
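
To make that concrete, here's a rough sketch of what an in-place update has to do with a plain file. There's no way to insert or remove bytes in the middle of a file, so we end up reading the data, editing the record, and writing everything back out:

import fs from "fs";

// A hypothetical in-place update: changing one record means rewriting
// everything from that record onward (here, the whole file for simplicity).
function updateInPlace(key, newValue) {
  const lines = fs.readFileSync("db.txt", "utf8").split("\n");
  const updated = lines.map((line) =>
    line.startsWith(`${key}:`) ? `${key}:${newValue}` : line
  );
  fs.writeFileSync("db.txt", updated.join("\n"));
}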


Append-Only Files

One way to work around the update problem is to make records immutable. In other words, we add the constraint that we can only add new records to the end of the file and never update or delete existing ones.

With this approach, updates are treated the same as inserts—just add a new record to the end of the file:

  1. $ db set 1 "Lorem ipsum"
  2. $ db set 18 "dolor sit"

db.txt

  • 001: Lorem ipsum
  • 018: dolor sit

But now we have another problem—there are duplicate keys in the file!

To work around this, we have to change our search algorithm to look for the last occurrence of the key instead of the first:

  1. $ db set 1 "Lorem ipsum"
  2. $ db set 18 "dolor sit"

db.txt

  • 001: Lorem ipsum
  • 018: dolor sit

To delete records, we create a special "tombstone" record that marks the key as deleted. There's no single way to do this, but one way is to use a special value like null:

  1. $ db set 1 "Lorem ipsum"
  2. $ db set 18 "dolor sit"

db.txt

  • 001: Lorem ipsum
  • 018: dolor sit
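
The read path changes in two small ways: we scan for the last occurrence of the key, and we treat the tombstone value as "not found". A minimal sketch, reusing the key:value format and the null tombstone from above:

import fs from "fs";

const TOMBSTONE = "null"; // special value that marks a key as deleted

// get: scan the whole file, remembering the *last* value we see for the key
function get(key) {
  const lines = fs.readFileSync("db.txt", "utf8").split("\n");
  let result;
  for (const line of lines) {
    const separator = line.indexOf(":");
    if (line.slice(0, separator) === String(key)) {
      result = line.slice(separator + 1); // later records win
    }
  }
  return result === TOMBSTONE ? undefined : result;
}

// delete: append a tombstone record instead of touching existing data
function del(key) {
  fs.appendFileSync("db.txt", `${key}:${TOMBSTONE}\n`);
}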

And there we have it! We have a key-value database that uses a file as its storage mechanism. Using it, we can store, find, update, and delete key-value pairs.


Now this implementation isn't perfect; right now, there are two major issues:

  1. The file can get very large. Since we're only appending to the file, the file will grow infinitely over time. Not good!

  2. Searching is slow. To search for a specific key, we have to potentially iterate through all records in the database. For a database with millions of records, this can take a while!

How can we fix these problems?


Keeping Files Small

Problem

How do we make sure the file doesn't grow indefinitely? Because we're using an append-only file, we need some mechanism to periodically "shrink" the file so it doesn't eventually take over our entire hard drive.

Take a look at our database here after a few updates and deletes:

  1. $ db set 1 "Lorem ipsum"
  2. $ db set 18 "dolor sit"
  3. $ db set 7 "adipiscing elit."
  4. $ db delete 7
  5. $ db set 10 "consectetur adipiscing elit."
  6. $ db delete 1

db.txt

  • 001: Lorem ipsum
  • 018: dolor sit
  • 007: adipiscing elit.
  • 007: null
  • 010: consectetur adipiscing elit.
  • 001: null

Our database file has six entries, but only two represent actual records—the rest are either deleted or contain stale data. If we can clear all the irrelevant data, we can shrink the file by over 66%!


Segments and Compaction

Here's an idea: once a file exceeds a certain size, we'll close the file and create a new one. While the new file ingests new data (in the same way we've been doing so far), we'll compact the old file by deleting all of its irrelevant data.

In other words, once a file has been closed, we stop adding new data to it.
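
A compaction pass over a closed segment might look something like this sketch, which keeps only the latest value for each key and drops tombstones (the file names are just placeholders):

import fs from "fs";

function compact(segmentFile, compactedFile) {
  const lines = fs.readFileSync(segmentFile, "utf8").split("\n").filter(Boolean);

  // Later entries overwrite earlier ones, leaving only the latest value per key.
  const latest = new Map();
  for (const line of lines) {
    const separator = line.indexOf(":");
    latest.set(line.slice(0, separator), line.slice(separator + 1));
  }

  // Drop deleted keys and write the survivors to a new, smaller file.
  const survivors = [...latest]
    .filter(([, value]) => value !== "null")
    .map(([key, value]) => `${key}:${value}`);

  fs.writeFileSync(compactedFile, survivors.join("\n") + "\n");
}

One subtlety: once we have multiple segments, a tombstone can only be thrown away when no older segment still holds a value for that key; otherwise the old value would reappear after compaction.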

Here, we've set the maximum file size to seven records. The segment below is already full, so adding one more record will close this file and start a new one:

  • 001: Lorem ipsum
  • 018: dolor sit
  • 007: adipiscing elit.
  • 007: null
  • 010: consectetur adipiscing elit.
  • 001: null
  • 020: Vestibulum varius

Now, our database consists of two different files which we'll call segments. Each segment will usually become a lot smaller after compaction, which means we can merge them together as part of the compaction process.

With that, we've made a mechanism to stop our database from growing indefinitely!


Your First Index

Our next problem is search performance:

Problem

How do we make searching fast? Right now, we have to iterate through all of the records in the database to find a specific key. This is super slow!

What if we use objects? That's right, these little guys:

JavaScript objects, otherwise known as hash tables or dictionaries, are really efficient at storing and looking up key-value pairs:

const hashTable = {
  hello: "world",
  foo: "bar",
  baz: "qux",
};

const value = hashTable["hello"]; // "world"

It doesn't matter how many records there are—the time it takes to look up and retrieve a value in a hash table is more or less constant. The catch is they must live in memory.


Aside: In-Memory vs. On-Disk

When you write a variable in your code, the computer will "remember" the value of that variable only for as long as the program is running.

script.js

let x = 1;

x = x + 1;
console.log(x); // 2

x = x + 1;
console.log(x); // 3

$ node script.js

This program will always print 2 and 3 because the value of x "resets" every time we run the program. This is because x is stored in-memory , and any value stored in memory is discarded when the program stops.

If we want our data to "stick" between runs, we'll need to store it on-disk—in other words, a file.

script.js

import fs from "fs";

// Read the previous value from disk, falling back to 1 on the first run.
let x = fs.existsSync("data.txt")
  ? Number(fs.readFileSync("data.txt", "utf8")) || 1
  : 1;

x = x + 1;
console.log(x);

x = x + 1;
console.log(x);

fs.writeFileSync("data.txt", String(x));

$ node script.js

This time, x will print 2 and 3 the first run, and 4 and 5 the second run.

The tradeoff to persistence is performance—accessing data from memory is about 80x faster on average than accessing it from disk.


Here's how the index will work. For every record that we have in our database, we'll store that record's offset—the number of bytes from the beginning of the file to the start of the record—in the index:

The second record, 18: dolor sit, for example, has an offset of 15 because:

  1. Each character is 1 byte large;

  2. The first record is 13 characters long (1:Lorem ipsum);

  3. The first record ends with a newline character, which is (at most) 2 bytes long;

This gives us an offset of 13 + 2 = 15.

One thing to note is that we need an index for each segment because the offset is relative to the start of the file—in other words, the start of each segment.
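
In code, maintaining the index just means recording where each record starts as we append it. A minimal sketch, assuming one index object per segment and one byte per character (as in the examples above):

import fs from "fs";

const index = {}; // one index per segment: key -> byte offset within the file

// set: append the record and remember where it starts
function set(segmentFile, key, value) {
  // The new record begins wherever the segment currently ends.
  const offset = fs.existsSync(segmentFile) ? fs.statSync(segmentFile).size : 0;
  fs.appendFileSync(segmentFile, `${key}:${value}\n`);
  index[key] = offset;
}

// get: jump straight to the record's offset instead of scanning the whole file
function get(segmentFile, key) {
  const offset = index[key];
  if (offset === undefined) return undefined;
  const fd = fs.openSync(segmentFile, "r");
  const buffer = Buffer.alloc(256); // assumes records fit in 256 bytes
  const bytesRead = fs.readSync(fd, buffer, 0, buffer.length, offset);
  fs.closeSync(fd);
  const line = buffer.toString("utf8", 0, bytesRead).split("\n")[0];
  return line.slice(line.indexOf(":") + 1);
}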


Searching With Indices

Using an index, our search algorithm can now run a lot more efficiently:

  1. Starting at the most recent segment, look up the key in the index;

  2. If the key is found, read the record at the offset;

  3. If the key is not found, move on to the next segment;

  4. Repeat (2) and (3) until the key is found or all segments have been searched.

Index (first segment)

  • 001: 0
  • 018: 15
  • 007: 29
  • 010: 49
  • 020: 71

Segment 1

  • 001: Lorem ipsum
  • 018: dolor sit
  • 007: adipiscing elit.
  • 010: consectetur elit.
  • 020: Vestibulum varius

Segment 2

  • 004: dolor sit amet
  • 006: adipiscing elit.
  • 012: consectetur elit.
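
Here's a sketch of that search algorithm, assuming we keep our segments in a list from newest to oldest, each paired with its own in-memory index (this particular layout is just an assumption for illustration):

import fs from "fs";

// segments: newest first, each shaped like { file, index }, where index maps
// a key to its byte offset in that segment's file.
function get(segments, key) {
  for (const { file, index } of segments) {
    const offset = index[key];
    if (offset !== undefined) {
      // Step 2: the key is in this segment's index, so read the record directly.
      const fd = fs.openSync(file, "r");
      const buffer = Buffer.alloc(256); // assumes records fit in 256 bytes
      const bytesRead = fs.readSync(fd, buffer, 0, buffer.length, offset);
      fs.closeSync(fd);
      const line = buffer.toString("utf8", 0, bytesRead).split("\n")[0];
      return line.slice(line.indexOf(":") + 1);
    }
    // Step 3: not in this segment, fall through to the next (older) one.
  }
  return undefined; // every segment searched, key not found
}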


Updating Indices

An index is only useful if it's in sync with our data. Whenever we update, delete, or insert a record, we have to change the index accordingly:

  • 001: Lorem ipsum
  • 018: dolor sit

Notice what this implies—writing to the database is slower with an index! This is one of the tradeoffs of using an index; we can search for data much faster at the cost of slower writes.


Tradeoffs

An index is great because it lets us query our database much faster, but there are some problems with our specific hash table implementation:

  1. Keys have to fit in memory. Since we're using an in-memory hash table as our index, all of the keys in our database must fit in memory. This means there's a limit on the number of keys we can store!

  2. Range queries are inefficient. Our index wouldn't help with range queries; if we wanted to find all the records with keys between 12 and 18, for example, we'd have to iterate through the entire database!


Sorted String Tables

Here's an idea: what if we ensure our database is always sorted by key? By sorting our data, we can immediately make range queries fast:

Unsorted

Find all values > 2 and < 6: we have to check every record, skipping any that are out of bounds (like 7).

Sorted

Find all values > 2 and < 6: we can skip everything below the lower bound (like 1) and stop as soon as we pass the upper bound.
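
For example, with records sorted by key, a range query (say, all records with keys greater than 2 and less than 6) can skip straight past the lower bound and stop as soon as it reaches the upper bound. A minimal in-memory sketch:

// records: [key, value] pairs sorted by key
function rangeQuery(records, low, high) {
  const results = [];
  for (const [key, value] of records) {
    if (key <= low) continue; // at or below the lower bound: keep skipping
    if (key >= high) break;   // at or past the upper bound: stop early
    results.push([key, value]);
  }
  return results;
}

const sorted = [[1, "a"], [2, "b"], [4, "c"], [5, "d"], [7, "e"]];
console.log(rangeQuery(sorted, 2, 6)); // [[4, "c"], [5, "d"]]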


Sparse Indices

One benefit of sorting our data is that we no longer need to store the offset of every record in memory.

Take a look at this database with four records. Since there's no logical order to the records, there's no way to determine where a record is without storing its key or searching through the entire database.

  • 001: Lorem ipsum
  • 007: amet, consectetur
  • 010: adipiscing elit.
  • 018: dolor sit

  • 001: 0
  • 007: 15
  • 010: 36
  • 018: 57

Knowing that 10 has an offset of 36 doesn't help us find where 18 is.


Now if these records were sorted, we could determine the location of each record using any of the keys in the index, even if it's not the key we're looking for.

Let's say our database is sorted but we only have the offset for the key 10:

  • 001: Lorem ipsum
  • 007: amet, consectetur
  • 010: adipiscing elit.
  • 018: dolor sit

Let's say we want to find the key 18. We know that 18 is greater than 10, which means it must be after 10 in the database. In other words, we can start searching for 18 from 10's offset—36 in this case.

  • 001: Lorem ipsum
  • 007: amet, consectetur
  • 010: adipiscing elit.
  • 018: dolor sit

While this is certainly slower than having the offset for 18 directly, it's still faster than looping through the database in its entirety.

The real unlock here lies in being able to control the trade-off between memory and performance: a denser index means faster lookups, but more memory usage.
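
Here's a sketch of a lookup with a sparse index: find the greatest indexed key at or below the one we want, then scan forward from its offset (this assumes records are sorted on disk and one byte per character):

import fs from "fs";

// sparseIndex maps only *some* keys to byte offsets, e.g. { 10: 36 }.
function get(file, sparseIndex, key) {
  // Find the closest indexed key at or before the one we're looking for.
  const startKey = Object.keys(sparseIndex)
    .map(Number)
    .filter((k) => k <= key)
    .sort((a, b) => b - a)[0];
  const startOffset = startKey === undefined ? 0 : sparseIndex[startKey];

  // Scan forward from that offset; because the file is sorted by key,
  // we can stop as soon as we pass the key we want.
  const lines = fs.readFileSync(file, "utf8").slice(startOffset).split("\n");
  for (const line of lines) {
    const separator = line.indexOf(":");
    const recordKey = Number(line.slice(0, separator));
    if (recordKey === key) return line.slice(separator + 1);
    if (recordKey > key) break;
  }
  return undefined;
}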


Sorting in Practice

Ensuring our database is always sorted is much easier said than done; by definition, sorting data requires moving around records as new ones get added—something that cannot be done efficiently when we're storing data on-disk. This brings us to our problem:

Problem

How do we keep our data sorted and append-only? It's too slow to sort the data on-disk every time we add a new record; is there another way?


The trick is to first sort the data in memory, and then write it to disk.

  1. When we add a new record, add it to a sorted in-memory list;

  2. When our in-memory list gets too large, we'll write it to disk;

  3. When we want to read a record, we'll read the in-memory list first, and then the disk if necessary.

Memory

  • 010: Lorem ipsum

The data structure used to store the in-memory list is usually one optimized for sorted data, like a balanced binary search tree or, more commonly, a skip list.


Of course, the main downside of having some of your data in-memory is that it's not persistent—if the program crashes or the computer shuts down, all of the data in the in-memory list is lost.

The fix here is thankfully pretty straightforward—every time we add a record to the list, we also write it to an append-only file on disk. This way, we have a backup in case a crash does happen (which it most certainly will).

Memory

  • 010: Lorem ipsum

On-Disk Log

  • 010: Lorem ipsum

The append-only file doesn't need to be sorted, nor does it need to contain every record in the database; it only needs the ones that are currently in memory.
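
Putting those two ideas together, a write now goes to both places. A minimal sketch, using a plain sorted array as the in-memory list for simplicity (a real implementation would reach for a balanced tree or skip list), with log.txt standing in for the append-only file:

import fs from "fs";

let memtable = []; // [key, value] pairs, kept sorted by key

function set(key, value) {
  // 1. Record the write in the append-only log first, so a crash can't lose it.
  fs.appendFileSync("log.txt", `${key}:${value}\n`);

  // 2. Insert into the sorted in-memory list, replacing any existing entry.
  memtable = memtable.filter(([k]) => k !== key);
  memtable.push([key, value]);
  memtable.sort(([a], [b]) => a - b);

  // 3. Once the memtable grows past some size threshold, flush it to a sorted
  //    segment file on disk (sketched in the recap below).
}

If the program crashes before the in-memory list is flushed, we can rebuild it on startup by replaying log.txt from the beginning.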


With that, we have our very own key-value database! Let's recap how it works.

Our database starts out empty. When we go to add a new record, we'll add it to a sorted in-memory list, keeping a copy in an append-only file in case of crashes.

Memory

  • 010: Lorem ipsum

On-Disk Log

  • 010: Lorem ipsum

When the in-memory list gets too large, we'll flush the list by writing all of the records to a file in sorted order. In the process, we'll keep note of each record's offset in an index so we can efficiently look them up later.

Memory

  • 004: dolor sit
  • 010: Lorem ipsum
  • 012: consectetur elit.

On-Disk Log

  • 010: Lorem ipsum
  • 004: dolor sit
  • 012: consectetur elit.
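
The flush step itself might look like this sketch: write the records out in sorted order (the memtable already is), note each record's offset in a fresh index, and reset the log now that everything is safely on disk (file names are placeholders again):

import fs from "fs";

function flush(memtable, segmentFile) {
  const index = {}; // key -> byte offset within the new segment
  let contents = "";

  // The memtable is already sorted, so write it out in order,
  // noting where each record starts as we go.
  for (const [key, value] of memtable) {
    index[key] = contents.length; // assumes 1 byte per character
    contents += `${key}:${value}\n`;
  }
  fs.writeFileSync(segmentFile, contents);

  // Everything in the memtable is now safely on disk, so the log can be reset.
  fs.writeFileSync("log.txt", "");

  return index; // kept in memory so this segment can be searched later
}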

When we want to look up a record, we'll first check the in-memory list. If the record isn't there, we'll check the index to see if it's in the on-disk file.

Memory

  • 011: amet, consectetur
  • 020: Vestibulum varius

On-Disk Database

  • 004: dolor sit
  • 010: Lorem ipsum
  • 012: consectetur elit.

Once a file is saved to disk, it's considered immutable, which means we can only ever read from the file and never update it. To work around this, we'll treat updates and deletes the same as inserting new records—add them to the in-memory list.

Memory

  • 004: dolor sit
  • 010: Lorem ipsum
  • 012: consectetur elit.

Treating updates and deletes as new records means our file will only ever grow larger. To prevent this, we'll occasionally compact the on-disk files by deleting all duplicate records.

  • 001: Lorem ipsum
  • 018: dolor sit
  • 007: adipiscing elit.
  • 007: null
  • 010: consectetur adipiscing elit.
  • 001: null
  • 020: Vestibulum varius


LSM Trees

What we just built actually exists in the real world—it's called an LSM tree, or Log-Structured Merge Tree.

An LSM tree works by combining an in-memory list (often called a memtable) with an on-disk file (typically called a sorted string table or SST) to create a really fast key-value database.

LSM trees are the underlying data structure used for large-scale key-value databases like Google's LevelDB and Amazon's DynamoDB, and they have proven to perform really well at scale—on Prime Day 2020, DynamoDB peaked at 80 million requests per second!

Now, LSM trees aren't perfect, and they're certainly not the only way to structure a database. In fact, relational databases like PostgreSQL or MySQL use a completely different structure called a B-Tree to store their data—but that's a deep dive for another post.

For now, thanks for reading!
