Build your own database

Original link: https://www.nan.fyi/database

## Building a Key-Value Database from Scratch

This guide walks through building a key-value database from scratch, starting with a simple file-based approach. Initially, data is stored as key-value pairs in a text file: setting a value appends the pair to the end of the file, and retrieving one means iterating through the file looking for a matching key. Updates replace values in place and deletes remove records, a process that is inefficient because of the data that may have to be moved. To improve on this, an append-only approach is adopted that treats updates as new inserts. This avoids moving data but introduces duplicate keys, so searches must find the *last* occurrence of a key; deletes are handled with "tombstone" records. The file then grows without bound and searches slow down. The fixes are *compaction* (periodically shrinking the file by removing stale data) and *indexes*. A basic index stores key-offset pairs to speed up lookups, but it consumes memory and slows down writes; more advanced techniques such as *sorted string tables* (SSTs) and *sparse indexes* optimize further. Ultimately, the process yields a database that uses an in-memory component for speed and an on-disk log for durability, forming a *log-structured merge tree* (LSM tree), the foundation of databases like LevelDB and DynamoDB.

## Build Your Own Database - Summary

This Hacker News discussion centers on a new resource (nan.fyi) that guides readers through building their own database, inspired by Designing Data-Intensive Applications. The article sparked discussion about the complexity involved and the value of understanding database fundamentals. Several commenters shared their experiences building key-value databases, noting how widely the scope can vary, from simple JSON serialization to fully replicated systems. A recurring theme was the educational value of the exercise: even "reinventing the wheel" leads to a greater appreciation of existing solutions such as SQL. The discussion also touched on design choices in the resource, such as the use of *lorem ipsum* in the examples (considered distracting) and the accuracy of the explanations of LSM trees and DynamoDB scaling. There was also a request for an RSS feed for the site. Ultimately, the thread highlights the appeal of learning from first principles and how surprisingly challenging database implementation can be, even for experienced developers. Some even jokingly proposed extreme variants, such as a write-only database!

Original article

A step-by-step guide to building a key-value database from scratch.

If you were to build your own database today, not knowing that databases exist already, how would you do it? In this post, we'll explore how to build a key-value database from the ground up.

A key-value database works more or less like objects in JavaScript—you can store values using a key and retrieve them later using that same key:

$ db set 'hello' 'world'

$ db get 'hello'

world

Let's find out how they work!


The Humble File

Databases were made to solve one problem:

Problem

How do we store data persistently and then efficiently look it up later?

The typical way to store any kind of data persistently in a computer is to use a file. When we want to store data, we add the key-value pair to the file:

  1. $ db set 1 "Lorem ipsum"
  2. $ db set 18 "dolor sit"

db.txt

  • 001: Lorem ipsum
  • 018: dolor sit

When we want to look for a specific key, we iterate through the pairs to see if there's a matching key:

  1. $ db set 1 "Lorem ipsum"
  2. $ db set 18 "dolor sit"

db.txt

  • 001: Lorem ipsum
  • 018: dolor sit
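
In code, these two operations are pretty small. Here's a minimal Node.js sketch of set and get against a db.txt file, using the same key:value line format as the listings above (the helper names here are just for illustration):

import fs from "fs";

const DB_FILE = "db.txt";

// set: append a "key:value" line to the end of the file
function set(key, value) {
  fs.appendFileSync(DB_FILE, `${key}:${value}\n`);
}

// get: scan the file line by line and return the first matching key's value
function get(key) {
  if (!fs.existsSync(DB_FILE)) return undefined;
  const lines = fs.readFileSync(DB_FILE, "utf8").split("\n");
  for (const line of lines) {
    const separator = line.indexOf(":");
    if (line.slice(0, separator) === String(key)) {
      return line.slice(separator + 1);
    }
  }
  return undefined;
}

set(1, "Lorem ipsum");
set(18, "dolor sit");
console.log(get(18)); // "dolor sit"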

For updates, we'll find the key and replace the value in-place:

  1. $ db set 1 "Lorem ipsum"
  2. $ db set 18 "dolor sit"

db.txt

  • 001: Lorem ipsum
  • 018: dolor sit

And for deletes, we'll delete the record from the file:

  1. $ db set 1 "Lorem ipsum"
  2. $ db set 18 "dolor sit"

db.txt

  • 001: Lorem ipsum
  • 018: dolor sit

Easy! We're done, right?


Mutable Updates

This approach, simple as it is, doesn't actually work very well in practice. The problem lies with the way we're doing updates and deletes—they're wholly inefficient.

To a computer, our file looks something like this—nothing more than a long sequence of bytes:

001:Lorem␣ipsum\n018:dolor␣sit\n005:adipiscing␣elit.\n014:Vestibulum␣varius\n002:vel␣mauris\n007:consectetur␣adipiscing␣elit.\n010:Vestibulum␣varius\n016:vel␣mauris\n003:consectetur␣adipiscing␣elit.

When we go to update or delete a record, we're currently updating that record in-place, which means we potentially have to move all of the data that comes after that record:

001:Lorem␣ipsum\n018:dolor␣sit\n005:adipiscing␣elit.␣vel␣mauris\n014:Vestibulum␣varius\n002:vel␣mauris\n007:consectetur␣adipiscing␣elit.\n010:Vestibulum␣varius\n016:vel␣mauris\n003:consectetur␣adipiscing␣elit.

In this case, updating the record 005 to "adipiscing␣elit.␣vel␣mauris" means moving all of the records that come after it by 11 bytes (the length of the added string "␣vel␣mauris"). This can quickly get really costly, especially as our database grows in size!
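
To make that concrete, here's a rough sketch of what an in-place update has to do with a plain file. There's no way to insert or remove bytes in the middle of a file, so we end up reading the data, editing the record, and writing everything back out:

import fs from "fs";

// A hypothetical in-place update: changing one record means rewriting
// everything from that record onward (here, the whole file for simplicity).
function updateInPlace(key, newValue) {
  const lines = fs.readFileSync("db.txt", "utf8").split("\n");
  const updated = lines.map((line) =>
    line.startsWith(`${key}:`) ? `${key}:${newValue}` : line
  );
  fs.writeFileSync("db.txt", updated.join("\n"));
}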


Append-Only Files

One way to work around the update problem is to make records immutable. In other words, we add the constraint that we can only add new records to the end of the file and never update or delete existing ones.

With this approach, updates are treated the same as inserts—just add a new record to the end of the file:

  1. $ db set 1 "Lorem ipsum"
  2. $ db set 18 "dolor sit"

db.txt

  • 001: Lorem ipsum
  • 018: dolor sit

But now we have another problem—there are duplicate keys in the file!

To work around this, we have to change our search algorithm to look for the last occurrence of the key instead of the first:

  1. $ db set 1 "Lorem ipsum"
  2. $ db set 18 "dolor sit"

db.txt

  • 001: Lorem ipsum
  • 018: dolor sit

To delete records, we create a special "tombstone" record that marks the key as deleted. There's no single way to do this, but one way is to use a special value like null:

  1. $ db set 1 "Lorem ipsum"
  2. $ db set 18 "dolor sit"

db.txt

  • 001: Lorem ipsum
  • 018: dolor sit
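
The read path changes in two small ways: we scan for the last occurrence of the key, and we treat the tombstone value as "not found". A minimal sketch, reusing the key:value format and the null tombstone from above:

import fs from "fs";

const TOMBSTONE = "null"; // special value that marks a key as deleted

// get: scan the whole file, remembering the *last* value we see for the key
function get(key) {
  const lines = fs.readFileSync("db.txt", "utf8").split("\n");
  let result;
  for (const line of lines) {
    const separator = line.indexOf(":");
    if (line.slice(0, separator) === String(key)) {
      result = line.slice(separator + 1); // later records win
    }
  }
  return result === TOMBSTONE ? undefined : result;
}

// delete: append a tombstone record instead of touching existing data
function del(key) {
  fs.appendFileSync("db.txt", `${key}:${TOMBSTONE}\n`);
}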

And there we have it! We have a key-value database that uses a file as its storage mechanism. Using it, we can store, find, update, and delete key-value pairs.


Now this implementation isn't perfect; right now, there are two major issues:

  1. The file can get very large. Since we're only appending to the file, the file will grow infinitely over time. Not good!

  2. Searching is slow. To search for a specific key, we have to potentially iterate through all records in the database. For a database with millions of records, this can take a while!

How can we fix these problems?


Keeping Files Small

Problem

How do we make sure the file doesn't grow indefinitely? Because we're using an append-only file, we need some mechanism to periodically "shrink" the file so it doesn't eventually take over our entire hard drive.

Take a look at our database here after a few updates and deletes:

  1. $ db set 1 "Lorem ipsum"
  2. $ db set 18 "dolor sit"
  3. $ db set 7 "adipiscing elit."
  4. $ db delete 7
  5. $ db set 10 "consectetur adipiscing elit."
  6. $ db delete 1

db.txt

  • 001: Lorem ipsum
  • 018: dolor sit
  • 007: adipiscing elit.
  • 007: null
  • 010: consectetur adipiscing elit.
  • 001: null

Our database file has six entries, but only two represent actual records—the rest are either deleted or contain stale data. If we can clear all the irrelevant data, we can shrink the file by over 66%!


Segments and Compaction

Here's an idea: once a file exceeds a certain size, we'll close the file and create a new one. While the new file ingests new data (in the same way we've been doing so far), we'll compact the old file by deleting all of its irrelevant data.

In other words, once a file has been closed, we stop adding new data to it.
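
A compaction pass over a closed segment might look something like this sketch, which keeps only the latest value for each key and drops tombstones (the file names are just placeholders):

import fs from "fs";

function compact(segmentFile, compactedFile) {
  const lines = fs.readFileSync(segmentFile, "utf8").split("\n").filter(Boolean);

  // Later entries overwrite earlier ones, leaving only the latest value per key.
  const latest = new Map();
  for (const line of lines) {
    const separator = line.indexOf(":");
    latest.set(line.slice(0, separator), line.slice(separator + 1));
  }

  // Drop deleted keys and write the survivors to a new, smaller file.
  const survivors = [...latest]
    .filter(([, value]) => value !== "null")
    .map(([key, value]) => `${key}:${value}`);

  fs.writeFileSync(compactedFile, survivors.join("\n") + "\n");
}

One subtlety: once we have multiple segments, a tombstone can only be thrown away when no older segment still holds a value for that key; otherwise the old value would reappear after compaction.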

Here, we've set the maximum file size to seven records. The segment below is already full, so adding one more record will close this file and start a new one:

  • 001: Lorem ipsum
  • 018: dolor sit
  • 007: adipiscing elit.
  • 007: null
  • 010: consectetur adipiscing elit.
  • 001: null
  • 020: Vestibulum varius

Now, our database consists of two different files which we'll call segments. Each segment will usually become a lot smaller after compaction, which means we can merge them together as part of the compaction process.

With that, we've made a mechanism to stop our database from growing indefinitely!


Your First Index

Our next problem is search performance:

Problem

How do we make searching fast? Right now, we have to iterate through all of the records in the database to find a specific key. This is super slow!

What if we use objects? That's right, these little guys:

JavaScript objects, otherwise known as hash tables or dictionaries, are really efficient at storing and looking up key-value pairs:

const hashTable = {
  hello: "world",
  foo: "bar",
  baz: "qux",
};

const value = hashTable["hello"]; // "world"

It doesn't matter how many records there are—the time it takes to look up and retrieve a value in a hash table is more or less constant. The catch is they must live in memory.


Aside: In-Memory vs. On-Disk

When you write a variable in your code, the computer will "remember" the value of that variable only for as long as the program is running.

script.js

let x = 1;

x = x + 1;
console.log(x); // 2

x = x + 1;
console.log(x); // 3

$ node script.js

This program will always print 2 and 3 because the value of x "resets" every time we run the program. This is because x is stored in-memory , and any value stored in memory is discarded when the program stops.

If we want our data to "stick" between runs, we'll need to store it on-disk—in other words, a file.

script.js

import fs from "fs";

// Read the previous value from disk, falling back to 1 on the first run.
let x = fs.existsSync("data.txt")
  ? Number(fs.readFileSync("data.txt", "utf8")) || 1
  : 1;

x = x + 1;
console.log(x);

x = x + 1;
console.log(x);

fs.writeFileSync("data.txt", String(x));

$ node script.js

This time, x will print 2 and 3 the first run, and 4 and 5 the second run.

The tradeoff to persistence is performance—accessing data from memory is about 80x faster on average than accessing it from disk.


Here's how the index will work. For every record that we have in our database, we'll store that record's offset—the number of bytes from the beginning of the file to the start of the record—in the index:

The second record, 18: dolor sit, for example, has an offset of 15 because:

  1. Each character is 1 byte large;

  2. The first record is 13 characters long (1:Lorem ipsum);

  3. The first record ends with a newline character, which is (at most) 2 bytes long;

This gives us an offset of 13 + 2 = 15.

One thing to note is that we need an index for each segment because the offset is relative to the start of the file—in other words, the start of each segment.
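
In code, maintaining the index just means recording where each record starts as we append it. A minimal sketch, assuming one index object per segment and one byte per character (as in the examples above):

import fs from "fs";

const index = {}; // one index per segment: key -> byte offset within the file

// set: append the record and remember where it starts
function set(segmentFile, key, value) {
  // The new record begins wherever the segment currently ends.
  const offset = fs.existsSync(segmentFile) ? fs.statSync(segmentFile).size : 0;
  fs.appendFileSync(segmentFile, `${key}:${value}\n`);
  index[key] = offset;
}

// get: jump straight to the record's offset instead of scanning the whole file
function get(segmentFile, key) {
  const offset = index[key];
  if (offset === undefined) return undefined;
  const fd = fs.openSync(segmentFile, "r");
  const buffer = Buffer.alloc(256); // assumes records fit in 256 bytes
  const bytesRead = fs.readSync(fd, buffer, 0, buffer.length, offset);
  fs.closeSync(fd);
  const line = buffer.toString("utf8", 0, bytesRead).split("\n")[0];
  return line.slice(line.indexOf(":") + 1);
}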


Searching With Indices

Using an index, our search algorithm can now run a lot more efficiently:

  1. Starting at the most recent segment, look up the key in the index;

  2. If the key is found, read the record at the offset;

  3. If the key is not found, move on to the next segment;

  4. Repeat (2) and (3) until the key is found or all segments have been searched.

Index (first segment)

  • 001: 0
  • 018: 15
  • 007: 29
  • 010: 49
  • 020: 71

Segment 1

  • 001: Lorem ipsum
  • 018: dolor sit
  • 007: adipiscing elit.
  • 010: consectetur elit.
  • 020: Vestibulum varius

Segment 2

  • 004: dolor sit amet
  • 006: adipiscing elit.
  • 012: consectetur elit.
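
Here's a sketch of that search algorithm, assuming we keep our segments in a list from newest to oldest, each paired with its own in-memory index (this particular layout is just an assumption for illustration):

import fs from "fs";

// segments: newest first, each shaped like { file, index }, where index maps
// a key to its byte offset in that segment's file.
function get(segments, key) {
  for (const { file, index } of segments) {
    const offset = index[key];
    if (offset !== undefined) {
      // Step 2: the key is in this segment's index, so read the record directly.
      const fd = fs.openSync(file, "r");
      const buffer = Buffer.alloc(256); // assumes records fit in 256 bytes
      const bytesRead = fs.readSync(fd, buffer, 0, buffer.length, offset);
      fs.closeSync(fd);
      const line = buffer.toString("utf8", 0, bytesRead).split("\n")[0];
      return line.slice(line.indexOf(":") + 1);
    }
    // Step 3: not in this segment, fall through to the next (older) one.
  }
  return undefined; // every segment searched, key not found
}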


Updating Indices

An index is only useful if it's in sync with our data. Whenever we update, delete, or insert a record, we have to change the index accordingly:

  • 001: Lorem ipsum
  • 018: dolor sit

Notice what this implies—writing to the database is slower with an index! This is one of the tradeoffs of using an index; we can search for data much faster at the cost of slower writes.


Tradeoffs

An index is great because it lets us query our database much faster, but there are some problems with our specific hash table implementation:

  1. Keys have to fit in memory. Since we're using an in-memory hash table as our index, all of the keys in our database must fit in memory. This means there's a limit on the number of keys we can store!

  2. Range queries are inefficient. Our index wouldn't help with range queries; if we wanted to find all the records with keys between 12 and 18, for example, we'd have to iterate through the entire database!


Sorted String Tables

Here's an idea: what if we ensure our database is always sorted by key? By sorting our data, we can immediately make range queries fast:

Unsorted

Find all values > 2 and < 6: we have to check every record, skipping any that are out of bounds (like 7).

Sorted

Find all values > 2 and < 6: we can skip everything below the lower bound (like 1) and stop as soon as we pass the upper bound.
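
For example, with records sorted by key, a range query (say, all records with keys greater than 2 and less than 6) can skip straight past the lower bound and stop as soon as it reaches the upper bound. A minimal in-memory sketch:

// records: [key, value] pairs sorted by key
function rangeQuery(records, low, high) {
  const results = [];
  for (const [key, value] of records) {
    if (key <= low) continue; // at or below the lower bound: keep skipping
    if (key >= high) break;   // at or past the upper bound: stop early
    results.push([key, value]);
  }
  return results;
}

const sorted = [[1, "a"], [2, "b"], [4, "c"], [5, "d"], [7, "e"]];
console.log(rangeQuery(sorted, 2, 6)); // [[4, "c"], [5, "d"]]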


Sparse Indices

One benefit of sorting our data is that we no longer need to store the offset of every record in memory.

Take a look at this database with four records. Since there's no logical order to the records, there's no way to determine where a record is without storing its key or searching through the entire database.

  • 001: Lorem ipsum
  • 007: amet, consectetur
  • 010: adipiscing elit.
  • 018: dolor sit

  • 001: 0
  • 007: 15
  • 010: 36
  • 018: 57

Knowing that 10 has an offset of 36 doesn't help us find where 18 is.


Now if these records were sorted, we could determine the location of each record using any of the keys in the index, even if it's not the key we're looking for.

Let's say our database is sorted but we only have the offset for the key 10:

  • 001: Lorem ipsum
  • 007: amet, consectetur
  • 010: adipiscing elit.
  • 018: dolor sit

Let's say we want to find the key 18. We know that 18 is greater than 10, which means it must be after 10 in the database. In other words, we can start searching for 18 from 10's offset—36 in this case.

  • 001: Lorem ipsum
  • 007: amet, consectetur
  • 010: adipiscing elit.
  • 018: dolor sit

While this is certainly slower than having the offset for 18 directly, it's still faster than looping through the database in its entirety.

The real unlock here lies in being able to control the trade-off between memory and performance: a denser index means faster lookups, but more memory usage.
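
Here's a sketch of a lookup with a sparse index: find the greatest indexed key at or below the one we want, then scan forward from its offset (this assumes records are sorted on disk and one byte per character):

import fs from "fs";

// sparseIndex maps only *some* keys to byte offsets, e.g. { 10: 36 }.
function get(file, sparseIndex, key) {
  // Find the closest indexed key at or before the one we're looking for.
  const startKey = Object.keys(sparseIndex)
    .map(Number)
    .filter((k) => k <= key)
    .sort((a, b) => b - a)[0];
  const startOffset = startKey === undefined ? 0 : sparseIndex[startKey];

  // Scan forward from that offset; because the file is sorted by key,
  // we can stop as soon as we pass the key we want.
  const lines = fs.readFileSync(file, "utf8").slice(startOffset).split("\n");
  for (const line of lines) {
    const separator = line.indexOf(":");
    const recordKey = Number(line.slice(0, separator));
    if (recordKey === key) return line.slice(separator + 1);
    if (recordKey > key) break;
  }
  return undefined;
}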


Sorting in Practice

Ensuring our database is always sorted is much easier said than done; by definition, sorting data requires moving around records as new ones get added—something that cannot be done efficiently when we're storing data on-disk. This brings us to our problem:

Problem

How do we keep our data sorted and append-only? It's too slow to sort the data on-disk every time we add a new record; is there another way?


The trick is to first sort the data in memory, and then write it to disk.

  1. When we add a new record, add it to a sorted in-memory list;

  2. When our in-memory list gets too large, we'll write it to disk;

  3. When we want to read a record, we'll read the in-memory list first, and then the disk if necessary.

Memory

  • 010: Lorem ipsum

The data structure used to store the in-memory list is usually one optimized for sorted data, like a balanced binary search tree or, more commonly, a skip list.


Of course, the main downside of having some of your data in-memory is that it's not persistent—if the program crashes or the computer shuts down, all of the data in the in-memory list is lost.

The fix here is thankfully pretty straightforward—every time we add a record to the list, we also write it to an append-only file on disk. This way, we have a backup in case a crash does happen (which it most certainly will).

Memory

  • 010: Lorem ipsum

On-Disk Log

  • 010: Lorem ipsum

The append-only file doesn't need to be sorted, nor does it need to contain every record in the database; it only needs the ones that are currently in memory.
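
Putting those two ideas together, a write now goes to both places. A minimal sketch, using a plain sorted array as the in-memory list for simplicity (a real implementation would reach for a balanced tree or skip list), with log.txt standing in for the append-only file:

import fs from "fs";

let memtable = []; // [key, value] pairs, kept sorted by key

function set(key, value) {
  // 1. Record the write in the append-only log first, so a crash can't lose it.
  fs.appendFileSync("log.txt", `${key}:${value}\n`);

  // 2. Insert into the sorted in-memory list, replacing any existing entry.
  memtable = memtable.filter(([k]) => k !== key);
  memtable.push([key, value]);
  memtable.sort(([a], [b]) => a - b);

  // 3. Once the memtable grows past some size threshold, flush it to a sorted
  //    segment file on disk (sketched in the recap below).
}

If the program crashes before the in-memory list is flushed, we can rebuild it on startup by replaying log.txt from the beginning.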


With that, we have our very own key-value database! Let's recap how it works.

Our database starts out empty. When we go to add a new record, we'll add it to a sorted in-memory list, keeping a copy in an append-only file in case of crashes.

Memory

  • 010: Lorem ipsum

On-Disk Log

  • 010: Lorem ipsum

When the in-memory list gets too large, we'll flush the list by writing all of the records to a file in sorted order. In the process, we'll keep note of each record's offset in an index so we can efficiently look them up later.

Memory

  • 004: dolor sit
  • 010: Lorem ipsum
  • 012: consectetur elit.

On-Disk Log

  • 010: Lorem ipsum
  • 004: dolor sit
  • 012: consectetur elit.
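
The flush step itself might look like this sketch: write the records out in sorted order (the memtable already is), note each record's offset in a fresh index, and reset the log now that everything is safely on disk (file names are placeholders again):

import fs from "fs";

function flush(memtable, segmentFile) {
  const index = {}; // key -> byte offset within the new segment
  let contents = "";

  // The memtable is already sorted, so write it out in order,
  // noting where each record starts as we go.
  for (const [key, value] of memtable) {
    index[key] = contents.length; // assumes 1 byte per character
    contents += `${key}:${value}\n`;
  }
  fs.writeFileSync(segmentFile, contents);

  // Everything in the memtable is now safely on disk, so the log can be reset.
  fs.writeFileSync("log.txt", "");

  return index; // kept in memory so this segment can be searched later
}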

When we want to look up a record, we'll first check the in-memory list. If the record isn't there, we'll check the index to see if it's in the on-disk file.

Memory

  • 011: amet, consectetur
  • 020: Vestibulum varius

On-Disk Database

  • 004: dolor sit
  • 010: Lorem ipsum
  • 012: consectetur elit.

Once a file is saved to disk, it's considered immutable, which means we can only ever read from the file and never update it. To work around this, we'll treat updates and deletes the same as inserting new records—add them to the in-memory list.

Memory

  • 004: dolor sit
  • 010: Lorem ipsum
  • 012: consectetur elit.

Treating updates and deletes as new records means our file will only ever grow larger. To prevent this, we'll occasionally compact the on-disk files by deleting all duplicate records.

  • 001: Lorem ipsum
  • 018: dolor sit
  • 007: adipiscing elit.
  • 007: null
  • 010: consectetur adipiscing elit.
  • 001: null
  • 020: Vestibulum varius


LSM Trees

What we just built actually exists in the real world—it's called an LSM tree, or Log-Structured Merge Tree.

An LSM tree works by combining an in-memory list (often called a memtable) with an on-disk file (typically called a sorted string table or SST) to create a really fast key-value database.

LSM trees are the underlying data structure used for large-scale key-value databases like Google's LevelDB and Amazon's DynamoDB, and they have proven to perform really well at scale—on Prime Day 2020, DynamoDB peaked at 80 million requests per second!

Now, LSM trees aren't perfect, and they're certainly not the only way to structure a database. In fact, relational databases like PostgreSQL or MySQL use a completely different structure called a B-Tree to store their data—but that's a deep dive for another post.

For now, thanks for reading!
