数据库备份、转储文件和 restic

数据库备份、转储文件和 restic
Database backups, dump files and restic

原始链接: https://strugglers.net/posts/2025/database-backups-dump-files-and-restic/

## 重新思考数据库备份：使用Restic 本文详细介绍了数据库备份策略的转变，从传统的中间转储文件转向一种更高效的系统，利用restic备份程序。过去，作者使用`mysqldump`创建每日压缩的SQL文件，然后备份这些文件——这种方法容易造成冗余存储，即使数据库没有修改也会产生不必要的更改。新的方法利用restic的标准输入备份模式。这消除了数据库主机上的额外存储空间，受益于restic内置的去重、压缩和加密功能，并避免了文件元数据更改触发完整备份的问题。关键在于使用`restic backup --stdin-from-command -- mysqldump…`以确保适当的错误处理，并使用`--stdin-filename`指定文件名。主要经验包括标记备份以进行组织和保留策略的重要性，以及`gzip --rsyncable`在源数据稳定时最大限度地减少更改的惊人有效性。作者发现restic可以有效地去重`gzip --rsyncable`备份，而使用`xz`压缩的备份则显示出显著的每日变化。最终，这种新方法可以产生更小、更高效的备份，并更清晰地了解数据更改。

这个Hacker News讨论围绕高效的数据库备份策略。发帖者链接了一篇文章关于数据库备份和restic。用户分享了经验，强调了替代方案和优化方法。许多人推荐专门的数据库备份工具，如Percona XtraBackup (MySQL) 和 PgBackRest (Postgres)。其他人发现 **borg** 由于其去重功能而成功，尤其是在数据库拥有大量静态历史数据的情况下，结合像S3 Intelligent Tiering这样具有成本效益的云存储。 **Kopia** 到 Backblaze B2也被提及为大型远程共享的快速选项。一个关键的收获是根据*数据本身*调整备份方法。一位用户甚至成功地通过每日CSV导出将SQLite数据库备份到Git，强调了人类可读备份在更简单差异化和恢复方面的优势。最后，分享了一个申请YC Winter 2026批次的链接。

原文

In the previous article about rethinking my backups I had a TODO item regarding moving away from using intermediary dumps of database content. Here's some notes about that.

The old way

What I used to do in order to back up some MariaDB databases for example was to have a script something like this called regularly:


set -euf
set -o pipefail

umask 0066

/usr/bin/mysqldump \
    --defaults-extra-file=/etc/mysql/backup_credentials.cnf \
    --single-transaction \
    --databases mysql dss_wp dev_dss_wp \
    | /bin/gzip --best --rsyncable -c \
    > /srv/backup/mariadb/all.sql.gz.new \
    && mv /srv/backup/mariadb/all.sql.gz.new /srv/backup/mariadb/all.sql.gz

So, every day the databases get dumped out to /srv/backup/mariadb/all.sql.gz and then at some point that day the backup system picks that file up.

Not ideal

That worked but has a few downsides.

Redundant data storage

The data that's in the database also ends up on disk again, although in a quite well compressed form.

Constant change

Even if nothing in the database has changed, the dump file will always change.

gzip and many other compression tools are (or can be set to be) deterministic, in that they will always produce the same output for a given input, so it wasn't necessarily that. More that the metadata of the file such as the inode and modification time would change, and that would be enough for rsnapshot to store an entire extra copy.

There's various things that could be done to mitigate this but I never felt like there was much point in spending time making the "no changes at all" case highly efficient because there usually was some change in the data, even if it was small.

One of the mitigations I used was to switch to btrfs for the backup repository and use reflinks. The --rsyncable flag to gzip then did help a little. The gzip manual explains:

Cater better to the rsync program by periodically resetting the internal structure of the compressed data stream. This lets the rsync program take advantage of similarities in the uncompressed input when synchronizing two files compressed with this flag. The cost: the compressed output is usually about one percent larger.

I figured that if it helped for rsync then use of that flag should help in minimising changes in the compressed file generally. More on that later.

The new way

Since I switched to using restic, I noted the recommendation to use its standard input backup mode for things like this. This would address the above shortcomings as:

It doesn't store anything extra on the database host, and;
it deduplicates, compresses and encrypts the data itself anyway.

The replacement mysqldump script now looks a bit like:


do_dump() {
    local dbnames=("$@")
        local extra_tags=( "${dbnames[@]/#/db_}" )
        local extra_tags_string=$(IFS=,; printf "%s" "${dbnames[*]}")

    local ignored_tables=(rt5.sessions)

    printf "Starting mysqldump / backup for DBs: %s…\n" "${dbnames[*]}"
    printf "  Ignoring tables: %s\n" "${ignored_tables[*]}"

            /usr/local/sbin/restic \
        --retry-lock 1h \
        backup \
        --no-scan \
        --group-by host,tags \
        --tag "db,db_mariadb,${extra_tags_string}" \
        --stdin-filename "mariadb.sql" \
        --stdin-from-command -- \
        /usr/bin/mysqldump \
        --defaults-extra-file=/etc/mysql/backup_credentials.cnf \
        --default-character-set=utf8mb4 \
        --skip-dump-date \
        --databases "${dbnames[@]}" \
        --ignore-table "${ignored_tables[@]}" \
        --single-transaction

    printf "Finished mysqldump / backup for DBs: %s\n" "${dbnames[*]}"
}

do_dump rt5

Learnings

I have now converted all of my database backup scripts to this way of doing things and discovered a few things along the way.

Avoid pipe confusion with `--stdin-from-command`

Not much of a "discovery" since it's right there in the documentation, but I did do a web search for other people's scripts for database backups using restic and oh boy let me tell you, many of them are still doing the equivalent of:

mysqldump … | restic backup --stdin

That will work most of the time, but does have some caveats regarding the exit code of piped commands. If not careful you can end up making empty backups of a failed mysqldump and not noticing.

You can try using set -o pioefail and/or you can use bash's PIPESTATUS array to examine the exit code of any part of a pipeline. But really, it is much easier to sidestep the issue by not using a pipe at all:

restic backup … --stdin-from-command -- mysqldump …

In that form restic will fail if the command it's executing fails.

Specifying the filename

Naturally, stdin doesn't have a filename. You can get around this with the -stdin-filename flag. If you use --stdin-filename foo then whatever your command outputs will appear as the file /foo inside the backup snapsshot.

NOTE

There's a bug in the currently-released version of restic where it doesn't allow the / character anywhere in the filename, so you can't fake the existence of subdirectories. It doesn't really matter since there's only one file and if you did restore it to the filesystem it would be in a relative path anyway.

The bug has been fixed but the fix isn't in a release yet at the time I write this.

Use tags to differentiate and group backups

It's a good idea to tag these backups in some way.

With restic, backups are by default grouped by host and the set of paths that were specified to backup. I found however that a backup using --stdin-from-command has an empty set of paths for grouping purposes even though --stdin-filename is used. I don't know if this is a bug.

The consequence here is that if you have multiple of these types of backups for a single host, by default restic will use the most recent one as the parent for the current one even if it has a different command and/or --stdin-filename. This doesn't cause too many problems, it's just a bit confusing and will result in a later diff command showing one database dump file being removed and the other added, every time.

It can easily be avoided though by setting --group-by to include tags and making sure different database backups are tagged differently. It is probably a good idea to use some tags anyway so you can programmatically identify what the backups are. This will be useful later if for example you want to have different retention periods for different kinds of data, or for different databases.

I tag my general host backups as auto, and all the database backups as db. Then, there is db_mariadb, db_postgresql and db_sqlite for backups that have come from MariaDB, PostgreSQL and SQLite respectively. Finally I am also adding a tag for each named database.

I think I probably will want to retain most databases for the same time period as general host backups, but I know there are a few less important databases that I will retain for a shorter time. Having those tagged will be helpful for writing the forget policy later.

I learned about `mysqldump --skip-dump-date`

Using --skip-dump-date removes some timestamps from comments which helps to reduce churn.

It really is storing less churn

Having had this running for a few days now I can see it really is storing less of a delta. In the case where I do have databases that haven't changed at all, the snapshot ends up just being a couple of hundred bytes which I assume is just metadata.

You can tell restic not to store a backup with no changes at all, but I like doing so as a record of a successful but unchanged backup.

That `--rsyncable` really does work

My old mysqldump scripts all used gzip --rsyncable but at some point it seems I decided that better compression was more important, so some of them ended up using xz.

I never really examined in detail what the churn was like because rsnapshot made that quite awkward to do, especially after I switched it to using reflinks. I have been able to look at it now though, because I've been doing restic backups for some time before adjusting those mysqldump scripts.

What I can tell you is that restic is able to effectively deduplicate a database backup file made with gzip --rsyncable whereas the ones that are compressed with xz show huge amounts of daily churn even when the database had little.

My conclusions:

gzip --rsyncable really does work for minimising changes if the source file doesn't change much.
zstd now has a similar --rsyncable flag.
xz was a bad choice
If you don't want to do all this --stdin-from-command malarkey or can't because you're doing backups another way, --rsyncable is well worth using. It's nearly as good as just letting restic deduplicate the raw SQL.