Independently verifying Go's reproducible builds

Original link: https://www.agwa.name/blog/post/verifying_go_reproducible_builds

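Go 1.21 introduced a feature that automatically downloads and uses a newer toolchain when a module requires one, which makes new Go features easier to adopt. This raised security concerns about supply chain attacks delivered through malicious binaries. To address them, the Go project made its toolchains reproducible from source (a build always produces the same output) and publishes the checksum of every toolchain in the public Go Checksum Database. Anyone can therefore verify that a downloaded binary matches what a build from source produces. Recognizing the need for independent verification, the post's author extended Source Spotter, an auditor for the Checksum Database that the author has operated since 2020, so that it also builds each toolchain from source and compares the resulting checksum against the database. As of publication, Source Spotter had reproduced every toolchain since Go 1.21.0, 2,672 in total. The author bootstrapped the process with an older toolchain built from source and cross-checked across several environments; for Go 1.24 and later, Source Spotter uses binaries it has previously verified. A few problems came up (for example, signatures that must be stripped from macOS toolchains, and a GOARM inconsistency in one linux-arm build), but so far Source Spotter has confirmed Go's reproducibility claims, which strengthens the security of the Go ecosystem. The project is still evolving, and building from Go's Git repository is being explored to make the source code itself easier to audit.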

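## Go reproducible builds: a Hacker News discussion

A Hacker News thread highlighted the importance of independently verifying Go's reproducible builds. One user shared their experience of compiling Go themselves for ten years, which prompted questions about potential undetected "Trusting Trust" style vulnerabilities. The thread turned into a debate about the value of reproducible builds. Some argued they are essential for supply chain security, because they allow code to be verified *without* trusting the entire distribution process. Others saw them as a misallocation of resources that focuses on theoretical rather than practical threats. The recent xz supply chain attack was brought up, and opinions differed on whether reproducible builds could have stopped it: some said no, because the flaw was in the input code itself, while others thought they could have detected the modified tarball. In the end, the conversation stressed the need for independent verification, even when a perfect byte-for-byte match is not always achievable, and underlined the ongoing risk of supply chain attacks.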

Original text

When you try to compile a Go module that requires a newer version of the Go toolchain than the one you have installed, the go command automatically downloads the newer toolchain and uses it for compiling the module. (And only that module; your system's go installation is not replaced.) This useful feature was introduced in Go 1.21 and has let me quickly adopt new Go features in my open source projects without inconveniencing people with older versions of Go.

However, the idea of downloading a binary and executing it on demand makes a lot of people uncomfortable. It feels like such an easy vector for a supply chain attack, where Google, or an attacker who has compromised Google or gotten a misissued SSL certificate, could deliver a malicious binary. Many developers are more comfortable getting Go from their Linux distribution, or compiling it from source themselves.

To address these concerns, the Go project did two things:

1. They made it so every version of Go starting with 1.21 could be easily reproduced from its source code. Every time you compile a Go toolchain, it produces the exact same Zip archive, byte-for-byte, regardless of the current time, your operating system, your architecture, or other aspects of your environment (such as the directory from which you run the build).
2. They started publishing the checksum of every toolchain Zip archive in a public transparency log called the Go Checksum Database. The go command verifies that the checksum of a downloaded toolchain is published in the Checksum Database for anyone to see.

These measures mean that:

- You can be confident that the binaries downloaded and executed by the go command are the exact same binaries you would have gotten had you built the toolchain from source yourself. If there's a backdoor, the backdoor has to be in the source code.
- You can be confident that the binaries downloaded and executed by the go command are the same binaries that everyone else is downloading. If there's a backdoor, it has to be served to the whole world, making it easier to detect.

But these measures mean nothing if no one is checking that the binaries are reproducible, or that the Checksum Database isn't presenting inconsistent information to different clients. Although Google checks reproducibility and publishes a report, this doesn't help if you think Google might try to slip in a backdoor themselves. There needs to be an independent third party doing the checks.

Why not me? I was involved in Debian's Reproducible Builds project back in the day and developed some of the core tooling used to make Debian packages reproducible (strip-nondeterminism and disorderfs). I also have extensive experience monitoring Certificate Transparency logs and have detected misbehavior by numerous logs since 2017. And I do not work for Google (though I have eaten their food).

In fact, I've been quietly operating an auditor for the Go Checksum Database since 2020 called Source Spotter (à la Cert Spotter). Source Spotter monitors the Checksum Database, making sure it doesn't present inconsistent information or publish more than one checksum for a given module and version. I decided to extend Source Spotter to also verify toolchain reproducibility.
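The "no more than one checksum per module and version" check can be illustrated with a minimal Go sketch. This is not Source Spotter's code: the record type, the findConflicts helper, and the hash strings are all hypothetical placeholders, and a real auditor would also verify the log's consistency proofs, which are not shown here.

```go
package main

import "fmt"

// record is a hypothetical, simplified view of one log entry.
type record struct {
	Module  string
	Version string
	Hash    string
}

// findConflicts reports every module version that appears in the
// records with two different hashes. Exact repeats are not conflicts.
func findConflicts(records []record) []string {
	seen := make(map[string]string) // "module@version" -> first hash observed
	var conflicts []string
	for _, r := range records {
		key := r.Module + "@" + r.Version
		prev, ok := seen[key]
		if !ok {
			seen[key] = r.Hash
			continue
		}
		if prev != r.Hash {
			conflicts = append(conflicts, fmt.Sprintf("%s: %s vs %s", key, prev, r.Hash))
		}
	}
	return conflicts
}

func main() {
	// The hash values below are made-up placeholders, not real checksums.
	records := []record{
		{"src.agwa.name/snid", "v0.4.0", "hash-A"},
		{"src.agwa.name/snid", "v0.4.0", "hash-A"},
		{"golang.org/toolchain", "v0.0.1-go1.24.2.linux-amd64", "hash-B"},
		{"golang.org/toolchain", "v0.0.1-go1.24.2.linux-amd64", "hash-C"},
	}
	for _, c := range findConflicts(records) {
		fmt.Println("conflict:", c)
	}
}
```

Keying on the module and version together means a repeated, identical record is harmless, while any second hash for the same key is flagged.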

The Checksum Database was originally intended for recording the checksums of Go modules. Essentially, it's a verifiable, append-only log of records which say that a particular version (e.g. v0.4.0) of a module (e.g. src.agwa.name/snid) has a particular SHA-256 hash. Go repurposed it for recording toolchain checksums. Toolchain records have the pseudo-module golang.org/toolchain and versions that look like v0.0.1-goVERSION.GOOS-GOARCH. For example, the Go 1.24.2 toolchain for linux/amd64 has the module version v0.0.1-go1.24.2.linux-amd64.
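To make the version format concrete, here is a small Go sketch that builds and parses toolchain module versions. The function names are hypothetical, and the parser assumes that GOOS contains neither a dot nor a hyphen, which holds for the example above but is not checked against every platform.

```go
package main

import (
	"fmt"
	"strings"
)

// toolchainVersion builds the pseudo-module version for a Go release
// such as "1.24.2" on a given GOOS/GOARCH pair.
func toolchainVersion(goVersion, goos, goarch string) string {
	return fmt.Sprintf("v0.0.1-go%s.%s-%s", goVersion, goos, goarch)
}

// parseToolchainVersion splits a pseudo-module version back into its parts.
func parseToolchainVersion(v string) (goVersion, goos, goarch string, err error) {
	rest, ok := strings.CutPrefix(v, "v0.0.1-go")
	if !ok {
		return "", "", "", fmt.Errorf("not a toolchain version: %q", v)
	}
	// Go versions contain dots, so the platform is whatever follows the last dot.
	dot := strings.LastIndex(rest, ".")
	if dot < 0 {
		return "", "", "", fmt.Errorf("missing platform in %q", v)
	}
	goVersion = rest[:dot]
	goos, goarch, ok = strings.Cut(rest[dot+1:], "-")
	if !ok || goVersion == "" || goos == "" || goarch == "" {
		return "", "", "", fmt.Errorf("malformed toolchain version: %q", v)
	}
	return goVersion, goos, goarch, nil
}

func main() {
	v := toolchainVersion("1.24.2", "linux", "amd64")
	fmt.Println(v) // v0.0.1-go1.24.2.linux-amd64

	goVersion, goos, goarch, err := parseToolchainVersion(v)
	fmt.Println(goVersion, goos, goarch, err)
}
```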

When Source Spotter sees a new version of the golang.org/toolchain pseudo-module, it downloads the corresponding source code, builds it in an AWS Lambda function by running make.bash -distpack, and compares the checksum of the resulting Zip file to the checksum published in the Checksum Database. Any mismatches are published on a webpage and in an Atom feed which I monitor.

So far, Source Spotter has successfully reproduced every toolchain since Go 1.21.0, for every architecture and operating system. As of publication time, that's 2,672 toolchains!

Bootstrap Toolchains

Since the Go toolchain is written in Go, building it requires an earlier version of the Go toolchain to be installed already.

When reproducing Go 1.21, 1.22, and 1.23, Source Spotter uses a Go 1.20.14 toolchain that I built from source. I started by building Go 1.4.3 using a C compiler. I used Go 1.4.3 to build Go 1.17.13, which I used to build Go 1.20.14. To mitigate Trusting Trust attacks, I repeated this process on both Debian and Amazon Linux using both GCC and Clang for the Go 1.4 build. I got the exact same bytes every time, which I believe makes a compiler backdoor vanishingly unlikely. The scripts I used for this are open source.
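The cross-environment check described above amounts to confirming that every build produced the same bytes. A minimal Go sketch of that comparison, assuming the build outputs are already available in memory (the environment labels and the groupByDigest helper are hypothetical, and this is not taken from the author's scripts):

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"sort"
)

// groupByDigest groups build environments by the SHA-256 digest of
// their output. A single group means every build was byte-identical.
func groupByDigest(outputs map[string][]byte) map[[32]byte][]string {
	groups := make(map[[32]byte][]string)
	for env, data := range outputs {
		sum := sha256.Sum256(data)
		groups[sum] = append(groups[sum], env)
	}
	for _, envs := range groups {
		sort.Strings(envs) // stable order for reporting
	}
	return groups
}

func main() {
	// Placeholder bytes standing in for the four toolchain builds.
	out := []byte("go1.20.14 toolchain bytes")
	outputs := map[string][]byte{
		"debian+gcc":        out,
		"debian+clang":      out,
		"amazonlinux+gcc":   out,
		"amazonlinux+clang": out,
	}

	groups := groupByDigest(outputs)
	if len(groups) == 1 {
		fmt.Println("all builds are byte-identical")
		return
	}
	for digest, envs := range groups {
		fmt.Printf("%x: %v\n", digest[:8], envs)
	}
}
```

If the builds ever diverged, the grouping would show which environments agree with each other, which is a useful starting point for finding the odd one out.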

When reproducing Go 1.24 or higher, Source Spotter uses a binary toolchain downloaded from the Go module proxy that it previously verified as being reproducible from source.

Problems Encountered

Compared to reproducing a typical Debian package, it was really easy to reproduce the same bytes when building the Go toolchains. Nevertheless, there were some bumps along the way:

First, the Darwin (macOS) toolchains published by Google contain signatures produced by Google's private key. Obviously, Source Spotter can't reproduce these. Instead, Source Spotter has to download the toolchain (making sure it matches the checksum published in the Checksum Database) and strip the signatures to produce a new checksum that is verified against the reproduced toolchain. I reused code written by Google to strip the signatures and I honestly have no clue what it's doing and whether it could potentially strip a backdoor. A review from someone versed in Darwin binaries would be very helpful!

Second, to reproduce the linux-arm toolchains, Source Spotter has to set GOARM=6 in the environment... except when reproducing Go 1.21.0, which Google accidentally built using GOARM=7. I don't understand why cmd/dist (the tool used to build the toolchain) doesn't set this environment variable along with the many other environment variables it sets.
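The GOARM rule above is easy to express as code. The following Go sketch is illustrative only: goarmFor and buildEnv are hypothetical helpers, and the environment they return contains just the variables discussed here, not everything a real build would need.

```go
package main

import "fmt"

// goarmFor returns the GOARM value needed to reproduce the published
// linux-arm toolchain for a given Go version, per the rule above.
func goarmFor(goVersion string) string {
	if goVersion == "1.21.0" {
		return "7" // the one release that was built with GOARM=7
	}
	return "6"
}

// buildEnv returns the target-related environment entries for a build.
// GOARM is only added for linux/arm targets.
func buildEnv(goVersion, goos, goarch string) []string {
	env := []string{"GOOS=" + goos, "GOARCH=" + goarch}
	if goos == "linux" && goarch == "arm" {
		env = append(env, "GOARM="+goarmFor(goVersion))
	}
	return env
}

func main() {
	fmt.Println(buildEnv("1.21.0", "linux", "arm"))
	fmt.Println(buildEnv("1.24.2", "linux", "arm"))
	fmt.Println(buildEnv("1.24.2", "linux", "amd64"))
}
```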

Finally, the Checksum Database contains a toolchain for Go 1.9.2rc2, which is not a valid version number. It turns out this version was released by mistake. To avoid raising an error for an invalid version number, Source Spotter has to special case it. Not a huge deal, but I found it interesting because it demonstrates one of the downsides of transparency logs: you can't fix or remove entries that were added by mistake!
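One way to handle such an entry is an explicit allowlist that is consulted before normal validation. The Go sketch below is a hedged illustration: the regular expression is a deliberately simplified assumption about what a valid Go version looks like, and classify and knownMistakes are hypothetical names rather than anything from Source Spotter.

```go
package main

import (
	"fmt"
	"regexp"
)

// validVersion is a simplified assumption: "1.N", "1.N.P", "1.NrcK",
// or "1.NbetaK". It is not the Go project's official grammar.
var validVersion = regexp.MustCompile(`^1\.[0-9]+(\.[0-9]+|(rc|beta)[0-9]+)?$`)

// knownMistakes lists invalid versions that are permanently in the log
// and should be skipped quietly instead of raising an error.
var knownMistakes = map[string]bool{
	"1.9.2rc2": true,
}

// classify returns "known-mistake", "valid", or "invalid".
func classify(goVersion string) string {
	switch {
	case knownMistakes[goVersion]:
		return "known-mistake"
	case validVersion.MatchString(goVersion):
		return "valid"
	default:
		return "invalid"
	}
}

func main() {
	for _, v := range []string{"1.21.0", "1.24rc1", "1.9.2rc2", "1.9.3rc1"} {
		fmt.Println(v, classify(v))
	}
}
```

Checking the allowlist first keeps the validation rule strict, so any other malformed version still surfaces as an error.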

Source Code Transparency

Although the toolchain binaries are published in the Checksum Database, the source code is not. This means Google could serve Source Spotter, and only Source Spotter, source code which contains a backdoor. To mitigate this, Source Spotter publishes the checksums of every source tarball it builds.

Filippo suggested that Source Spotter build from Go's Git repository and publish the Git commit IDs instead, since lots of Go developers have the Go Git repository checked out and it would be relatively easy for them to compare the state of their repos against what Source Spotter has seen. Regrettably, Git commit IDs are SHA-1, but this is mitigated by Git's use of Marc Stevens' collision detection, so the benefits may be worth the risk. I think building from Git is a good idea, and to bootstrap it, Filippo used Magic Wormhole to send me the output of git show-ref --tags from his repo while we were both at the Transparency.dev Summit last week.
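Comparing two repositories' tags comes down to parsing the output of git show-ref --tags, which prints one "commit-id ref-name" pair per line, and looking for tags that point at different IDs. A minimal Go sketch, with hypothetical helper names and made-up placeholder IDs rather than real Go commits:

```go
package main

import (
	"bufio"
	"fmt"
	"sort"
	"strings"
)

// parseShowRef parses `git show-ref --tags` output into ref name -> ID.
func parseShowRef(output string) (map[string]string, error) {
	refs := make(map[string]string)
	sc := bufio.NewScanner(strings.NewReader(output))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" {
			continue
		}
		fields := strings.Fields(line)
		if len(fields) != 2 {
			return nil, fmt.Errorf("malformed line %q", line)
		}
		refs[fields[1]] = fields[0]
	}
	return refs, sc.Err()
}

// mismatchedRefs returns refs present in both maps with different IDs.
// Refs missing from one side are ignored; they may just not be fetched.
func mismatchedRefs(mine, theirs map[string]string) []string {
	var diffs []string
	for ref, id := range theirs {
		if got, ok := mine[ref]; ok && got != id {
			diffs = append(diffs, ref)
		}
	}
	sort.Strings(diffs)
	return diffs
}

func main() {
	// Placeholder IDs for illustration only.
	mine, _ := parseShowRef("1111 refs/tags/go1.24.1\n2222 refs/tags/go1.24.2\n")
	theirs, _ := parseShowRef("1111 refs/tags/go1.24.1\n3333 refs/tags/go1.24.2\n")
	fmt.Println(mismatchedRefs(mine, theirs)) // [refs/tags/go1.24.2]
}
```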

Ultimately, I would like to see the Go project publish source tarballs in the Checksum Database.

Conclusion

Thanks to Go's Checksum Database and reproducible toolchains, Go developers get the usability benefits of a centralized package repository and binary toolchains without sacrificing the security benefits of decentralized packages and building from source. The Go team deserves enormous credit for making this a reality, particularly for building a system that is not too hard for a third party to verify. They've raised the bar, and I hope other language and package ecosystems can learn from what they've done.

Learn more by visiting the Source Spotter website or the GitHub repo.