Querying Terabytes of Data in the Browser with DuckDB-WASM

Original link: https://lil.law.harvard.edu/blog/2025/10/24/rethinking-data-discovery-for-libraries-and-digital-humanities/

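The Library Innovation Lab (LIL) recently launched Data.gov Archive Search, a project that tackles the traditional trade-off between accessible data discovery and the high cost of maintaining robust search infrastructure. Historically, rich data browsing and filtering have required expensive servers and dedicated staff, which often left projects unsustainable. Static file hosting costs less but limits discoverability. LIL chose a new approach that draws on recent advances in client-side data analysis. The catalog metadata for the nearly 18 TB Data.gov Archive (about 1 GB on its own) is stored as Parquet files on static hosting, and a database engine (DuckDB-Wasm) runs *inside the user's browser*. This enables dynamic, scalable search and filtering without a dedicated server: the browser retrieves and processes only the data it needs. The model can lower operating costs, reduce technical overhead, and improve long-term accessibility for libraries, digital humanities projects, and archives. Performance optimization is still under way, and LIL encourages others to try the approach and share their experience to foster broader adoption and collaboration. You can reach them at [email protected].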

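## DuckDB-WASM: Querying Big Data Directly in the Browser

A recent Hacker News thread highlighted a compelling approach to data querying: using DuckDB-WASM to query terabytes of data stored on services such as S3 *entirely within the web browser, with no backend server*. This works by combining cheap S3 storage, DuckDB's ability to use S3 as storage, and WebAssembly (WASM), which runs the database code on the client. Powerful as it is, the approach has trade-offs. S3 bandwidth costs can be high, and commenters are exploring alternatives such as Cloudflare R2, which charges no egress fees. Users also discussed the trade-offs between client-side and server-side processing, and the importance of weighing the size of the user base against the data volume. Some users reported problems with DuckDB's memory and thread management that led to crashes, while others had success pairing it with tools such as systemd-run to enforce resource limits. Despite these concerns, the prospect of cost-effective, serverless data analysis directly in the browser is generating excitement.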

Original text

Image: Woman using a Macey vertical filing cabinet (detail, from a 1903 catalogue). Source: Wikimedia Commons.

As part of our Public Data Project, LIL recently launched Data.gov Archive Search. In this post, we look under the hood and reflect on how and why we built this project the way we did.

Rethinking the Old Trade-Off: Cost, Complexity, and Access

Libraries, digital humanities projects, and cultural heritage organizations have long had to perform a balancing act when sharing their collections online, negotiating between access and affordability. Providing robust features for data discovery, such as browsing, filtering, and search, has traditionally required dedicated computing infrastructure such as servers and databases. Ongoing server hosting, regular security and software updates, and consistent operational oversight are expensive and require skilled staff. Over years or decades, budget changes and staff turnover often strand these projects in an unmaintained or nonfunctioning state.

The alternative, static file hosting, requires minimal maintenance and reduces expenses dramatically. For example, storing gigabytes of data on Amazon S3 may cost $1/month or less. However, static hosting often diminishes the capacity for rich data discovery. Without a dynamic computing layer between the user’s web browser and the source files, data access may be restricted to brittle pre-rendered browsing hierarchies or search functionality that is impeded by client memory limits. Under such barriers, the collection’s discoverability suffers.

For years, online collection discovery has been stuck between a rock and a hard place: accept the complexity and expense required for a good user experience, or opt for simplicity and leave users to contend with the blunt limitations of a static discovery layer.

Why We Explored a New Approach

When LIL began thinking about how to provide discovery for the Data.gov Archive, we decided that building a lightweight and easily maintained access point from the beginning would be worth our team’s effort. We wanted to provide low-effort discovery with minimal impact on our resources. We also wanted to ensure that whatever path we chose would encourage, rather than impede, long-term access.

This approach builds on our recent experience when the Caselaw Access Project (CAP) hit a transition moment. At that time, we elected to switch case.law to a static site and to partner with others dedicated to open legal data to provide more feature-rich access.

CAP includes some 11 TB of data; the Data.gov Archive represents nearly 18 TB, with the catalog metadata alone accounting for about 1 GB. Manually browsing the archive data in its repository, even for a user who knows what she’s looking for, is laborious and time-consuming. Thus we faced a challenge. Could we enable dynamic, scalable discovery of the Data.gov Archive while enjoying the frugality, simplicity, and maintainability of static hosting?

Our Experiment: Rich Discovery, No Server Required

Recent advancements in client-side data analysis led us to try something new. Tools like DuckDB-Wasm, sql.js-httpvfs, and Protomaps, powered by standards such as WebAssembly, web workers, and HTTP range requests, allow users to efficiently query large remote datasets in the browser. Rather than downloading a 2 GB data file into memory, these tools can incrementally retrieve only the relevant parts of the file and process query results locally.
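
To make the idea of incremental retrieval concrete, here is a minimal TypeScript sketch of how a client might read only the footer metadata of a remote Parquet file with HTTP range requests. It relies on the Parquet file layout, which ends with a 4-byte little-endian footer length followed by the magic bytes `PAR1`. The function names (`parseParquetTail`, `tailRange`, `footerRange`, `readFooterBytes`) are hypothetical, and the sketch assumes the host reports `Content-Length`, honors `Range` headers, and permits them cross-origin. It is not how DuckDB-Wasm or the other tools named above are implemented internally.

```typescript
// Sketch: fetch only the footer of a remote Parquet file via HTTP range requests.
// All function names here are illustrative, not part of any library.

/** Parse the 8-byte Parquet tail: 4-byte little-endian footer length + "PAR1". */
function parseParquetTail(tail: Uint8Array): number {
  if (tail.length !== 8) {
    throw new Error(`expected 8 tail bytes, got ${tail.length}`);
  }
  const view = new DataView(tail.buffer, tail.byteOffset, 8);
  const magic = String.fromCharCode(
    view.getUint8(4), view.getUint8(5), view.getUint8(6), view.getUint8(7),
  );
  if (magic !== "PAR1") {
    throw new Error("not a Parquet file: missing PAR1 magic");
  }
  return view.getUint32(0, true); // true = little-endian
}

/** Range header value covering the last 8 bytes of the file. */
function tailRange(fileSize: number): string {
  return `bytes=${fileSize - 8}-${fileSize - 1}`;
}

/** Range header value covering the footer metadata just before the tail. */
function footerRange(fileSize: number, footerLength: number): string {
  const start = fileSize - 8 - footerLength;
  const end = fileSize - 9; // byte ranges are inclusive
  return `bytes=${start}-${end}`;
}

/** Fetch one byte range and insist on a 206 Partial Content response. */
async function fetchRange(url: string, range: string): Promise<Uint8Array> {
  const resp = await fetch(url, { headers: { Range: range } });
  if (resp.status !== 206) {
    throw new Error(`server did not honor Range (status ${resp.status})`);
  }
  return new Uint8Array(await resp.arrayBuffer());
}

/** Download the footer metadata bytes without downloading the whole file. */
async function readFooterBytes(url: string): Promise<Uint8Array> {
  // A HEAD request returns the file size without transferring the body.
  const head = await fetch(url, { method: "HEAD" });
  const fileSize = Number(head.headers.get("Content-Length"));
  if (!Number.isFinite(fileSize) || fileSize < 12) {
    throw new Error("could not determine a plausible file size");
  }
  const tail = await fetchRange(url, tailRange(fileSize));
  const footerLength = parseParquetTail(tail);
  return fetchRange(url, footerRange(fileSize, footerLength));
}
```

For a 2 GB file, this transfers the footer plus a few bytes of overhead rather than the whole file. A query engine can then use the footer to decide which further byte ranges it needs.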

We developed Data.gov Archive Search on the same model. Here’s how it works:

Data storage: We store Data.gov Archive catalog metadata as sorted, compressed Parquet files on Source.coop, taking advantage of performant static file hosting.
In-browser query engine: Our client-side web application loads DuckDB-Wasm, a fully functional database engine running inside the user’s browser.
On-demand data access: When a user navigates to a resource or submits a search, our DuckDB-Wasm client executes a targeted retrieval of the data needed to fulfill the request. No dedicated server is required; queries run entirely in the browser.
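
The three steps above can be sketched in TypeScript as follows. This is an illustration under stated assumptions, not the project's actual code: the Parquet URL, the column names `title` and `organization`, and the helper names are all hypothetical. The `QueryRunner` interface stands in for a database connection; DuckDB-Wasm's `AsyncDuckDBConnection` exposes a `query(sql)` method of this shape, and DuckDB's `read_parquet` function accepts a remote URL.

```typescript
// Minimal shape of a connection that can run SQL. In a real application this
// could be a DuckDB-Wasm AsyncDuckDBConnection, which has a compatible query().
interface QueryRunner {
  query(sql: string): Promise<unknown>;
}

// Hypothetical location of the catalog metadata; not the project's real URL.
const CATALOG_URL = "https://example.org/datagov/catalog.parquet";

/** Quote a value as a SQL string literal, doubling embedded single quotes. */
function sqlString(value: string): string {
  return `'${value.replace(/'/g, "''")}'`;
}

/**
 * Build a case-insensitive title search over the remote Parquet file.
 * `title` and `organization` are assumed column names.
 * Note: % and _ inside `term` still act as LIKE wildcards in this sketch.
 */
function buildSearchQuery(term: string, limit = 20): string {
  const safeLimit = Number.isInteger(limit) && limit > 0 ? limit : 20;
  return [
    "SELECT title, organization",
    `FROM read_parquet(${sqlString(CATALOG_URL)})`,
    `WHERE title ILIKE ${sqlString(`%${term}%`)}`,
    `LIMIT ${safeLimit}`,
  ].join("\n");
}

/** Run a search through whatever connection the caller supplies. */
async function search(conn: QueryRunner, term: string): Promise<unknown> {
  return conn.query(buildSearchQuery(term));
}
```

In a browser, a caller would instantiate DuckDB-Wasm, obtain a connection, and pass it to `search`. The engine then decides which byte ranges of the Parquet file to request, so the static host only ever serves plain file reads.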

This experiment has not been without obstacles. Getting good performance out of this model demands careful data engineering, and the large DuckDB-Wasm binary imposes a considerable latency cost. As of this writing, we’re continuing to explore speedy alternatives like hyparquet and Arquero to further improve performance.
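
One reason data engineering matters here is that Parquet footers record minimum and maximum values per column for each row group, and a query engine can skip row groups whose range cannot match a filter. Sorting the data before writing keeps those ranges narrow. The TypeScript sketch below illustrates the effect for an equality filter with made-up statistics; the types and function name are hypothetical, and this is a simplified model rather than a description of how the project's files are laid out.

```typescript
// Per-row-group min/max statistics for a single column (illustrative model).
interface RowGroupStats {
  index: number;
  min: string;
  max: string;
}

/** Return the indexes of row groups whose [min, max] range could contain value. */
function rowGroupsToFetch(groups: RowGroupStats[], value: string): number[] {
  return groups
    .filter((g) => g.min <= value && value <= g.max)
    .map((g) => g.index);
}

// Made-up statistics: similar values written in sorted vs. unsorted order.
const sortedGroups: RowGroupStats[] = [
  { index: 0, min: "a", max: "f" },
  { index: 1, min: "g", max: "m" },
  { index: 2, min: "n", max: "z" },
];
const unsortedGroups: RowGroupStats[] = [
  { index: 0, min: "a", max: "y" },
  { index: 1, min: "b", max: "z" },
  { index: 2, min: "a", max: "x" },
];

console.log(rowGroupsToFetch(sortedGroups, "k"));   // one row group: index 1
console.log(rowGroupsToFetch(unsortedGroups, "k")); // all three row groups
```

With sorted data the lookup touches one row group; with unsorted data every row group overlaps the value, so the browser would have to download all of them.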

Still, we’re pleased with the result: an inexpensive, low-maintenance static discovery platform that allows users to browse, search, and filter Data.gov Archive records entirely in the browser.

Why This Matters for Libraries, Digital Humanities Projects, and Beyond

This new pattern offers a compelling model for libraries, academic archives, and DH projects of all sizes:

Lower operating costs: By shifting from an expensive server to lower cost static storage, projects can sustainably offer their users access to data.
Reduced technical overhead: With no dedicated backend server, security risks are reduced, no patching or upgrades are needed, and crashing servers are not a concern.
Sustained access: Projects can be set up with care, but without demanding constant attention. Organizations can be more confident that their archive and discovery interfaces remain usable and accessible, even as staffing or funding changes over time.

Knowing that we are not the only group interested in approaching access in this way, we’re sharing our generalized learnings. We see a few ways forward for others in the knowledge and information world:

Prototype or pilot: If your organization has large, relatively static datasets, consider experimenting with a browser-based search tool using static hosting.
Share and collaborate: Template applications, workflows, and lessons learned can help this new pattern gain adoption and maturity across the community.

This project is still evolving, and we invite others—particularly those in libraries and digital cultural heritage—to explore these possibilities with us. We’re committed to open sharing as we refine our tools, and we welcome collaboration or feedback at [email protected].