Show HN: Hyperparam: OSS tools for exploring datasets locally in the browser

Original link: https://hyperparam.app/about/opensource

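Hyperparam addresses a key need for easy-to-use, scalable tools for exploring and curating large datasets, enabling data scientists to build better training sets. Its “data-centric AI in the browser” approach focuses on interactive exploration, AI-assisted curation, and a local-first, private design that bypasses traditional complex infrastructure. The Hyperparam ecosystem consists of open-source tools: Hyparquet (in-browser Parquet data access), Hyparquet-Writer (Parquet file export), HighTable (scalable React data table), Icebird (Apache Iceberg table reader), and Hyllama (LLM metadata parser). These libraries are written in TypeScript/JavaScript and work in both the browser and Node.js. The Hyperparam command-line interface (CLI) ties everything together, offering a one-command way to launch a local web app for viewing datasets. Through efficient data handling, scalable visualization, intuitive data export, and model metadata inspection, Hyperparam aims to transform data-centric machine learning workflows, offering a faster, easier, and more scalable alternative for iteratively improving data quality.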

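Hyperparam is a collection of open-source, dependency-free JavaScript libraries designed to help data scientists and machine learning engineers explore data locally in the browser. It includes Hyparquet (Parquet file reader), Icebird (Iceberg table browser), HighTable (virtual scrolling), Hyparquet-Writer (Parquet exporter), and Hyllama (LLM metadata reader). A CLI tool, `npx hyperparam`, lets you view local files. The project emphasizes that no cloud upload or backend server is required, which eases the development of front-end data applications. The creator responded to concerns about the name's relationship to hyperparameters, explaining how the project evolved from its initial focus on hyperparameters. They emphasized that Hyperparam's advantage over large WASM solutions such as DuckDB and DataFusion is its smaller size (for example, Hyparquet is only 10 KB), which means faster loading, especially when Parquet metadata is used to load parts of a file via HTTP range requests. Authentication for tools such as the Iceberg reader is being improved, and the `npx` command has no telemetry.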

Original text
Hyperparam OSS Universe

Hyperparam was founded to address a critical gap in the machine learning ecosystem: the lack of a user-friendly, scalable UI for exploring and curating massive datasets.

Our mission is grounded in the belief that data quality is the most important factor in ML success, and that better tools are needed to build better training sets. In practice, this means enabling data scientists and engineers to “look at your data” – even terabyte-scale text corpora – interactively and entirely in-browser without heavy infrastructure. By combining efficient data formats, high-performance JavaScript libraries, and emerging AI assistance, Hyperparam's vision is to put data quality front and center in model development. Our motto, “the missing UI for AI data”, reflects our goal to make massive data exploration, labeling, and quality management as intuitive as modern web apps, all while respecting privacy and compliance through a local-first design.

Mission and Vision: Data-Centric AI in the Browser

Our mission is to empower ML practitioners to create the best training datasets for the best models. This stems from an industry-wide realization that model performance is ultimately bounded by data quality, not just model architecture or hyperparameters. Hyperparam envisions a new workflow where:

  • Interactive Data Exploration at Scale: Users can freely explore huge datasets (millions or billions of records) with fast, free-form interactions to uncover insights. Unlike traditional Python notebooks that struggle with large data (often requiring downsampling or clunky pagination), Hyperparam leverages browser technology for a smooth UI.
  • AI-Assisted Curation: Hyperparam integrates ML models to help label, filter, and transform data at a scale that would be impractical to review manually. By combining a highly interactive UI with model assistance, we make it possible for the user to use data to express exactly what they want from the model.
  • Local-First and Private: Hyperparam runs entirely client-side, with no server dependency. This design not only simplifies setup (no complex pipeline or cloud needed) but also addresses enterprise compliance and security concerns, since sensitive data need not leave the user's machine. Fully browser-contained tools can bypass major adoption hurdles.

Experts across data engineering and MLOps widely agree on the need for better data exploration and labeling tools to tackle today's bottlenecks. We believe that the way to do that is to make data-centric AI workflows that are faster, easier to deploy, and more scalable – enabling users to iteratively improve data quality, which in turn yields better models.

The Hyperparam OSS Universe

[Figure: Hyperparam OSS Universe flowchart]

Hyperparam delivers on our vision through a suite of open-source tools that tackle different aspects of data curation. These tools are built in TypeScript/JavaScript for seamless browser and Node.js usage.

We care about performance, minimal dependencies, and standards compliance.

Hyparquet: In-Browser Parquet Data Access

Hyparquet is a lightweight, pure-JS library for reading Apache Parquet files directly in the browser. Parquet is a popular columnar format for large datasets, and Hyparquet enables web applications to tap into that efficiency without any server.

Hyparquet allows data scientists to open large dataset files instantly in a browser UI for examination, without needing Python scripts, servers, or cloud databases. It's useful for quick dataset validation (e.g. checking a sample of new data for quality issues) and for powering web-based data analysis tools. Because it's pure JS, developers can integrate Hyparquet into any web app or Electron application that needs to read Parquet. It is the core engine behind Hyperparam's own dataset viewer, enabling what was previously thought impossible: client-side big data exploration.

  • Browser-Native & Dependency-Free: Hyparquet has zero external dependencies and is designed to run in both modern browsers and Node.js. At ~9.7 KB gzipped, it's extremely lightweight. It implements the full Parquet specification, aiming to be the “world's most compliant Parquet parser” that can open more files (all encodings and types) than other libraries.
  • Efficient Streaming of Massive Data: Built with performance in mind, Hyparquet only loads the portions of data needed for a given query or view. It leverages Parquet's built-in indexing to fetch just the required rows or columns on the fly. This “load just in time” approach makes it feasible to interactively explore multi-gigabyte or even billion-row datasets in a web app.
  • Complete Compression Support: Parquet files often use compression (Snappy, Gzip, ZSTD, etc.). Hyparquet by default handles common cases (uncompressed, Snappy), and with a companion library Hyparquet-Compressors, it supports all Parquet compression codecs. This is achieved with WebAssembly-optimized decompressors – notably HySnappy, a WASM Snappy decoder that accelerates parsing with minimal footprint.
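
The "load just in time" idea rests on the Parquet file layout: a file ends with its footer metadata, followed by a 4-byte little-endian footer length and the magic bytes `PAR1`. The sketch below is not Hyparquet's code; it only illustrates how a reader could locate the footer from the last 8 bytes of a file and express that as an HTTP range. The function names `footerRange` and `rangeHeader` are hypothetical.

```javascript
// Parquet trailer layout: [footer metadata][4-byte LE footer length]["PAR1"]
const PARQUET_MAGIC = 'PAR1'

/**
 * Given the last 8 bytes of a Parquet file, return the byte range
 * occupied by the footer metadata (end is exclusive).
 * Hypothetical helper for illustration only.
 * @param {Uint8Array} tail last 8 bytes of the file
 * @param {number} fileSize total file size in bytes
 * @returns {{ start: number, end: number }}
 */
function footerRange(tail, fileSize) {
  if (tail.length !== 8) throw new Error('expected exactly 8 trailing bytes')
  const magic = String.fromCharCode(...tail.subarray(4, 8))
  if (magic !== PARQUET_MAGIC) throw new Error('not a Parquet file')
  const view = new DataView(tail.buffer, tail.byteOffset, tail.byteLength)
  const footerLength = view.getUint32(0, true) // little-endian
  const end = fileSize - 8
  const start = end - footerLength
  // The file also begins with a 4-byte magic, so the footer cannot start before byte 4
  if (start < 4) throw new Error('footer length exceeds file size')
  return { start, end }
}

/** Format an HTTP Range header value; HTTP byte ranges have an inclusive end. */
function rangeHeader(start, end) {
  return `bytes=${start}-${end - 1}`
}
```

In a browser, the resulting value could be passed as `fetch(url, { headers: { Range: rangeHeader(start, end) } })`, so only the footer is downloaded before deciding which row groups and columns to fetch.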

Hyparquet-Writer: Export Parquet Files from JavaScript

To complement Hyparquet's reading capabilities, Hyparquet-Writer provides a way to write or export data to Parquet format in JavaScript. It is designed to be as lightweight and efficient as its reading counterpart.

After exploring or filtering a dataset with Hyperparam's tools, a user might want to save a subset or annotations. Hyparquet-Writer makes it possible to export those results in-browser as a Parquet file (or in Node.js without needing Python/Java libraries). This is valuable for creating shareable “refined datasets” or for moving data between systems while staying in Parquet (avoiding expensive CSV conversions).

  • Fast Parquet Writing in JS: Hyparquet-Writer can take in JavaScript data (arrays of values per column) and output a binary Parquet file. It provides high efficiency and compact storage, so that even in-browser data manipulation results can be saved in a columnar format.
  • Extreme Data Compression: Parquet can represent large datasets very efficiently. It is especially efficient at representing sparse annotation data, exactly what we need for annotating and curating datasets.
  • Tiny and easy to deploy: Before Hyparquet-Writer, the only way to write Parquet files from the browser was to ship huge WASM bundles (DuckDB, DataFusion). Hyparquet-Writer is less than 100 KB of pure JavaScript, so it’s trivial to include with modern frontend applications.
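
Since the writer takes arrays of values per column, row-oriented results usually need to be pivoted first. The following is a minimal sketch of that pivot, not part of Hyparquet-Writer; the `rowsToColumns` name and the `{ name, data }` output shape are assumptions for illustration, so check the library's README for the exact input it expects.

```javascript
/**
 * Pivot an array of row objects into per-column arrays.
 * Missing values become null, which is how a sparse annotation
 * column (mostly empty, a few labels) would be represented.
 * Hypothetical helper; the output shape is an assumption.
 * @param {Record<string, any>[]} rows
 * @returns {{ name: string, data: any[] }[]}
 */
function rowsToColumns(rows) {
  // Collect column names in first-seen order
  const names = []
  const seen = new Set()
  for (const row of rows) {
    for (const key of Object.keys(row)) {
      if (!seen.has(key)) {
        seen.add(key)
        names.push(key)
      }
    }
  }
  // Build one array per column, filling gaps with null
  return names.map(name => ({
    name,
    data: rows.map(row => row[name] ?? null),
  }))
}
```

For example, three rows where only the second carries a `label` yield a `label` column of `[null, 'spam', null]`, which a columnar format can store compactly.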

HighTable: Scalable React Data Table Component

[Figure: Hyperparam HighTable table component]

HighTable is a React-based virtualized table component for viewing extremely large tables in the browser. It is the UI workhorse that displays data fetched by Hyparquet or other sources.

HighTable is crucial for visual data exploration. In Hyperparam's dataset viewer, HighTable renders the content of Parquet files, allowing you to scroll through data that far exceeds memory limitations. You can also embed HighTable in custom web apps where a large results table is needed (for example, viewing logs, telemetry, or any big tabular data) without losing interactivity. By handling only what's visible, it bridges the gap between big data backends and a smooth front-end experience.

HighTable provides:

  • Virtual Scrolling for Large Data: Instead of rendering thousands or millions of rows (which would choke the browser), HighTable only renders the rows in the current viewport, dynamically loading more as you scroll. This ensures smooth performance even with datasets that have millions of entries.
  • Asynchronous Data Loading: HighTable works with a flexible data model that can fetch data on-the-fly. The table requests rows for a given range (e.g., 100–200) through a provided function. This means the data could come from an in-memory array, an IndexedDB store, or a remote source via Hyparquet. HighTable is agnostic as long as it can retrieve slices. This design allows infinite scrolling through data of “any size”.
  • Rich Table Features: Despite focusing on scale, HighTable offers convenient features expected in a spreadsheet-like interface: optional column sorting, adjustable column widths, and event hooks (e.g., double-click on a cell). It even displays per-cell loading placeholders to indicate when data is being fetched, maintaining a responsive feel.
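
Virtual scrolling and range-based loading can be sketched in a few lines. This is not HighTable's implementation or API; `visibleRange` and `arraySource` are hypothetical names, and the fixed row height and overscan margin are simplifying assumptions.

```javascript
/**
 * Compute which rows need to be rendered for the current scroll position.
 * Assumes every row has the same height; overscan adds a few extra rows
 * above and below the viewport to avoid blank flashes while scrolling.
 * @returns {{ start: number, end: number }} row range, end exclusive
 */
function visibleRange({ scrollTop, viewportHeight, rowHeight, numRows, overscan = 5 }) {
  const first = Math.floor(scrollTop / rowHeight)
  const last = Math.ceil((scrollTop + viewportHeight) / rowHeight)
  return {
    start: Math.max(0, first - overscan),
    end: Math.min(numRows, last + overscan),
  }
}

/**
 * Hypothetical slice-based data source backed by an in-memory array.
 * A remote source would expose the same shape but fetch rows on demand.
 */
function arraySource(data) {
  return {
    numRows: data.length,
    // Resolve asynchronously so callers treat local and remote sources alike
    rows(start, end) {
      return Promise.resolve(data.slice(start, end))
    },
  }
}
```

A table component built this way would call `source.rows(range.start, range.end)` whenever the scroll position changes and show placeholders until the promise resolves, so memory use depends on the viewport rather than on the dataset size.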

Icebird: JavaScript Apache Iceberg Table Reader

Icebird extends Hyperparam's reach into data stored in Apache Iceberg format. Iceberg is a popular table format for data lakes (often used on Hadoop/S3 storage) that stores its data as Parquet files under the hood. Importantly, Iceberg allows you to efficiently evolve large datasets (add/remove rows, add columns, etc.). Icebird is essentially a JavaScript Iceberg client that can read Iceberg table metadata and retrieve data files, built on top of Hyparquet.

If you are using Data Lake/Lakehouse architectures, Icebird makes it possible to inspect large Iceberg tables without a big data engine. A data engineer can point Hyperparam's viewer at an S3 path of an Iceberg table and quickly peek at a few rows or columns for validation. This is dramatically simpler than launching Spark or Trino for a small inspection task. Icebird brings our “no backend” philosophy to another major data format.

  • Iceberg Table Access: Given a pointer to an Iceberg table (for example, a directory or catalog entry on cloud storage), Icebird can read the table's schema and metadata, then use Hyparquet to read the actual Parquet file fragments that make up the table. It supports Iceberg features such as schema evolution (e.g. renamed columns) and position deletes, with a roadmap to cover more features as needed.
  • Time Travel Queries: Icebird allows users to retrieve data from older snapshots of the dataset (a feature of Iceberg) by specifying a metadata version to read. This is useful for auditing changes in data over time or reproducing an experiment on a previous dataset state – all from a browser environment.
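
Time travel builds on the snapshot list that Iceberg keeps in its table metadata, where each snapshot records a `snapshot-id`, a `timestamp-ms`, and a `manifest-list` location. The sketch below does not use Icebird's API, and it selects by timestamp rather than by metadata version as described above; `snapshotAsOf` is a hypothetical helper that only illustrates the underlying model.

```javascript
/**
 * Find the most recent snapshot committed at or before a given time.
 * Field names follow the Iceberg table metadata JSON format.
 * Hypothetical helper; not part of Icebird.
 * @param {{ snapshots?: object[] }} metadata parsed table metadata JSON
 * @param {number} timestampMs point in time, milliseconds since epoch
 * @returns {object | undefined} the matching snapshot, if any
 */
function snapshotAsOf(metadata, timestampMs) {
  let best
  for (const snapshot of metadata.snapshots ?? []) {
    const ts = snapshot['timestamp-ms']
    if (ts <= timestampMs && (!best || ts > best['timestamp-ms'])) {
      best = snapshot
    }
  }
  return best
}
```

The selected snapshot's `manifest-list` then identifies the set of data files that made up the table at that moment, which is what makes reproducing an experiment on an earlier dataset state possible.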

Hyllama: Llama.cpp Model Metadata Parser

Hyllama is a slightly different tool in Hyperparam's suite – it's focused on model files rather than dataset files. Specifically, Hyllama is a JavaScript library to parse llama.cpp .gguf files (a format for LLaMA and related large language model weights) and extract their metadata.

Hyllama's primary use case is to allow users to inspect an LLM model file's contents (architecture parameters, vocab size, layer counts, etc.) and potentially even query its listed tokens or other metadata in the browser. For instance, you can drag-and-drop a .gguf model file onto a web page using Hyllama and quickly see what architecture and quantization it has, without running the model. You can use Hyllama to introspect model files easily or verify that model files match a dataset's schema expectations.

  • Efficient Metadata Extraction: LLM model files in GGUF format can be tens of gigabytes, which is impractical to load entirely in memory. Hyllama is designed to read just the metadata (and tensor indexes) from the file without loading full weights, by using partial reads (e.g., reading the first few MBs that contain the header and index).
  • No Dependencies & Web-Friendly: Like Hyparquet, Hyllama is dependency-free and can run in both Node and browser environments. For browser use, it suggests employing HTTP range requests to fetch just the needed bytes of a model file.
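
Partial reads work because a GGUF file starts with a small fixed header: the magic bytes `GGUF`, a 32-bit version, then 64-bit tensor and metadata key-value counts (all little-endian, for format version 2 and later). The sketch below is not Hyllama's code; `parseGgufHeader` is a hypothetical function that reads only those first 24 bytes.

```javascript
/**
 * Parse the fixed-size GGUF header from the first 24 bytes of a file.
 * Hypothetical helper for illustration; handles GGUF version 2 and later,
 * where the two counts are 64-bit values.
 * @param {ArrayBuffer} buffer at least the first 24 bytes of the file
 * @returns {{ version: number, tensorCount: bigint, metadataKvCount: bigint }}
 */
function parseGgufHeader(buffer) {
  if (buffer.byteLength < 24) throw new Error('need at least 24 bytes')
  const view = new DataView(buffer)
  const magic = String.fromCharCode(
    view.getUint8(0), view.getUint8(1), view.getUint8(2), view.getUint8(3)
  )
  if (magic !== 'GGUF') throw new Error('not a GGUF file')
  const version = view.getUint32(4, true) // little-endian
  if (version < 2) throw new Error('GGUF version 1 uses 32-bit counts and is not handled here')
  return {
    version,
    tensorCount: view.getBigUint64(8, true),
    metadataKvCount: view.getBigUint64(16, true),
  }
}
```

The key-value metadata and tensor index follow the header, so a reader can keep requesting small byte ranges from the start of the file until the index is complete, without ever touching the weights.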

Hyperparam CLI: Local Dataset Viewer

The Hyperparam CLI ties everything together into a user-facing application. It is a command-line tool that, when run (npx hyperparam), launches a local web application for dataset viewing. Essentially, it's a one-command way to spin up the Hyperparam browser UI on your own local data.

  • Scalable Local Dataset Viewer: By running the CLI, users can point it to a file, folder, or URL containing data and open an interactive browser view. For example, npx hyperparam mydataset.parquet will open the Hyperparam web UI and display the contents of that Parquet file in a scrollable table. If a directory is given, it provides a file browser to pick a dataset. Under the hood, the CLI uses Node.js to serve the static app and utilizes Hyparquet/Icebird libraries (via a built-in API) to fetch data from local disk or remote URLs, then displays it with HighTable in the browser.

How the Tools Work Together

Hyperparam's suite of open-source tools is the backbone of a cohesive ecosystem tailored specifically for machine learning data workflows, enabling interactive exploration and management directly in the browser. By integrating efficient in-browser data handling (Hyparquet and Icebird), scalable visualization (HighTable), intuitive data export capabilities (Hyparquet-Writer), and model metadata inspection (Hyllama), we hope to show that there is a better way to build data-centric ML tools. We are releasing this work as open source because we believe that everyone benefits from having a strong ecosystem of AI data tools.

If you find these free open source tools useful, please show it! We love GitHub Stars ⭐
