Bauplan – Git-for-data pipelines on object storage

Original link: https://docs.bauplanlabs.com/en/latest/

Bauplan is a Python-first data platform that simplifies creating and managing large-scale data pipelines on S3 data lakes. Built by ML and data engineers frustrated with infrastructure complexity, Bauplan provides a serverless environment where pipelines are written as simple Python functions, with no containerization and no heavyweight frameworks like Spark. Key features include working with tables directly in S3 (converting Parquet/CSV into Apache Iceberg tables with ACID transactions and schema evolution), Git-for-data for instant data-lake branching and safe collaboration, and serverless pipelines for fast, stateless execution. Bauplan also provides SQL access to versioned data and CI/CD for data pipelines. Its core innovation is "Refs," which automatically version every pipeline run, table, and model, ensuring reproducibility, auditability, and the ability to roll back. Bauplan lets users run AI applications, ML workloads, and data transformation pipelines without the burden of managing the underlying data infrastructure.

Bauplan is a new data pipeline platform that aims to replace custom frameworks and notebooks with a code-first approach. Users run SQL/Python functions from their IDE, in the cloud, on top of object storage. The system emphasizes versioning, composability, time travel, and Git-like branching. The platform co-designs its abstraction layer and runtime, enabling optimizations on the FaaS and data-operations side, such as function rebuilds reported to be 15x faster than AWS. It offers simple APIs for both human (CLI) and machine (SDK) interaction. The developers are seeking feedback on their approach of aligning data engineering workflows with familiar software development abstractions such as tables, functions, branches, and CI/CD. One commenter asked for a 10-minute YouTube video demonstrating the product's capabilities.

Original

Bauplan is a Pythonic data platform that provides functions as a service for large-scale data pipelines and git-for-data over S3 data lakes. Bauplan handles tasks that would typically require an entire infrastructure team. Our goal is to allow you and your team to run large-scale ML workflows, AI applications and data transformation pipelines in the cloud without managing any data infrastructure.

Why we built it. We are a team of ML and data engineers and we built Bauplan because we’ve experienced firsthand the frustration of spending too much time wrestling with cloud infrastructure. Bauplan was built to offer a Python-first platform that is both extremely simple and robust at the same time.

Simple. Our serverless functions allow you to write pipelines as simple Python functions chained together without dealing with containerization, runtime configuration and specialized big-data frameworks like Spark.
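As a conceptual sketch only (this is not the Bauplan SDK; the `step` decorator and `run` helper below are hypothetical), "pipelines as simple Python functions chained together" can look like this:

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical illustration of the programming model, not Bauplan's API.
# Each step is a plain Python function that declares the upstream step
# whose output it consumes; a tiny runner resolves the chain.
REGISTRY: Dict[str, Tuple[Callable, Optional[str]]] = {}

def step(depends_on: Optional[str] = None) -> Callable:
    """Register a function as a pipeline step with an optional upstream."""
    def decorator(fn: Callable) -> Callable:
        REGISTRY[fn.__name__] = (fn, depends_on)
        return fn
    return decorator

@step()
def raw_rows() -> List[dict]:
    # In a real pipeline this would scan a table in S3.
    return [{"city": "NYC", "fare": 12.0}, {"city": "NYC", "fare": 8.0}]

@step(depends_on="raw_rows")
def total_fares(rows: List[dict]) -> float:
    # Downstream step: receives the upstream step's output as its input.
    return sum(r["fare"] for r in rows)

def run(target: str):
    """Execute the dependency chain bottom-up and return the target's output."""
    fn, upstream = REGISTRY[target]
    return fn(run(upstream)) if upstream else fn()
```

Calling `run("total_fares")` executes `raw_rows` first and feeds its output downstream. Bauplan applies the same idea serverlessly: each function runs stateless in the cloud, with no containers to build and no runtime to configure.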

Robust. Using Git-for-data and our unique system of Refs, we make sure that every pipeline run and every table and every model is automatically versioned, reproducible and auditable.

Main features

  • Pythonic by design. Build workflows using native Python in your favorite IDE—no DSLs, no YAML, no Spark required.

  • Work with tables directly in S3. Convert your Parquet and CSV files into Apache Iceberg tables with a single line of code. Get ACID transactions, schema and partition evolution, time travel, and optimized queries—without leaving your S3 bucket.

  • Git-for-data. Create zero-copy branches of your data lake instantly. Safely collaborate on real data without risking downstream breakage.

  • Serverless pipelines. Run fast, stateless Python functions in the cloud. Chain them together to build full pipelines—no containers, no runtime headaches.

  • SQL everywhere. Run interactive or async SQL queries across branches and tables in S3, with full support for versioned data.

  • CI/CD for data. Automate testing and deployment of data pipelines using data branches and our Python SDK—just like your code, with instant feedback loops.

  • Version and reproduce with Refs. Every pipeline run is tracked through data and code versioning. Use Refs to reproduce results, audit changes, and roll back with confidence.
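The branch and Ref semantics above can be modeled in a few lines. This is a conceptual illustration only, not Bauplan's implementation: a branch is just a named pointer to an immutable snapshot, so creating one copies no table data, and every write produces a new Ref that stays readable forever.

```python
from types import MappingProxyType

# Conceptual model of Git-for-data: snapshots are immutable mappings of
# table name -> data; branches are names pointing at a snapshot (a Ref).
snapshots = {}   # ref id -> immutable snapshot
branches = {}    # branch name -> ref id
_next_ref = 0

def _commit(tables: dict) -> int:
    """Store an immutable snapshot and return its Ref id."""
    global _next_ref
    _next_ref += 1
    snapshots[_next_ref] = MappingProxyType(dict(tables))
    return _next_ref

def create_branch(name: str, from_branch: str) -> None:
    # Zero-copy: only a pointer is duplicated, never the data.
    branches[name] = branches[from_branch]

def write_table(branch: str, table: str, rows: list) -> int:
    # A write produces a brand-new snapshot; all older Refs stay readable.
    new = dict(snapshots[branches[branch]])
    new[table] = rows
    branches[branch] = _commit(new)
    return branches[branch]

def read_table(ref: int, table: str) -> list:
    # Time travel: read any historical Ref directly.
    return snapshots[ref][table]

# Seed "main" with one table.
branches["main"] = _commit({"taxi": [1, 2, 3]})
```

A branch created from `main` sees the same tables instantly; writing on it never disturbs `main`, and the returned Ref pins that exact state for reproduction, audit, or rollback, which is the property the bullet list describes.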

Use cases

Run AI applications, ML workloads and data pipelines. Here, you’ll find numerous examples demonstrating how our customers use the platform to solve real-world problems.
