Show HN: Arc – high-throughput time-series warehouse with DuckDB analytics

Original link: https://github.com/Basekick-Labs/arc

## Arc Core: High-Performance Time-Series Data Warehouse (Alpha)

Arc Core is a new high-performance time-series data warehouse built on DuckDB, Parquet, and MinIO. It is under active development (alpha release) and aims to deliver fast analytical queries with scalable storage.

**Key features:**

* **High ingestion rate:** Up to **1.89M records/sec** with MessagePack, far faster than Line Protocol. InfluxDB Line Protocol is supported for compatibility with existing systems.
* **Scalable storage:** Uses MinIO (S3-compatible object storage) for unlimited scale and cost efficiency; local disk, AWS S3, and GCS are also supported.
* **Powerful queries:** Fast SQL analytics through the DuckDB query engine.
* **Easy deployment:** Docker or native deployment (native is 2.4x faster).
* **Data import:** Import data from InfluxDB and TimescaleDB.

**Optimal configuration:** MessagePack protocol, workers set to 3x the CPU core count, native deployment, and MinIO storage.

**Important:** This is an alpha release and is **not recommended for production**. Use it for development and testing only. The project is evolving rapidly, with ongoing performance improvements and feature additions.

**Resources:** API documentation is available at `http://localhost:8000/docs`. Community support is provided through GitHub Issues; enterprise support is available by email.

## Arc: A Fast Time-Series Data Platform

Ignacio, founder of Basekick Labs, introduces Arc, a new open-source time-series data platform designed for high-speed ingestion *and* powerful analytics. Arc uses a MessagePack API for fast writes (about 1.88M records/sec on an M3 Pro Max) and supports InfluxDB Line Protocol for compatibility. Data is stored as Parquet, partitioned by hour, and queried efficiently with standard SQL through the DuckDB engine. The project aims to overcome the trade-offs between retention, throughput, and complexity common in other systems.

Benchmarks show strong performance, including a 35.18 s ClickBench cold-run total and a roughly 0.81 s hot-run average. The developer welcomes feedback and use-case ideas, especially from users looking for a self-hosted alternative to hosted DuckDB services such as MotherDuck. The name "Arc" is a nod to "Ark", evoking data storage while avoiding the biblical connotation.

High-performance time-series data warehouse built on DuckDB, Parquet, and MinIO.

⚠️ Alpha Release - Technical Preview

Arc Core is currently in active development and evolving rapidly. While the system is stable and functional, it is not recommended for production workloads at this time. We are continuously improving performance, adding features, and refining the API. Use in development and testing environments only.

  • High-Performance Ingestion: MessagePack binary protocol (recommended), InfluxDB Line Protocol (drop-in replacement), JSON
  • DuckDB Query Engine: Fast analytical queries with SQL
  • Distributed Storage with MinIO: S3-compatible object storage for unlimited scale and cost-effective data management (recommended). Also supports local disk, AWS S3, and GCS
  • Data Import: Import data from InfluxDB, TimescaleDB, HTTP endpoints
  • Query Caching: Configurable result caching for improved performance
  • Production Ready: Docker deployment with health checks and monitoring

Performance Benchmark 🚀

Arc achieves 1.89M records/sec with MessagePack binary protocol!

| Metric | Value | Notes |
| --- | --- | --- |
| Throughput | 1.89M records/sec | MessagePack binary protocol |
| p50 Latency | 21ms | Median response time |
| p95 Latency | 204ms | 95th percentile |
| Success Rate | 99.9998% | Production-grade reliability |
| vs Line Protocol | 7.9x faster | 240K → 1.89M RPS |

Tested on Apple M3 Max (14 cores), native deployment with MinIO

🎯 Optimal Configuration:

  • Workers: 3x CPU cores (e.g., 14 cores = 42 workers)
  • Deployment: Native mode (2.4x faster than Docker)
  • Storage: MinIO native (not containerized)
  • Protocol: MessagePack binary (/write/v2/msgpack)
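
As a quick illustration of the worker sizing rule above, the count can be derived from the host's core count; this is just the README's rule of thumb expressed in Python, not an Arc API:

import os

def optimal_workers(multiplier: int = 3) -> int:
    """Rule of thumb from this README: workers = 3x CPU cores."""
    cores = os.cpu_count() or 1  # os.cpu_count() may return None
    return cores * multiplier

print(optimal_workers())  # e.g. 14 cores -> 42 workers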

Quick Start (Native - Recommended for Maximum Performance)

Native deployment delivers 1.89M RPS vs 570K RPS in Docker (2.4x faster).

# One-command start (auto-installs MinIO, auto-detects CPU cores)
./start.sh native

# Alternative: Manual setup
python3.11 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
cp .env.example .env

# Start MinIO natively (auto-configured by start.sh)
brew install minio/stable/minio minio/stable/mc  # macOS
# OR download from https://min.io/download for Linux

# Start Arc (auto-detects optimal worker count: 3x CPU cores)
./start.sh native

The Arc API will be available at http://localhost:8000, and the MinIO Console at http://localhost:9001 (minioadmin/minioadmin).

Quick Start (Docker)

# Start Arc Core with MinIO
docker-compose up -d

# Check status
docker-compose ps

# View logs
docker-compose logs -f arc-api

# Stop
docker-compose down

Note: Docker mode achieves ~570K RPS. For maximum performance (1.89M RPS), use native deployment.

Deploy Arc Core to a remote server:

# Docker deployment
./deploy.sh -h your-server.com -u ubuntu -m docker

# Native deployment
./deploy.sh -h your-server.com -u ubuntu -m native

Arc Core uses a centralized arc.conf configuration file (TOML format). This provides:

  • Clean, organized configuration structure
  • Environment variable overrides for Docker/production
  • Production-ready defaults
  • Comments and documentation inline

Primary Configuration: arc.conf

Edit the arc.conf file for all settings:

# Server Configuration
[server]
host = "0.0.0.0"
port = 8000
workers = 8  # Adjust based on load: 4=light, 8=medium, 16=high

# Authentication
[auth]
enabled = true
default_token = ""  # Leave empty to auto-generate

# Query Cache
[query_cache]
enabled = true
ttl_seconds = 60

# Storage Backend (MinIO recommended)
[storage]
backend = "minio"

[storage.minio]
endpoint = "http://minio:9000"
access_key = "minioadmin"
secret_key = "minioadmin123"
bucket = "arc"
use_ssl = false

# For AWS S3
# [storage]
# backend = "s3"
# [storage.s3]
# bucket = "arc-data"
# region = "us-east-1"

# For Google Cloud Storage
# [storage]
# backend = "gcs"
# [storage.gcs]
# bucket = "arc-data"
# project_id = "my-project"

Configuration Priority (highest to lowest):

  1. Environment variables (e.g., ARC_WORKERS=16)
  2. arc.conf file
  3. Built-in defaults
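
A minimal sketch of how that precedence could be resolved, assuming hypothetical helper and key names (this is not Arc's internal loader):

import os
import tomllib  # Python 3.11+, matching the python3.11 used above

BUILTIN_DEFAULTS = {"workers": 8, "port": 8000}

def resolve_setting(name: str, env_var: str, conf_path: str = "arc.conf"):
    """Illustrative lookup order: environment variable > arc.conf > built-in default."""
    if env_var in os.environ:                    # 1. environment variable
        return os.environ[env_var]
    try:
        with open(conf_path, "rb") as f:
            server_section = tomllib.load(f).get("server", {})
        if name in server_section:               # 2. value from arc.conf
            return server_section[name]
    except FileNotFoundError:
        pass
    return BUILTIN_DEFAULTS[name]                # 3. built-in default

print(resolve_setting("workers", "ARC_WORKERS"))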

Environment Variable Overrides

You can override any setting via environment variables:

# Server
ARC_HOST=0.0.0.0
ARC_PORT=8000
ARC_WORKERS=8

# Storage
STORAGE_BACKEND=minio
MINIO_ENDPOINT=minio:9000
MINIO_ACCESS_KEY=minioadmin
MINIO_SECRET_KEY=minioadmin123
MINIO_BUCKET=arc

# Cache
QUERY_CACHE_ENABLED=true
QUERY_CACHE_TTL=60

# Logging
LOG_LEVEL=INFO

Legacy Support: .env files are still supported for backward compatibility, but arc.conf is recommended.

After starting Arc Core, create an admin token for API access:

# Docker deployment
docker exec -it arc-api python3 -c "
from api.auth import AuthManager
auth = AuthManager(db_path='/data/historian.db')
token = auth.create_token('my-admin', description='Admin token')
print(f'Admin Token: {token}')
"

# Native deployment
cd /path/to/arc-core
source venv/bin/activate
python3 -c "
from api.auth import AuthManager
auth = AuthManager()
token = auth.create_token('my-admin', description='Admin token')
print(f'Admin Token: {token}')
"

Save this token - you'll need it for all API requests.

All endpoints require authentication via Bearer token:

# Set your token
export ARC_TOKEN="your-token-here"

# Pass it as a Bearer token on your requests
curl -H "Authorization: Bearer $ARC_TOKEN" http://localhost:8000/health

Ingest Data (MessagePack - Recommended)

MessagePack binary protocol offers 3x faster ingestion with zero-copy PyArrow processing:

import msgpack
import requests
from datetime import datetime

token = "your-token-here"  # admin token created in the authentication step above

# Prepare data in MessagePack format
data = {
    "database": "metrics",
    "table": "cpu_usage",
    "records": [
        {
            "timestamp": int(datetime.now().timestamp() * 1e9),  # nanoseconds
            "host": "server01",
            "cpu": 0.64,
            "memory": 0.82
        },
        {
            "timestamp": int(datetime.now().timestamp() * 1e9),
            "host": "server02",
            "cpu": 0.45,
            "memory": 0.71
        }
    ]
}

# Send via MessagePack
response = requests.post(
    "http://localhost:8000/write/v2/msgpack",
    headers={
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/msgpack"
    },
    data=msgpack.packb(data)
)
print(response.json())

Batch ingestion (for high throughput):

# Send 10,000 records at once
records = [
    {
        "timestamp": int(datetime.now().timestamp() * 1e9),
        "sensor_id": f"sensor_{i}",
        "temperature": 20 + (i % 10),
        "humidity": 60 + (i % 20)
    }
    for i in range(10000)
]

data = {
    "database": "iot",
    "table": "sensors",
    "records": records
}

response = requests.post(
    "http://localhost:8000/write/v2/msgpack",
    headers={
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/msgpack"
    },
    data=msgpack.packb(data)
)

Ingest Data (Line Protocol - InfluxDB Compatibility)

For drop-in replacement of InfluxDB - compatible with Telegraf and InfluxDB clients:

# InfluxDB 1.x compatible endpoint
curl -X POST "http://localhost:8000/write/line?db=mydb" \
  -H "Authorization: Bearer $ARC_TOKEN" \
  -H "Content-Type: text/plain" \
  --data-binary "cpu,host=server01 value=0.64 1633024800000000000"

# Multiple measurements
curl -X POST "http://localhost:8000/write/line?db=metrics" \
  -H "Authorization: Bearer $ARC_TOKEN" \
  -H "Content-Type: text/plain" \
  --data-binary "cpu,host=server01,region=us-west value=0.64 1633024800000000000
memory,host=server01,region=us-west used=8.2,total=16.0 1633024800000000000
disk,host=server01,region=us-west used=120.5,total=500.0 1633024800000000000"
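
The same Line Protocol writes can be sent from Python with the requests library; this simply mirrors the curl calls above against the /write/line endpoint:

import requests

ARC_TOKEN = "your-token-here"

payload = "\n".join([
    "cpu,host=server01,region=us-west value=0.64 1633024800000000000",
    "memory,host=server01,region=us-west used=8.2,total=16.0 1633024800000000000",
])

resp = requests.post(
    "http://localhost:8000/write/line",
    params={"db": "metrics"},
    headers={
        "Authorization": f"Bearer {ARC_TOKEN}",
        "Content-Type": "text/plain",
    },
    data=payload,
)
resp.raise_for_status()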

Telegraf configuration (drop-in InfluxDB replacement):

[[outputs.influxdb]]
  urls = ["http://localhost:8000"]
  database = "telegraf"
  skip_database_creation = true

  # Authentication
  username = ""  # Leave empty
  password = "$ARC_TOKEN"  # Use your Arc token as password

  # Or use HTTP headers
  [outputs.influxdb.headers]
    Authorization = "Bearer $ARC_TOKEN"

Query Data (DuckDB SQL)

Run SQL queries over your data with the query endpoint:

curl -X POST http://localhost:8000/query \
  -H "Authorization: Bearer $ARC_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "database": "mydb",
    "query": "SELECT * FROM cpu_usage WHERE host = '\''server01'\'' ORDER BY timestamp DESC LIMIT 100"
  }'

Advanced queries with DuckDB SQL:

# Aggregations
curl -X POST http://localhost:8000/query \
  -H "Authorization: Bearer $ARC_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "database": "metrics",
    "query": "SELECT host, AVG(cpu) as avg_cpu, MAX(memory) as max_memory FROM cpu_usage WHERE timestamp > now() - INTERVAL 1 HOUR GROUP BY host"
  }'

# Time-series analysis
curl -X POST http://localhost:8000/query \
  -H "Authorization: Bearer $ARC_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "database": "iot",
    "query": "SELECT time_bucket(INTERVAL '\''5 minutes'\'', timestamp) as bucket, AVG(temperature) as avg_temp FROM sensors GROUP BY bucket ORDER BY bucket"
  }'
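
The same queries can be issued from Python; the request body mirrors the curl examples above, while the shape of the JSON response is not specified here, so inspect it (or the /docs page) rather than relying on particular field names:

import requests

ARC_TOKEN = "your-token-here"

resp = requests.post(
    "http://localhost:8000/query",
    headers={"Authorization": f"Bearer {ARC_TOKEN}"},
    json={
        "database": "metrics",
        "query": "SELECT host, AVG(cpu) AS avg_cpu FROM cpu_usage GROUP BY host",
    },
)
resp.raise_for_status()
print(resp.json())  # response schema is defined by Arc's API; see /docs

The diagram below shows how these write and query paths fit together.
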
┌─────────────────────────────────────────────────────────────┐
│                     Client Applications                      │
│  (Telegraf, Python, Go, JavaScript, curl, etc.)             │
└──────────────────┬──────────────────────────────────────────┘
                   │
                   │ HTTP/HTTPS
                   ▼
┌─────────────────────────────────────────────────────────────┐
│                   Arc API Layer (FastAPI)                    │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────────┐  │
│  │ Line Protocol│  │  MessagePack │  │  Query Engine    │  │
│  │   Endpoint   │  │   Binary API │  │   (DuckDB)       │  │
│  └──────────────┘  └──────────────┘  └──────────────────┘  │
└──────────────────┬──────────────────────────────────────────┘
                   │
                   │ Write Pipeline
                   ▼
┌─────────────────────────────────────────────────────────────┐
│              Buffering & Processing Layer                    │
│  ┌──────────────────────────────────────────────────────┐  │
│  │  ParquetBuffer (Line Protocol)                       │  │
│  │  - Batches records by measurement                    │  │
│  │  - Polars DataFrame → Parquet                        │  │
│  │  - Snappy compression                                │  │
│  └──────────────────────────────────────────────────────┘  │
│  ┌──────────────────────────────────────────────────────┐  │
│  │  ArrowParquetBuffer (MessagePack Binary)             │  │
│  │  - Zero-copy PyArrow RecordBatch                     │  │
│  │  - Direct Parquet writes (3x faster)                 │  │
│  │  - Columnar from start                               │  │
│  └──────────────────────────────────────────────────────┘  │
└──────────────────┬──────────────────────────────────────────┘
                   │
                   │ Parquet Files
                   ▼
┌─────────────────────────────────────────────────────────────┐
│              Storage Backend (Pluggable)                     │
│  ┌────────────────────────────────────────────────────────┐ │
│  │  MinIO (Recommended - S3-compatible)                   │ │
│  │  ✓ Unlimited scale          ✓ Distributed             │ │
│  │  ✓ Cost-effective           ✓ Self-hosted             │ │
│  │  ✓ High availability        ✓ Erasure coding          │ │
│  │  ✓ Multi-tenant             ✓ Object versioning       │ │
│  └────────────────────────────────────────────────────────┘ │
│                                                              │
│  Alternative backends: Local Disk, AWS S3, Google Cloud     │
└─────────────────────────────────────────────────────────────┘
                   │
                   │ Query Path (Direct Parquet reads)
                   ▼
┌─────────────────────────────────────────────────────────────┐
│              Query Engine (DuckDB)                           │
│  - Direct Parquet reads from object storage                 │
│  - Columnar execution engine                                │
│  - Query cache for common queries                           │
│  - Full SQL interface (Postgres-compatible)                 │
└─────────────────────────────────────────────────────────────┘
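
The ArrowParquetBuffer box above describes a columnar write path. The snippet below is only a rough sketch of that idea using PyArrow directly, not Arc's buffer implementation: records are turned into an Arrow table and written out as Snappy-compressed Parquet.

import pyarrow as pa
import pyarrow.parquet as pq

# Records as they might arrive from the MessagePack endpoint
records = [
    {"timestamp": 1633024800000000000, "host": "server01", "cpu": 0.64},
    {"timestamp": 1633024801000000000, "host": "server02", "cpu": 0.45},
]

# Build a columnar Arrow table directly from the record dicts
table = pa.Table.from_pylist(records)

# Write Snappy-compressed Parquet, e.g. one file per measurement per flush
pq.write_table(table, "cpu_usage.parquet", compression="snappy")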

Arc Core is designed with MinIO as the primary storage backend for several key reasons:

  1. Unlimited Scale: Store petabytes of time-series data without hitting storage limits
  2. Cost-Effective: Commodity hardware or cloud storage at fraction of traditional database costs
  3. Distributed Architecture: Built-in replication and erasure coding for data durability
  4. S3 Compatibility: Works with any S3-compatible storage (AWS S3, GCS, Wasabi, etc.)
  5. Performance: Direct Parquet reads from object storage with DuckDB's efficient execution
  6. Separation of Compute & Storage: Scale storage and compute independently
  7. Self-Hosted Option: Run on your own infrastructure without cloud vendor lock-in

The MinIO + Parquet + DuckDB combination provides the perfect balance of cost, performance, and scalability for analytical time-series workloads.
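
To illustrate the query path (this is not Arc's internal code), DuckDB can read Parquet directly from an S3-compatible endpoint through its httpfs extension; the bucket and prefix layout here is hypothetical, and the credentials match the quick-start defaults:

import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs")
con.execute("LOAD httpfs")

# Point DuckDB at the local MinIO endpoint
con.execute("SET s3_endpoint='localhost:9000'")
con.execute("SET s3_access_key_id='minioadmin'")
con.execute("SET s3_secret_access_key='minioadmin'")
con.execute("SET s3_use_ssl=false")
con.execute("SET s3_url_style='path'")

# Hypothetical bucket/prefix layout; Arc's actual partitioning may differ
rows = con.execute(
    "SELECT host, AVG(cpu) AS avg_cpu "
    "FROM read_parquet('s3://arc/metrics/cpu_usage/*.parquet') "
    "GROUP BY host"
).fetchall()
print(rows)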

Arc Core has been benchmarked using ClickBench - the industry-standard analytical database benchmark with 100M row dataset (14GB) and 43 analytical queries.

Hardware: AWS c6a.4xlarge (16 vCPU AMD EPYC 7R13, 32GB RAM, 500GB gp2)

  • Cold Run Total: 35.18s (sum of 43 queries, first execution)
  • Hot Run Average: 0.81s (average per query after caching)
  • Aggregate Performance: ~2.8M rows/sec cold, ~123M rows/sec hot (across all queries)
  • Storage: MinIO (S3-compatible)
  • Success Rate: 43/43 queries (100%)

Hardware: Apple M3 Max (14 cores ARM, 36GB RAM)

  • Cold Run Total: 23.86s (sum of 43 queries, first execution)
  • Hot Run Average: 0.52s (average per query after caching)
  • Aggregate Performance: ~4.2M rows/sec cold, ~192M rows/sec hot (across all queries)
  • Storage: Local NVMe SSD
  • Success Rate: 43/43 queries (100%)

Key Performance Characteristics

  • Columnar Storage: Parquet format with Snappy compression
  • Query Engine: DuckDB with default settings (ClickBench compliant)
  • Result Caching: 60s TTL for repeated queries (production mode)
  • End-to-End: All timings include HTTP/JSON API overhead

Fastest queries:

| Query | Time (avg) | Description |
| --- | --- | --- |
| Q1 | 0.021s | Simple aggregation |
| Q8 | 0.034s | String parsing |
| Q27 | 0.086s | Complex grouping |
| Q41 | 0.048s | URL parsing |
| Q42 | 0.044s | Multi-column filter |

Slowest queries:

| Query | Time (avg) | Description |
| --- | --- | --- |
| Q29 | 7.97s | Heavy string operations |
| Q19 | 1.69s | Multiple joins |
| Q33 | 1.86s | Complex aggregations |

Benchmark Configuration:

  • Dataset: 100M rows, 14GB Parquet (ClickBench hits.parquet)
  • Protocol: HTTP REST API with JSON responses
  • Caching: Disabled for benchmark compliance
  • Tuning: None (default DuckDB settings)

See full results and methodology at ClickBench Results (Arc submission pending).

The docker-compose.yml includes:

  • arc-api: Main API server (port 8000)
  • minio: S3-compatible storage (port 9000, console 9001)
  • minio-init: Initializes MinIO buckets on startup

Local development:

# Run with auto-reload
uvicorn api.main:app --reload --host 0.0.0.0 --port 8000

# Run tests (if available in parent repo)
pytest tests/

Health check endpoint:

curl http://localhost:8000/health

Logs:

# Docker
docker-compose logs -f arc-api

# Native (systemd)
sudo journalctl -u arc-api -f

Public Endpoints (No Authentication Required)

  • GET / - API information
  • GET /health - Service health check
  • GET /ready - Readiness probe
  • GET /docs - Swagger UI documentation
  • GET /redoc - ReDoc documentation
  • GET /openapi.json - OpenAPI specification

Note: All other endpoints require Bearer token authentication.

MessagePack Binary Protocol (Recommended - 3x faster):

  • POST /write/v2/msgpack - Write data via MessagePack
  • POST /api/v2/msgpack - Alternative endpoint
  • GET /write/v2/msgpack/stats - Get ingestion statistics
  • GET /write/v2/msgpack/spec - Get protocol specification

Line Protocol (InfluxDB compatibility):

  • POST /write - InfluxDB 1.x compatible write
  • POST /api/v1/write - InfluxDB 1.x API format
  • POST /api/v2/write - InfluxDB 2.x API format
  • POST /api/v1/query - InfluxDB 1.x query format
  • GET /write/health - Write endpoint health check
  • GET /write/stats - Write statistics
  • POST /write/flush - Force flush write buffer

Query Endpoints:

  • POST /query - Execute DuckDB SQL query
  • POST /query/estimate - Estimate query cost
  • POST /query/stream - Stream large query results
  • GET /query/{measurement} - Get measurement data
  • GET /query/{measurement}/csv - Export measurement as CSV
  • GET /measurements - List all measurements/tables

Authentication & Token Management:

  • GET /auth/verify - Verify token validity
  • GET /auth/tokens - List all tokens
  • POST /auth/tokens - Create new token
  • GET /auth/tokens/{id} - Get token details
  • PATCH /auth/tokens/{id} - Update token
  • DELETE /auth/tokens/{id} - Delete token
  • POST /auth/tokens/{id}/rotate - Rotate token (generate new)
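
A small example of calling the token endpoints listed above from Python; only documented GET endpoints are used, and the response fields are whatever Arc returns (see /docs):

import requests

ARC_TOKEN = "your-token-here"
HEADERS = {"Authorization": f"Bearer {ARC_TOKEN}"}
BASE = "http://localhost:8000"

# Confirm the current token is valid
print(requests.get(f"{BASE}/auth/verify", headers=HEADERS).json())

# List existing tokens
print(requests.get(f"{BASE}/auth/tokens", headers=HEADERS).json())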

Health & Monitoring:

  • GET /health - Service health check
  • GET /ready - Readiness probe
  • GET /metrics - Prometheus metrics
  • GET /metrics/timeseries/{type} - Time-series metrics
  • GET /metrics/endpoints - Endpoint statistics
  • GET /metrics/query-pool - Query pool status
  • GET /metrics/memory - Memory profile
  • GET /logs - Application logs

InfluxDB Connections:

  • GET /connections/influx - List InfluxDB connections
  • POST /connections/influx - Create InfluxDB connection
  • PUT /connections/influx/{id} - Update connection
  • DELETE /connections/{type}/{id} - Delete connection
  • POST /connections/{type}/{id}/activate - Activate connection
  • POST /connections/{type}/test - Test connection

Storage Connections:

  • GET /connections/storage - List storage backends
  • POST /connections/storage - Create storage connection
  • PUT /connections/storage/{id} - Update storage connection

Export Jobs:

  • GET /jobs - List all export jobs
  • POST /jobs - Create new export job
  • PUT /jobs/{id} - Update job configuration
  • DELETE /jobs/{id} - Delete job
  • GET /jobs/{id}/executions - Get job execution history
  • POST /jobs/{id}/run - Run job immediately
  • POST /jobs/{id}/cancel - Cancel running job
  • GET /monitoring/jobs - Monitor job status

HTTP/JSON Export:

  • POST /api/http-json/connections - Create HTTP/JSON connection
  • GET /api/http-json/connections - List connections
  • GET /api/http-json/connections/{id} - Get connection details
  • PUT /api/http-json/connections/{id} - Update connection
  • DELETE /api/http-json/connections/{id} - Delete connection
  • POST /api/http-json/connections/{id}/test - Test connection
  • POST /api/http-json/connections/{id}/discover-schema - Discover schema
  • POST /api/http-json/export - Export data via HTTP

Query Cache:

  • GET /cache/stats - Cache statistics
  • GET /cache/health - Cache health status
  • POST /cache/clear - Clear query cache
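
For example, the cache endpoints above could be exercised like this; the exact fields in the stats response are not documented here, so treat them as whatever Arc returns:

import requests

ARC_TOKEN = "your-token-here"
HEADERS = {"Authorization": f"Bearer {ARC_TOKEN}"}
BASE = "http://localhost:8000"

# Inspect cache statistics
print(requests.get(f"{BASE}/cache/stats", headers=HEADERS).json())

# Clear all cached query results, e.g. after a large backfill
requests.post(f"{BASE}/cache/clear", headers=HEADERS).raise_for_status()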

Interactive API Documentation

Arc Core includes auto-generated API documentation:

  • Swagger UI: http://localhost:8000/docs
  • ReDoc: http://localhost:8000/redoc
  • OpenAPI JSON: http://localhost:8000/openapi.json

Arc Core is under active development. Current focus areas:

  • Performance Optimization: Further improvements to ingestion and query performance
  • API Stability: Finalizing core API contracts
  • Enhanced Monitoring: Additional metrics and observability features
  • Documentation: Expanded guides and tutorials
  • Production Hardening: Testing and validation for production use cases

We welcome feedback and feature requests as we work toward a stable 1.0 release.

Arc Core is licensed under the GNU Affero General Public License v3.0 (AGPL-3.0).

This means:

  • Free to use - Use Arc Core for any purpose
  • Free to modify - Modify the source code as needed
  • Free to distribute - Share your modifications with others
  • ⚠️ Share modifications - If you modify Arc and run it as a service, you must share your changes under AGPL-3.0

AGPL-3.0 ensures that improvements to Arc benefit the entire community, even when run as a cloud service. This prevents the "SaaS loophole" where companies could take the code, improve it, and keep changes proprietary.

For organizations that require:

  • Proprietary modifications without disclosure
  • Commercial support and SLAs
  • Enterprise features and managed services

Please contact us at: enterprise[at]basekick[dot]net

We offer dual licensing and commercial support options.

  • Community Support: GitHub Issues
  • Enterprise Support: enterprise[at]basekick[dot]net
  • General Inquiries: support[at]basekick[dot]net

Arc Core is provided "as-is" in alpha state. While we use it extensively for development and testing, it is not yet production-ready. Features and APIs may change without notice. Always back up your data and test thoroughly in non-production environments before considering any production deployment.
