Show HN: OctaneDB – Fast, Open-Source Vector Database for Python

Original link: https://github.com/RijinRaju/octanedb

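## OctaneDB: High-Performance Vector Database

OctaneDB is a lightweight Python library that provides a very fast vector database for AI/ML applications. It boasts **10x faster performance** than popular alternatives such as Pinecone, ChromaDB, and Qdrant, with query times **under one millisecond** and an insertion rate of **more than 3,000 vectors per second**.

Key features include memory usage optimized through HDF5 compression, flexible indexing options (HNSW for speed, FlatIndex for accuracy), and configurable performance-tuning parameters. OctaneDB natively supports text embeddings via sentence-transformers, with a choice of models and **GPU acceleration**. It offers a ChromaDB-compatible API for easy migration and supports both in-memory and persistent storage.

Developers benefit from a simple API, comprehensive documentation, and extensive tests. OctaneDB excels in applications such as semantic search, recommendation systems, and image retrieval. The benchmarks show significant advantages over existing solutions in speed, memory usage, and index build time.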

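## OctaneDB: A New Vector Database for Python

OctaneDB is a new open-source vector database designed for fast similarity search over high-dimensional data, targeting AI/ML applications such as semantic search and embedding retrieval. It advertises sub-millisecond query times, supports in-memory and persistent (HDF5) storage, and integrates with sentence-transformers.

The developer claims that OctaneDB is 10x faster than Pinecone or ChromaDB at vector search and batch insertion, using advanced indexing such as HNSW and GPU acceleration.

However, the project immediately drew scrutiny from the Hacker News community. Concerns centered on the quality of the README (suspected to be AI-generated), the lack of benchmarks supporting the performance claims, and doubts about the developer's grasp of database fundamentals given the simplicity of the codebase and its limited commit history. The developer has acknowledged the README problems and promised an update.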

Original text

Python 3.8+ | License: MIT

OctaneDB is a lightweight, high-performance vector database library for Python. In the project's own benchmarks it runs roughly 10x faster than existing solutions such as Pinecone, ChromaDB, and Qdrant. It is built with modern Python and optimized algorithms, and it targets AI/ML applications that need fast similarity search.

Performance

  • 10x faster than existing vector databases
  • Sub-millisecond query response times
  • 3,000+ vectors/second insertion rate
  • Optimized memory usage with HDF5 compression

Indexing

  • HNSW (Hierarchical Navigable Small World) for ultra-fast approximate search
  • FlatIndex for exact similarity search
  • Configurable parameters for performance tuning
  • Automatic index optimization
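
To make the difference between the two index types concrete, here is a minimal sketch of what an exact ("flat") search does: it compares the query against every stored vector and keeps the k closest. The function name `flat_search` and the dict-of-lists layout are illustrative assumptions, not OctaneDB's actual FlatIndex implementation.

```python
import heapq
import math

def flat_search(vectors, query, k):
    """Exact k-nearest-neighbour search by scanning every stored vector.

    `vectors` maps an id to a list of floats (a hypothetical layout).
    Returns up to k (id, distance) pairs, closest first, using Euclidean distance.
    """
    scored = (
        (math.dist(vec, query), vec_id)  # distance first so it drives the ordering
        for vec_id, vec in vectors.items()
    )
    return [(vec_id, d) for d, vec_id in heapq.nsmallest(k, scored)]

vectors = {
    "a": [0.0, 0.0],
    "b": [1.0, 0.0],
    "c": [3.0, 4.0],
}
nearest = flat_search(vectors, [0.9, 0.1], k=2)
print(nearest)  # "b" is closest, then "a"
```

Because every vector is visited, the cost grows linearly with the size of the collection. Approximate indexes such as HNSW give up some accuracy to avoid that full scan.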

📚 Text Embedding Support 🆕

  • ChromaDB-compatible API for easy migration
  • Automatic text-to-vector conversion using sentence-transformers
  • Multiple embedding models (all-MiniLM-L6-v2, all-mpnet-base-v2, etc.)
  • GPU acceleration support (CUDA)
  • Batch processing for improved performance

Storage

  • In-memory for maximum speed
  • Persistent file-based storage
  • Hybrid mode for best of both worlds
  • HDF5 format for efficient compression

Search

  • Multiple distance metrics: Cosine, Euclidean, Dot Product, Manhattan, Chebyshev, Jaccard
  • Advanced metadata filtering with logical operators
  • Batch search operations
  • Text-based search with automatic embedding
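
For readers unfamiliar with the listed distance metrics, the sketch below shows their textbook definitions in plain Python. It is an illustration only: the function names are made up for this example, and OctaneDB's own implementations (and its exact Jaccard convention) may differ.

```python
import math

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm  # 0 for the same direction, 2 for opposite directions

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dot_product(a, b):
    return sum(x * y for x, y in zip(a, b))  # a similarity: larger means closer

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def chebyshev(a, b):
    return max(abs(x - y) for x, y in zip(a, b))

def jaccard_distance(a, b):
    # Assumption for this sketch: non-zero entries are treated as set membership.
    inter = sum(1 for x, y in zip(a, b) if x and y)
    union = sum(1 for x, y in zip(a, b) if x or y)
    return 1.0 - inter / union if union else 0.0

a, b = [1.0, 0.0, 2.0], [0.0, 3.0, 2.0]
for fn in (cosine_distance, euclidean, dot_product, manhattan, chebyshev, jaccard_distance):
    print(f"{fn.__name__}: {fn(a, b):.4f}")
```

Note that dot product is a similarity rather than a distance, so a search using it ranks larger values first.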

🛠️ Developer Experience

  • Simple, intuitive API similar to ChromaDB
  • Comprehensive documentation and examples
  • Type hints throughout
  • Extensive testing suite

Quick Start

from octanedb import OctaneDB

# Initialize with text embedding support
db = OctaneDB(
    dimension=384,  # Will be auto-set by embedding model
    embedding_model="all-MiniLM-L6-v2"
)

# Create a collection
collection = db.create_collection("documents")
db.use_collection("documents")

# Add text documents (ChromaDB-compatible!)
result = db.add(
    ids=["doc1", "doc2"],
    documents=[
        "This is a document about pineapple",
        "This is a document about oranges"
    ],
    metadatas=[
        {"category": "tropical", "color": "yellow"},
        {"category": "citrus", "color": "orange"}
    ]
)

# Search by text query
results = db.search_text(
    query_text="fruit",
    k=2,
    filter="category == 'tropical'",
    include_metadata=True
)

for doc_id, distance, metadata in results:
    print(f"Document: {db.get_document(doc_id)}")
    print(f"Distance: {distance:.4f}")
    print(f"Metadata: {metadata}")

📚 Text Embedding Examples

Here's a complete working example that demonstrates OctaneDB's core functionality:

from octanedb import OctaneDB

# Initialize database with text embeddings
db = OctaneDB(
    dimension=384,  # Output dimension of all-MiniLM-L6-v2
    storage_mode="in-memory",
    enable_text_embeddings=True,
    embedding_model="all-MiniLM-L6-v2"  # Lightweight model
)

# Create a collection
db.create_collection("fruits")
db.use_collection("fruits")

# Add some fruit documents
fruits_data = [
    {"id": "apple", "text": "Apple is a sweet and crunchy fruit that grows on trees.", "category": "temperate"},
    {"id": "banana", "text": "Banana is a yellow tropical fruit rich in potassium.", "category": "tropical"},
    {"id": "mango", "text": "Mango is a sweet tropical fruit with a large seed.", "category": "tropical"},
    {"id": "orange", "text": "Orange is a citrus fruit with a bright orange peel.", "category": "citrus"}
]

for fruit in fruits_data:
    db.add(
        ids=[fruit["id"]],
        documents=[fruit["text"]],
        metadatas=[{"category": fruit["category"], "type": "fruit"}]
    )

# Simple text search
results = db.search_text(query_text="sweet", k=2, include_metadata=True)
print("Sweet fruits:")
for doc_id, distance, metadata in results:
    print(f"  • {doc_id}: {metadata.get('document', 'N/A')[:50]}...")

# Text search with filter
results = db.search_text(
    query_text="fruit", 
    k=2, 
    filter="category == 'tropical'",
    include_metadata=True
)
print("\nTropical fruits:")
for doc_id, distance, metadata in results:
    print(f"  • {doc_id}: {metadata.get('document', 'N/A')[:50]}...")
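
The string filter used above, `"category == 'tropical'"`, can be read as a field name, a comparison operator, and a quoted value. The sketch below shows one way such an expression could be checked against a metadata dict. The `matches` helper and its regular expression are hypothetical and say nothing about how OctaneDB parses filters internally.

```python
import re

# Accepts: <field> == '<value>'  or  <field> != '<value>' (single or double quotes).
_FILTER_RE = re.compile(r"""^\s*(\w+)\s*(==|!=)\s*(['"])(.*?)\3\s*$""")

def matches(filter_expr, metadata):
    """Return True if `metadata` satisfies a simple equality filter string."""
    m = _FILTER_RE.match(filter_expr)
    if m is None:
        raise ValueError(f"unsupported filter: {filter_expr!r}")
    field, op, _quote, value = m.groups()
    is_equal = metadata.get(field) == value  # only string values are compared here
    return is_equal if op == "==" else not is_equal

records = {
    "banana": {"category": "tropical"},
    "orange": {"category": "citrus"},
}
tropical = [doc_id for doc_id, meta in records.items()
            if matches("category == 'tropical'", meta)]
print(tropical)
```

Rejecting anything outside this small grammar, rather than evaluating the string as code, keeps untrusted filter input from executing.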

If you're using ChromaDB, migrating to OctaneDB is seamless:

# Old ChromaDB code
# collection.add(
#     ids=["id1", "id2"],
#     documents=["doc1", "doc2"]
# )

# New OctaneDB code (same arguments, called on the database object)
db.add(
    ids=["id1", "id2"],
    documents=["doc1", "doc2"]
)

# Batch text search
query_texts = ["machine learning", "artificial intelligence", "data science"]
batch_results = db.search_text_batch(
    query_texts=query_texts,
    k=5,
    include_metadata=True
)

# Change embedding models
db.change_embedding_model("all-mpnet-base-v2")  # Higher quality, 768 dimensions

# Get available models
models = db.get_available_models()
print(f"Available models: {models}")

# Use pre-computed embeddings (their width must match the database dimension)
import numpy as np

custom_embeddings = np.random.randn(100, 384).astype(np.float32)
result = db.add(
    ids=[f"vec_{i}" for i in range(100)],
    embeddings=custom_embeddings,
    metadatas=[{"source": "custom"} for _ in range(100)]
)

# Optimize for speed vs. accuracy
db = OctaneDB(
    dimension=384,
    m=8,              # Fewer connections = faster, less accurate
    ef_construction=100,  # Lower = faster build
    ef_search=50      # Lower = faster search
)
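
When lowering `m`, `ef_construction`, or `ef_search`, it helps to measure how much accuracy is lost. A common way is recall@k: the fraction of the exact top-k results that the approximate search also returns. The helper below is a generic sketch with made-up ids. It is not part of OctaneDB, and the two lists would come from your own exact and approximate searches.

```python
def recall_at_k(approx_ids, exact_ids):
    """Fraction of the exact top-k ids that the approximate search also returned."""
    if not exact_ids:
        return 1.0
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)

exact = ["v1", "v2", "v3", "v4"]   # e.g. ids from an exact (flat) search
approx = ["v1", "v3", "v9", "v4"]  # e.g. ids from an approximate search with low settings
print(recall_at_k(approx, exact))  # 0.75
```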

# Persistent storage
db = OctaneDB(
    dimension=384,
    storage_path="./data",
    embedding_model="all-MiniLM-L6-v2"
)

# Save and load
db.save("./my_database.h5")
loaded_db = OctaneDB.load("./my_database.h5")

# Complex filters (dict syntax; note that the query engine for these is still under development)
results = db.search_text(
    query_text="technology",
    k=10,
    filter={
        "$and": [
            {"category": "tech"},
            {"$or": [
                {"year": {"$gte": 2020}},
                {"priority": "high"}
            ]}
        ]
    }
)
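
The nested filter above follows the familiar `$and` / `$or` / `$gte` style. As a sketch of its semantics, the function below evaluates such a dict against one metadata record. It is an illustration written for this example, with an assumed operator set, and does not reflect OctaneDB's query engine, which the project describes as still under development.

```python
import operator

# Assumed comparison operators for this sketch.
_COMPARATORS = {
    "$eq": operator.eq, "$ne": operator.ne,
    "$gt": operator.gt, "$gte": operator.ge,
    "$lt": operator.lt, "$lte": operator.le,
}

def matches_filter(flt, metadata):
    """Return True if `metadata` satisfies every condition in `flt`."""
    for key, cond in flt.items():
        if key == "$and":
            if not all(matches_filter(sub, metadata) for sub in cond):
                return False
        elif key == "$or":
            if not any(matches_filter(sub, metadata) for sub in cond):
                return False
        elif isinstance(cond, dict):
            value = metadata.get(key)
            for op, operand in cond.items():
                # A missing field never satisfies a comparison in this sketch.
                if value is None or not _COMPARATORS[op](value, operand):
                    return False
        elif metadata.get(key) != cond:  # bare value means equality
            return False
    return True

flt = {
    "$and": [
        {"category": "tech"},
        {"$or": [{"year": {"$gte": 2020}}, {"priority": "high"}]},
    ]
}
print(matches_filter(flt, {"category": "tech", "year": 2021}))
print(matches_filter(flt, {"category": "tech", "year": 2019, "priority": "low"}))
```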

Troubleshooting

  1. Empty search results: Pass include_metadata=True to your search methods to get metadata back.

  2. Query engine warnings: The query engine for complex filters is under development. For now, use simple string filters like "category == 'tropical'".

  3. Index not built: The index is built automatically when needed. To trigger a build manually, call collection._build_index().

  4. Text embeddings not working: Ensure that sentence-transformers is installed: pip install sentence-transformers

# This will work correctly:
results = db.search_text(
    query_text="fruit", 
    k=2, 
    filter="category == 'tropical'",
    include_metadata=True  # Important!
)

# Process results correctly:
for doc_id, distance, metadata in results:
    print(f"ID: {doc_id}, Distance: {distance:.4f}")
    if metadata:
        print(f"  Document: {metadata.get('document', 'N/A')}")
        print(f"  Category: {metadata.get('category', 'N/A')}")
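
For the "text embeddings not working" case, a quick check of whether the optional dependencies are importable can save time. The snippet below uses only the standard library. The helper name `missing_packages` is made up for this example, and the package list is taken from the requirements stated in this README.

```python
import importlib.util

def missing_packages(names):
    """Return the import names from `names` that cannot be found."""
    return [n for n in names if importlib.util.find_spec(n) is None]

# Import names, not pip names: sentence-transformers imports as sentence_transformers.
missing = missing_packages(["sentence_transformers", "transformers", "torch"])
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All text-embedding dependencies are importable.")
```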

📊 Performance Benchmarks

| Operation | OctaneDB | ChromaDB | Pinecone | Qdrant |
|---|---|---|---|---|
| Insert (vectors/sec) | 3,200 | 320 | 280 | 450 |
| Search (ms) | 0.8 | 8.2 | 15.1 | 12.3 |
| Memory Usage | 1.2 GB | 2.8 GB | 3.1 GB | 2.5 GB |
| Index Build Time | 45 s | 180 s | 120 s | 95 s |

Benchmarks performed on 100K vectors, 384 dimensions, Intel i7-12700K, 32GB RAM
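
To see how the headline "10x" relates to these figures, the ratios can be worked out directly. The snippet below only does arithmetic on the numbers reported in the table. It does not re-run any benchmark, and the metric keys are names chosen for this example.

```python
# Figures copied from the benchmark table above; nothing here is re-measured.
HIGHER_IS_BETTER = {"insert_vectors_per_sec"}

table = {
    "insert_vectors_per_sec": {"OctaneDB": 3200, "ChromaDB": 320, "Pinecone": 280, "Qdrant": 450},
    "search_ms": {"OctaneDB": 0.8, "ChromaDB": 8.2, "Pinecone": 15.1, "Qdrant": 12.3},
    "memory_gb": {"OctaneDB": 1.2, "ChromaDB": 2.8, "Pinecone": 3.1, "Qdrant": 2.5},
    "index_build_s": {"OctaneDB": 45, "ChromaDB": 180, "Pinecone": 120, "Qdrant": 95},
}

def advantage(metric):
    """Return how many times better OctaneDB's figure is than each competitor's."""
    row = table[metric]
    base = row["OctaneDB"]
    return {
        name: (base / value if metric in HIGHER_IS_BETTER else value / base)
        for name, value in row.items()
        if name != "OctaneDB"
    }

for metric in table:
    ratios = ", ".join(f"{name} {r:.1f}x" for name, r in advantage(metric).items())
    print(f"{metric}: {ratios}")
```

By these figures the insertion advantage ranges from about 7x (Qdrant) to about 11x (Pinecone), search from about 10x to about 19x, memory from about 2.1x to 2.6x, and index build time from about 2.1x to 4x. The headline "10x" therefore matches the ChromaDB insert and search rows most closely.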

OctaneDB
├── Core (OctaneDB)
│   ├── Collection Management
│   ├── Text Embedding Engine
│   └── Storage Manager
├── Collections
│   ├── Vector Storage (HDF5)
│   ├── Metadata Management
│   └── Index Management
├── Indexing
│   ├── HNSW Index
│   ├── Flat Index
│   └── Distance Metrics
├── Text Processing
│   ├── Sentence Transformers
│   ├── GPU Acceleration
│   └── Batch Processing
└── Storage
    ├── HDF5 Vectors
    ├── Msgpack Metadata
    └── Compression

🔌 Installation Options

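# With GPU support (the quotes keep shells such as zsh from expanding the brackets)
pip install "octanedb[gpu]"

# From source
git clone https://github.com/RijinRaju/octanedb.git
cd octanedb
pip install -e .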

Requirements

  • Python: 3.8+
  • Core: NumPy, SciPy, h5py, msgpack
  • Text Embeddings: sentence-transformers, transformers, torch
  • Optional: CUDA for GPU acceleration

Use Cases

  • AI/ML Applications: Fast similarity search for embeddings
  • Document Search: Semantic search across text documents
  • Recommendation Systems: Find similar items quickly
  • Image Search: Vector similarity for image embeddings
  • NLP Applications: Text clustering and similarity
  • Research: Fast prototyping and experimentation

We welcome contributions! Please see our Contributing Guide for details.

git clone https://github.com/RijinRaju/octanedb.git
cd octanedb
pip install -e ".[dev]"
pytest tests/

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

  • HNSW Algorithm: Based on the Hierarchical Navigable Small World paper
  • Sentence Transformers: For text embedding capabilities
  • HDF5: For efficient vector storage
  • NumPy: For fast numerical operations

Made with ❤️ by the OctaneDB Team

OctaneDB: Where speed meets simplicity in vector databases.
