Run a 1T-parameter model on a 32 GB Mac by streaming tensors from NVMe.
Hypura – A storage-tier-aware LLM inference scheduler for Apple Silicon

Original link: https://github.com/t8/hypura

## Hypura: Running large language models on a Mac

Hypura is an LLM inference scheduler built for Apple Silicon Macs. By intelligently distributing tensors across GPU, RAM, and NVMe storage, it can run models that exceed available RAM, avoiding the crashes that occur when tools like llama.cpp try to load an oversized model (for example, a 31 GB Mixtral on a 32 GB Mac Mini).

Hypura profiles the hardware and optimizes tensor placement, prioritizing frequently accessed data (norms, embeddings) on the GPU. For mixture-of-experts (MoE) models such as Mixtral, it streams only the active expert weights from NVMe, cutting I/O by 75%, and uses a neuron cache with a 99.5% hit rate. Dense models like Llama 70B use a similar streaming approach for their FFN layers.

The system automatically tunes prefetch and pool sizes to the available memory, with no manual adjustment. Hypura adds no overhead for models that fit in memory, and delivers a usable experience for larger ones, reaching 2.2 tok/s on Mixtral and 0.3 tok/s on Llama 70B. It is distributed via Cargo and includes an Ollama-compatible API for easy integration with tools such as OpenClaw. Importantly, Hypura mostly *reads* from the SSD, minimizing wear.

## Running large language models on memory-limited Macs

A new project ([github.com/t8](https://github.com/t8)) makes it possible to run large language models (up to 1 trillion parameters) on a Mac with only 32 GB of RAM. It works by "intelligently" streaming tensors (model weights) from the NVMe storage drive, which acts as an extended, optimized tier of swap memory.

The core idea, implemented in Rust on top of `llama.cpp`, is to place tensors on GPU, RAM, or NVMe according to access frequency. For mixture-of-experts (MoE) models such as Mixtral, only the activated expert weights are kept on the GPU, while the rest stream from NVMe. This avoids the out-of-memory errors seen with stock `llama.cpp`.

Performance varies: Mixtral 8x7B reaches 2.2 tokens per second, while Llama 3 70B runs at 0.3 tokens per second. The project also provides an Ollama-compatible API. The author notes the approach is early-stage and I/O-bound for dense models, but it demonstrates the potential of NVMe as a viable memory tier on Apple Silicon. Notably, most of the code was written by an LLM under the author's direction.

## Original
 _   _
| | | |_   _ _ __  _   _ _ __ __ _
| |_| | | | | '_ \| | | | '__/ _` |
|  _  | |_| | |_) | |_| | | | (_| |
|_| |_|\__, | .__/ \__,_|_|  \__,_|
       |___/|_|
   Run models too big for your Mac's memory

Hypura is a storage-tier-aware LLM inference scheduler for Apple Silicon. It places model tensors across GPU, RAM, and NVMe tiers based on access patterns, bandwidth costs, and hardware capabilities — enabling models that exceed physical memory to run without crashing the system.

Run a 31 GB Mixtral 8x7B on a 32 GB Mac Mini at 2.2 tok/s. A 40 GB Llama 70B at 0.3 tok/s. Vanilla llama.cpp crashes on both.

Consumer hardware (MacBook Pro, Mac Studio) ships with fast unified memory and NVMe storage, but limited capacity. A 32 GB M1 Max cannot naively load a 40 GB model — the OS will swap-thrash until the OOM killer intervenes.

Hypura solves this by understanding the model architecture:

  • Norms and embeddings are tiny but accessed every token — pinned to GPU
  • MoE expert routing exploits sparsity — only 2 of 8 experts fire per token. Router interception identifies selected experts in the eval callback, then loads only the needed expert strides from NVMe (75% I/O reduction). A neuron cache tracks loaded expert slices across tokens, achieving 99.5% hit rate from temporal locality. Co-activation tracking predicts which experts will fire next for speculative prefetch.
  • Dense FFN weights (gate, up, down — ~60% of model size) stream from NVMe through a dynamically-sized pool buffer while attention + norms stay GPU-resident. Prefetch lookahead depth scales automatically with available memory.
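The expert-streaming path can be sketched as a small cache keyed by (layer, expert): only router-selected experts are fetched, and slices already resident are reused across tokens. A minimal illustrative sketch in Python (class and function names are hypothetical, not Hypura's actual Rust API):

```python
class NeuronCache:
    """Toy stand-in for the neuron cache: keeps loaded expert slices across
    tokens so repeated selections hit RAM instead of NVMe."""
    def __init__(self, load_fn):
        self.load_fn = load_fn        # hypothetical NVMe reader
        self.slices = {}              # (layer, expert) -> weights
        self.hits = self.misses = 0

    def get(self, layer, expert):
        key = (layer, expert)
        if key not in self.slices:    # cold: stream the expert stride from NVMe
            self.misses += 1
            self.slices[key] = self.load_fn(layer, expert)
        else:                         # warm: temporal locality pays off
            self.hits += 1
        return self.slices[key]

# Only the 2 experts the router selected are touched; the other 6 are never read.
cache = NeuronCache(load_fn=lambda layer, expert: f"weights[{layer}][{expert}]")
for token in range(100):              # the same experts re-fire across tokens
    for expert in (3, 7):             # router picked experts 3 and 7
        cache.get(layer=0, expert=expert)
```

After the first token, every access is a cache hit, which is the temporal locality behind the 99.5% hit rate figure.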

The result: models that would crash your machine under naive mmap become runnable. Models that fit in memory run at full Metal GPU speed with zero overhead.

Hypura reads the GGUF file, profiles your hardware (GPU working set, RAM, NVMe bandwidth), and solves a placement optimization that assigns every tensor to a tier:

  • GPU (Metal) — Attention layers, norms, embeddings. Fastest access, limited by recommendedMaxWorkingSetSize.
  • RAM — Overflow layers that don't fit in the GPU working set. Accessed via mmap.
  • NVMe — Remaining layers loaded on-demand via direct I/O (F_NOCACHE + pread), prefetched ahead of the forward pass.
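The placement step can be approximated by a greedy pass: sort tensors by how hot they are, then assign each to the fastest tier with room left. This is a simplified sketch under the assumption of a purely greedy solver (Hypura combines an LP formulation with greedy placement; the tuple shape here is illustrative):

```python
def place_tensors(tensors, gpu_cap, ram_cap):
    """Greedy tiering: hottest tensors go to the fastest tier that still fits.
    `tensors` is a list of (name, size_bytes, access_score) tuples."""
    placement, used = {}, {"gpu": 0, "ram": 0}
    caps = {"gpu": gpu_cap, "ram": ram_cap}
    for name, size, _score in sorted(tensors, key=lambda t: -t[2]):
        for tier in ("gpu", "ram"):
            if used[tier] + size <= caps[tier]:
                placement[name] = tier
                used[tier] += size
                break
        else:
            placement[name] = "nvme"  # cold tensors stream on demand
    return placement
```

With tight capacities, norms and embeddings (accessed every token, so highest score) land on GPU while bulky FFN weights fall through to NVMe, matching the tier descriptions above.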

Hypura selects the best inference mode automatically based on model size, architecture, and available memory:

  • Full-resident — Model fits in GPU+RAM. No NVMe I/O. Full Metal speed.
  • Expert-streaming — For MoE models (Mixtral). Only non-expert tensors (~1 GB) stay on GPU. Expert tensors stream from NVMe through a pool buffer on demand, with a neuron cache (99.5% hit rate) that eliminates most I/O after warmup.
  • Dense FFN-streaming — For dense models too large for GPU (Llama 70B). Attention + norms stay on GPU (~8 GB). FFN tensors (~32 GB) stream from NVMe through a dynamically-sized pool buffer, with scaled prefetch lookahead.
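The mode decision above amounts to a short cascade. A hedged sketch (the function and its exact thresholds are illustrative, not Hypura's code):

```python
def select_mode(model_bytes, gpu_budget, ram_budget, is_moe):
    """Pick an inference mode from the three described above."""
    if model_bytes <= gpu_budget + ram_budget:
        return "full-resident"        # no NVMe I/O needed
    if is_moe:
        return "expert-streaming"     # exploit 2-of-8 expert sparsity
    return "dense-ffn-streaming"      # stream FFN weights, keep attention on GPU
```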

Pool buffer size, prefetch depth, and memory budgets are computed automatically from your hardware profile — no manual tuning needed.
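One plausible shape for that auto-tuning, sketched in Python (the formula here is an assumption for illustration; Hypura derives its own values from the hardware profile):

```python
def pool_config(free_bytes, slice_bytes, max_slots=32):
    """Hypothetical sizing rule: spend free memory on pool slots, and scale
    prefetch lookahead with pool size."""
    slots = max(2, min(max_slots, free_bytes // slice_bytes))
    prefetch = max(1, slots // 4)
    return slots, prefetch
```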

All benchmarks on M1 Max, 32 GB unified memory, ~5.1 GB/s NVMe sequential read.

| Model | Size | GPU | NVMe | Mode | Hypura | llama.cpp | Notes |
|---|---|---|---|---|---|---|---|
| Qwen 2.5 14B Q4_K_M | 8.4 GB | 8.4 GB | — | full-resident | 21 tok/s | ~21 tok/s | Fits in GPU; no overhead |
| Mixtral 8x7B Q5_K_M | 30.9 GB | 1.1 GB | 29.8 GB | expert-streaming | 2.2 tok/s | OOM | All layers on Metal; 99.5% cache hit rate |
| Llama 3.3 70B Q4_K_M | 39.6 GB | 7.8 GB | 31.8 GB | dense-FFN-streaming | 0.3 tok/s | OOM | All layers on Metal; dynamic 24-slot pool, 7-layer prefetch |

Key takeaway: For models that fit in memory, Hypura adds zero overhead. For models that don't fit, Hypura is the difference between "runs" and "crashes." Expert-streaming on Mixtral achieves usable interactive speeds by keeping only non-expert tensors on GPU and exploiting MoE sparsity (only 2/8 experts fire per token). Dense FFN-streaming extends this to non-MoE models like Llama 70B. Pool sizes and prefetch depth scale automatically with available memory.
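A back-of-envelope check shows why the neuron cache matters for the Mixtral numbers: tokens per second is bounded above by NVMe bandwidth divided by bytes streamed per token. The figures below come from the benchmark table; the per-token byte estimate is derived, not measured:

```python
# I/O-bound throughput estimate for Mixtral expert-streaming.
nvme_bw = 5.1e9                       # B/s sequential read (M1 Max, above)
expert_bytes = 29.8e9                 # Mixtral expert tensors resident on NVMe
per_token = expert_bytes * 2 / 8      # MoE sparsity: only 2 of 8 experts fire
bound_no_cache = nvme_bw / per_token              # ~0.68 tok/s, re-reading every slice
bound_cached = nvme_bw / (per_token * 0.005)      # ~137 tok/s at a 99.5% hit rate
```

So after warmup the measured 2.2 tok/s is compute-bound rather than I/O-bound; without the cache the same model would be stuck well under 1 tok/s even at full NVMe bandwidth.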

Hypura builds from source with Cargo. You'll need Rust 1.75+ and CMake (for the vendored llama.cpp).

git clone --recurse-submodules https://github.com/hypura/hypura.git
cd hypura
cargo build --release

The binary is at target/release/hypura.

Homebrew tap coming soon.

# Profile your hardware (runs once, cached)
hypura profile

# Run inference on a GGUF model
hypura run ./model.gguf --prompt "Hello, world"

# Interactive chat
hypura run ./model.gguf --interactive

# Benchmark: Hypura scheduling vs naive baseline
hypura bench ./model.gguf

# Inspect model placement plan without loading
hypura inspect ./model.gguf

Start with --max-tokens 10 on untested models before scaling up.

Hypura exposes an Ollama-compatible HTTP API, making it a drop-in replacement for any tool that talks to Ollama — including OpenClaw.

hypura serve ./model.gguf
# Hypura serving Mixtral 8x7B Instruct v0.1
#   Endpoint: http://127.0.0.1:8080
#   Ollama-compatible API: /api/generate, /api/chat, /api/tags
| Endpoint | Description |
|---|---|
| GET / | Health check |
| GET /api/tags | List loaded model |
| GET /api/version | Server version |
| POST /api/show | Model metadata |
| POST /api/generate | Text completion (streaming NDJSON or single response) |
| POST /api/chat | Chat completion (streaming NDJSON or single response) |
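The streaming endpoints emit NDJSON: one JSON object per line, each carrying a `response` text fragment, with `"done": true` on the final line (these field names follow the Ollama /api/generate format). A minimal client-side accumulator:

```python
import json

def collect_stream(lines):
    """Join the "response" fragments of an NDJSON stream into the full text."""
    text, done = [], False
    for line in lines:
        obj = json.loads(line)
        text.append(obj.get("response", ""))
        done = obj.get("done", False)
    return "".join(text), done

# Example chunks as they would arrive over HTTP, one JSON object per line:
chunks = ['{"response": "Hel", "done": false}',
          '{"response": "lo", "done": true}']
full_text, finished = collect_stream(chunks)
```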

Point OpenClaw at Hypura by setting the Ollama base URL in ~/.openclaw/openclaw.json:

{
  "models": {
    "providers": {
      "ollama": {
        "baseUrl": "http://127.0.0.1:8080",
        "api": "ollama"
      }
    }
  }
}

Or via the CLI:

openclaw config set models.providers.ollama.baseUrl "http://127.0.0.1:8080"

Hypura speaks native Ollama protocol (/api/chat with NDJSON streaming), so no compatibility shims are needed.

hypura serve <MODEL> [OPTIONS]

Options:
  --host <HOST>        Host to bind to [default: 127.0.0.1]
  --port <PORT>        Port to bind to [default: 8080]
  --context <N>        Maximum context length [default: 4096]

Hypura is a Cargo workspace with two crates:

  • hypura — Main binary and library. CLI in src/main.rs, all logic in src/lib.rs modules.
  • hypura-sys — FFI bindings to llama.cpp (vendored at vendor/llama.cpp/, built via CMake).
| Module | Purpose |
|---|---|
| scheduler/placement.rs | LP + greedy tensor placement across GPU/RAM/NVMe tiers |
| compute/inference.rs | Inference engine: generate_blocking, generate_with_nvme_scheduling, server-oriented load_model / generate_from_loaded |
| compute/nvme_backend.rs | Custom GGML buffer type, pool-based expert/FFN streaming, neuron cache, eval callback |
| server/routes.rs | Axum HTTP handlers for Ollama-compatible API |
| profiler/ | Hardware detection (CPU, GPU, memory bandwidth, NVMe throughput) |
| cli/bench.rs | A/B benchmark harness |
| model/tensor_role.rs | Tensor classification for placement scoring (norms, attention, MoE experts) |

Will Hypura wear out my SSD? No. Hypura only reads from your SSD during inference — it never writes to it.

SSD wear is caused by write cycles (program/erase cycles on NAND flash cells). Reads do not degrade flash cells. Hypura's entire NVMe I/O path uses read-only pread() calls with F_NOCACHE to stream tensor weights from the GGUF file into RAM/GPU memory pools, where all computation happens. The SSD is used as cold storage, not as working memory.
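That read-only access path can be sketched in a few lines. This is an illustration, not Hypura's Rust code: open read-only, ask the OS to bypass the page cache (F_NOCACHE is macOS-specific, so the sketch falls back gracefully elsewhere), and fetch a tensor slice at a known offset with pread. No write syscall appears anywhere on the path:

```python
import fcntl
import os

def read_slice(path, offset, length):
    """Read `length` bytes at `offset` from a GGUF file, uncached where possible."""
    fd = os.open(path, os.O_RDONLY)
    try:
        nocache = getattr(fcntl, "F_NOCACHE", None)
        if nocache is not None:          # macOS: skip the unified buffer cache
            fcntl.fcntl(fd, nocache, 1)
        return os.pread(fd, length, offset)
    finally:
        os.close(fd)
```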

The only writes Hypura performs are negligible: benchmark result JSON files (~KB), co-activation statistics (~KB to ~/.hypura/), and the one-time hypura optimize command if you choose to run it. Normal inference generates zero SSD writes.

  • bench --baseline is blocked when the model exceeds RAM minus 4 GB headroom. Use --force to override at your own risk.
  • Always start with --max-tokens 10 on untested models.
  • Test models belong in ./test-models/ (not checked in).

MIT

I feel morally obligated to say I did not write the code in this repository myself. This project is an exploration of using LLMs to carry out tasks under my direction. Most of the prompts that got me here were driven by the Socratic method, genuine curiosity, and a hunch that NVMe-backed inference is underutilized, given that NVMe is a (slow but) perfectly valid form of memory.
