Show HN: RunAnywhere – Faster AI Inference on Apple Silicon

Original link: https://github.com/RunanywhereAI/rcli

## RCLI: Local Voice AI on macOS

RCLI is a powerful, privacy-focused voice AI for macOS that runs entirely on Apple Silicon devices. It provides a complete speech-to-text (STT), large language model (LLM), and text-to-speech (TTS) pipeline, with no cloud connection and no API keys.

Users can issue 43 voice commands (for example, Spotify control or screenshots), hold natural voice conversations, and run local retrieval-augmented generation (RAG) over their documents with roughly 4 ms retrieval latency. Powered by the proprietary MetalRT GPU inference engine, RCLI achieves sub-200 ms end-to-end latency and supports hot-swapping between a range of open-source models (Qwen3, LFM2, Whisper, and others).

Installation takes a single command. A terminal dashboard provides model management, hardware monitoring, and a one-key voice interface. RCLI is open source under the MIT License; MetalRT is distributed under a separate proprietary license. It requires macOS 13+ and an Apple Silicon chip (M1 or later).

More information and installation instructions: [https://github.com/RunanywhereAI/RCLI](https://github.com/RunanywhereAI/RCLI)

## RunAnywhereAI: Faster Local AI on Apple Silicon

Sanchit and Shubham (YC W26) built **MetalRT**, a new inference engine designed specifically for Apple Silicon that significantly outperforms existing solutions such as llama.cpp, Apple's MLX, and Ollama on LLM, speech-to-text (STT), and text-to-speech (TTS) workloads. They achieve this speed with custom Metal shaders and by eliminating framework overhead.

They have also open-sourced **RCLI**, a complete end-to-end voice AI pipeline, from microphone input to spoken response, that runs entirely locally with no cloud connection or API keys. Benchmarks show RCLI reaching up to 714x real-time STT, with faster LLM decode and TTS than competing engines.

The developers focus on minimizing latency accumulation across the voice pipeline, optimizing *every* stage for speed. MetalRT gets direct GPU access through custom Metal compute shaders and pre-allocated memory. RCLI ships with a TUI, local RAG, and support for many models.

Resources: [GitHub (RCLI)](https://github.com/RunanywhereAI/RCLI), [Demo](https://www.youtube.com/watch?v=eTYwkgNoaKg), [Blog](https://www.runanywhere.ai/)

Original

Talk to your Mac, query your docs, no cloud required.


RCLI is an on-device voice AI for macOS. A complete STT + LLM + TTS pipeline running natively on Apple Silicon — 43 macOS actions via voice, local RAG over your documents, sub-200ms end-to-end latency. No cloud, no API keys.

Powered by MetalRT, a proprietary GPU inference engine built by RunAnywhere, Inc. specifically for Apple Silicon.

Real-time screen recordings on Apple Silicon — no cloud, no edits, no tricks.

Voice Conversation
Talk naturally — RCLI listens, understands, and responds on-device.


App Control
Control Spotify, adjust volume — 43 macOS actions by voice.


Models & Benchmarks
Browse models, hot-swap LLMs, run benchmarks — all from the TUI.


Document Intelligence (RAG)
Ingest docs, ask questions by voice — ~4ms hybrid retrieval.


Requires macOS 13+ on Apple Silicon (M1 or later).

One command:

curl -fsSL https://raw.githubusercontent.com/RunanywhereAI/RCLI/main/install.sh | bash

Or via Homebrew:

brew tap RunanywhereAI/rcli https://github.com/RunanywhereAI/RCLI.git
brew install rcli
rcli setup
rcli                             # interactive TUI (push-to-talk + text)
rcli listen                      # continuous voice mode
rcli ask "open Safari"           # one-shot command
rcli ask "play some jazz on Spotify"

A full STT + LLM + TTS pipeline running on Metal GPU with three concurrent threads:

  • VAD — Silero voice activity detection
  • STT — Zipformer streaming + Whisper / Parakeet offline
  • LLM — Qwen3 / LFM2 / Qwen3.5 with KV cache continuation and Flash Attention
  • TTS — Double-buffered sentence-level synthesis (next sentence renders while current plays)
  • Tool Calling — LLM-native tool call formats (Qwen3, LFM2, etc.)
  • Multi-turn Memory — Sliding window conversation history with token-budget trimming

Control your Mac by voice or text. The LLM routes intent to actions executed locally via AppleScript and shell commands.

| Category | Examples |
|----------|----------|
| Productivity | create_note, create_reminder, run_shortcut |
| Communication | send_message, facetime_call |
| Media | play_on_spotify, play_apple_music, play_pause, next_track, set_music_volume |
| System | open_app, quit_app, set_volume, toggle_dark_mode, screenshot, lock_screen |
| Web | search_web, search_youtube, open_url, open_maps |

Run rcli actions to see all 43, or toggle them on/off in the TUI Actions panel.

Tip: If tool calling feels unreliable, press X in the TUI to clear the conversation and reset context. With small LLMs, accumulated context can degrade tool-calling accuracy — a fresh context often fixes it.

Index local documents, query them by voice. Hybrid vector + BM25 retrieval with ~4ms latency over 5K+ chunks. Supports PDF, DOCX, and plain text.

rcli rag ingest ~/Documents/notes
rcli ask --rag ~/Library/RCLI/index "summarize the project plan"

A terminal dashboard with push-to-talk, live hardware monitoring, model management, and an actions browser.

| Key | Action |
|-----|--------|
| SPACE | Push-to-talk |
| M | Models — browse, download, hot-swap LLM/STT/TTS |
| A | Actions — browse, enable/disable macOS actions |
| B | Benchmarks — run STT, LLM, TTS, E2E benchmarks |
| R | RAG — ingest documents |
| X | Clear conversation and reset context |
| T | Toggle tool call trace |
| ESC | Stop / close / quit |

MetalRT is a high-performance GPU inference engine built by RunAnywhere, Inc. specifically for Apple Silicon. It delivers the fastest on-device inference for LLM, STT, and TTS — up to 550 tok/s LLM throughput and sub-200ms end-to-end voice latency.

Apple M3 or later required. MetalRT uses Metal 3.1 GPU features available on M3, M3 Pro, M3 Max, M4, and later chips. M1/M2 support is coming soon. On M1/M2, RCLI automatically falls back to the open-source llama.cpp engine.

MetalRT is automatically installed during rcli setup (choose "MetalRT" or "Both"). Or install separately:

rcli metalrt install
rcli metalrt status

Supported models: Qwen3 0.6B, Qwen3 4B, Llama 3.2 3B, LFM2.5 1.2B (LLM) · Whisper Tiny/Small/Medium (STT) · Kokoro 82M with 28 voices (TTS)

MetalRT is distributed under a proprietary license. For licensing inquiries: [email protected]

RCLI supports 20+ models across LLM, STT, TTS, VAD, and embeddings. All run locally on Apple Silicon. Use rcli models to browse, download, or switch.

LLM: LFM2 1.2B (default), LFM2 350M, LFM2.5 1.2B, LFM2 2.6B, Qwen3 0.6B, Qwen3.5 0.8B/2B/4B, Qwen3 4B

STT: Zipformer (streaming), Whisper base.en (offline, default), Parakeet TDT 0.6B (~1.9% WER)

TTS: Piper Lessac/Amy, KittenTTS Nano, Matcha LJSpeech, Kokoro English/Multi-lang

Default install (rcli setup): ~1GB — LFM2 1.2B + Whisper + Piper + Silero VAD + Snowflake embeddings.

rcli models                  # interactive model management
rcli upgrade-llm             # guided LLM upgrade
rcli voices                  # browse and switch TTS voices
rcli cleanup                 # remove unused models
Mic → VAD → STT → [RAG] → LLM → TTS → Speaker
                            |
                     Tool Calling → 43 macOS Actions

Three dedicated threads in live mode, synchronized via condition variables:

| Thread | Role |
|--------|------|
| STT | Captures audio, runs VAD, detects speech endpoints |
| LLM | Generates tokens, dispatches tool calls |
| TTS | Double-buffered sentence-level synthesis and playback |

Key design decisions:

  • 64 MB pre-allocated memory pool — zero runtime malloc during inference
  • Lock-free ring buffers for zero-copy audio transfer
  • System prompt KV caching across queries
  • Hardware profiling at startup for optimal config
  • Token-budget conversation trimming
  • Live model hot-swap without restarting
src/
  engines/     STT, LLM, TTS, VAD, embedding engine wrappers
  pipeline/    Orchestrator, sentence detector, text sanitizer
  rag/         Vector index, BM25, hybrid retriever
  core/        Types, ring buffer, memory pool, hardware profiler
  audio/       CoreAudio mic/speaker I/O
  tools/       Tool calling engine with JSON schema definitions
  actions/     43 macOS action implementations
  api/         C API (rcli_api.h)
  cli/         TUI dashboard (FTXUI), CLI commands
  models/      Model registries with on-demand download

CPU-only build using llama.cpp + sherpa-onnx (no MetalRT):

git clone https://github.com/RunanywhereAI/RCLI.git && cd RCLI
bash scripts/setup.sh
bash scripts/download_models.sh
mkdir -p build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
cmake --build . -j$(sysctl -n hw.ncpu)
./rcli

All dependencies are vendored or CMake-fetched. Requires CMake 3.15+ and Apple Clang (C++17).

CLI Reference
rcli                          Interactive TUI (push-to-talk + text + trace)
rcli listen                   Continuous voice mode
rcli ask <text>               One-shot text command
rcli actions [name]           List actions or show detail
rcli rag ingest <dir>         Index documents for RAG
rcli rag query <text>         Query indexed documents
rcli models [llm|stt|tts]     Manage AI models
rcli voices                   Manage TTS voices
rcli bench [--suite ...]      Run benchmarks
rcli setup                    Download default models
rcli info                     Show engine and model info

Options:
  --models <dir>      Models directory (default: ~/Library/RCLI/models)
  --rag <index>       Load RAG index for document-grounded answers
  --gpu-layers <n>    GPU layers for LLM (default: 99 = all)
  --ctx-size <n>      LLM context size (default: 4096)
  --no-speak          Text output only (no TTS)
  --verbose, -v       Debug logs

Contributions welcome. See CONTRIBUTING.md for build instructions and how to add new actions, models, or voices.

RCLI is open source under the MIT License.

MetalRT is proprietary software by RunAnywhere, Inc., distributed under a separate license.

Built by RunAnywhere, Inc.
