Voice-AI-for-Beginners – A curated learning path for developers

Original link: https://github.com/mahimairaja/voiceai

## Building Real-Time Voice AI Agents: A Learning Path

This guide outlines how developers can build and deploy voice AI agents, from initial setup to production scale. The field is moving fast and converging on one core pattern: a real-time transport (WebRTC/telephony) feeding a streaming pipeline of speech-to-text (STT) → large language model (LLM) → text-to-speech (TTS), managed by a turn-taking model.

**Recommended learning order:**

1. **Foundations:** Understand the pipeline, latency considerations, and core concepts.
2. **Frameworks:** Pick a platform like LiveKit Agents or Pipecat for rapid prototyping.
3. **Components:** Dig into STT, TTS, LLMs, voice activity detection (VAD), and turn detection, experimenting with different providers.
4. **Transport and telephony:** Connect to real phone numbers using SIP trunks.
5. **Production and ethics:** Implement evaluation and monitoring, and address safety and regulatory concerns.

**Key resources:** Explore options like Whisper for STT, Coqui TTS for TTS, and Groq for LLM inference. Prioritize low-latency (sub-200 ms) streaming solutions.

**Staying current:** Follow blogs (LiveKit, Deepgram), newsletters (Latent Space, Voice AI Newsletter), and communities to keep pace with this fast-moving field.

**This curated list prioritizes free, official documentation and vendor-neutral guides, with commercial interests clearly labeled. Resources are tagged by difficulty: 🟢 Beginner, 🟡 Intermediate, 🔴 Advanced.**


Original article


A curated, developer-friendly learning path for building real-time voice AI agents, from your first STT call to scaling production telephony.

Voice AI has moved from research demos into shipping products in under three years. The modern stack is converging around a clear pattern: a real-time transport layer (WebRTC or telephony), a streaming pipeline of speech-to-text → LLM → text-to-speech, and a turn-taking model that decides when the agent should speak. This list is structured to mirror that learning order: start with the foundations, pick a framework, then drill into individual components and production concerns.
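
To make that pipeline concrete, here is a minimal sketch of the data flow. Every module name is hypothetical (my_voice_stack is a stand-in, not a real package); the block only illustrates how the layers connect:

```python
import asyncio

# Hypothetical stand-ins for real providers (Deepgram, OpenAI, Cartesia, ...).
# Nothing here is a real API; the block only illustrates the data flow.
from my_voice_stack import transport, stt, llm, tts  # illustrative names

async def agent_loop() -> None:
    async for turn in stt.stream(transport.audio_in()):  # audio frames -> transcribed turns
        if not turn.is_final:          # the turn-taking model decides when the user is done
            continue
        reply_tokens = llm.stream(turn.text)          # LLM streams tokens as they generate
        async for audio in tts.stream(reply_tokens):  # TTS turns streamed text into audio chunks
            await transport.send_audio(audio)         # play back over WebRTC or the phone leg

asyncio.run(agent_loop())
```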

Resources are tagged 🟢 Beginner, 🟡 Intermediate, or 🔴 Advanced. The list prefers free, official docs and vendor-neutral guides, and flags entries whose authors have commercial interests.


Read top-to-bottom if you're brand new. The recommended path:

  1. Foundations → understand the pipeline and latency budget
  2. Frameworks → pick one (LiveKit Agents or Pipecat are the safest open-source bets) and ship a hello-world
  3. Components (STT, TTS, LLM, VAD, turn detection) → swap pieces to learn what each layer does
  4. Transport & telephony → connect to a real phone number
  5. Evaluation, production, ethics → make it safe enough to ship

Contents

  1. Foundational concepts and learning paths
  2. Frameworks and orchestration platforms
  3. Speech-to-text (STT / ASR)
  4. Text-to-speech (TTS)
  5. LLMs for voice and real-time AI
  6. Voice activity detection and turn-taking
  7. WebRTC fundamentals
  8. Telephony and SIP
  9. Tutorials and hands-on projects
  10. GitHub starter repos and awesome lists
  11. Datasets and benchmarks
  12. Beginner-accessible research papers
  13. Evaluation and testing
  14. Production, deployment, and scaling
  15. Ethics, safety, and regulation
  16. Blogs and newsletters
  17. Podcasts
  18. Communities
  19. Conferences and events
  20. Hackathons and competitions

1. Foundational concepts and learning paths

Start here. These resources establish the mental model of the voice agent pipeline and the latency budget you'll fight for the rest of your career.

2. Frameworks and orchestration platforms

The frameworks below all let you wire STT, an LLM, and TTS together. For open-source production work, LiveKit Agents and Pipecat are the two safest bets; for managed dashboards, Vapi, Retell, and Bland win on time-to-first-call. A hello-world sketch follows the list below.

  • LiveKit Agents Voice AI Quickstart Working assistant in <10 min via Python or TypeScript, runs on top of WebRTC. 🟢 Beginner
  • Pipecat Quickstart Scaffolds a Deepgram + OpenAI + Cartesia pipeline you can talk to in the browser in 5 minutes. 🟢 Beginner
  • Ultravox (fixie-ai/ultravox) Open-weight multimodal speech LLM (Llama/Gemma/Qwen variants) that skips the separate ASR stage for ~150 ms TTFT. 🔴 Advanced
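
As promised above, here is roughly what a LiveKit Agents hello-world looks like. The class and plugin names follow the quickstart pattern, but the API moves quickly, so treat the exact signatures as assumptions and defer to the official docs:

```python
from livekit import agents
from livekit.agents import Agent, AgentSession
from livekit.plugins import deepgram, openai, silero  # plugin names assumed from the docs

async def entrypoint(ctx: agents.JobContext):
    # Wire VAD + STT + LLM + TTS into one session (signatures approximate).
    session = AgentSession(
        vad=silero.VAD.load(),
        stt=deepgram.STT(),
        llm=openai.LLM(model="gpt-4o-mini"),
        tts=openai.TTS(),
    )
    await session.start(
        room=ctx.room,
        agent=Agent(instructions="You are a concise, friendly voice assistant."),
    )

if __name__ == "__main__":
    agents.cli.run_app(agents.WorkerOptions(entrypoint_fnc=entrypoint))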

Realtime / speech-to-speech APIs

Vendor-neutral comparisons

3. Speech-to-text (STT / ASR)

Pick one streaming STT and learn it deeply before shopping around. Deepgram, AssemblyAI, and Whisper derivatives cover most use cases; a minimal transcription sketch follows the list below.

  • openai/whisper The original repo and the de facto starting point for any DIY ASR project. 🟢 Beginner
  • SYSTRAN/faster-whisper CTranslate2 reimplementation up to 4× faster with INT8; recommended for self-hosted Whisper. 🟡 Intermediate
  • NVIDIA NeMo (Parakeet / Canary) Top-of-leaderboard open ASR models with streaming inference recipes. 🔴 Advanced
  • Moonshine Tiny on-device ASR (~190 MB) optimized for live streaming on edge devices. 🟡 Intermediate
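
The DIY path is only a few lines. The first call mirrors the openai/whisper README; the second uses faster-whisper's INT8 path mentioned above (the audio filename is a placeholder):

```python
import whisper                            # pip install openai-whisper
from faster_whisper import WhisperModel   # pip install faster-whisper

# Original Whisper: the simplest possible transcription call.
model = whisper.load_model("base")
print(model.transcribe("meeting.wav")["text"])   # "meeting.wav" is a placeholder

# faster-whisper: CTranslate2 backend with INT8 quantization for the ~4x speedup.
fw = WhisperModel("base", compute_type="int8")
segments, info = fw.transcribe("meeting.wav")
print(" ".join(seg.text for seg in segments))
```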

Benchmarks and explainers

Latency, not raw quality, is what kills voice agents; prioritize providers offering true streaming with first-byte latency under 200 ms.
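
Measuring that first-byte number yourself is cheap. A sketch around a hypothetical streaming-transcription callable (swap in whichever provider SDK you're evaluating):

```python
import time

def first_result_latency_ms(stream_fn, audio_chunks) -> float:
    """Milliseconds from starting the stream to the first partial transcript.

    `stream_fn` is a hypothetical provider call that consumes an iterable of
    audio chunks and yields partial transcripts; adapt it to your SDK.
    """
    start = time.perf_counter()
    for _partial in stream_fn(audio_chunks):
        return (time.perf_counter() - start) * 1000.0
    return float("inf")  # the stream ended without producing a result
```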

4. Text-to-speech (TTS)

  • Coqui TTS (idiap fork) Maintained fork of Coqui-TTS / XTTS v2; the most battle-tested OSS TTS toolkit. 🟡 Intermediate
  • Piper (OHF-Voice/piper1-gpl) Fast local neural TTS optimized for Raspberry Pi; perfect for offline projects. 🟢 Beginner
  • Kokoro 82M Tiny Apache-licensed model that tops community ELO arenas; runs on CPU. 🟢 Beginner
  • F5-TTS Diffusion-transformer TTS with high-quality zero-shot voice cloning. 🟡 Intermediate
  • Orpheus-TTS Llama-3B-based emotive TTS with ~200 ms streaming and emotion tags. 🟡 Intermediate
  • Sesame CSM Conversational, context-aware multi-speaker TTS using a Llama backbone with the Mimi codec. 🔴 Advanced

5. LLMs for voice and real-time AI

A voice agent's perceived intelligence is bounded by how fast the LLM streams its first token. Sub-300 ms time-to-first-token (TTFT) changes the conversational feel entirely.

  • Groq LPU-based inference cloud delivering ~10× faster Llama tokens/sec than commodity GPUs. 🟢 Beginner
  • Cerebras Inference Wafer-scale chip inference with very high throughput on Llama models. 🟢 Beginner
  • SambaNova Cloud Reconfigurable Dataflow inference; stable throughput at low latency. 🟢 Beginner
  • OpenAI Realtime API guide Flagship S2S product with WebRTC/WebSocket transport. 🟡 Intermediate
  • Google Gemini Live Real-time multimodal voice/video with barge-in and 70-language support. 🟡 Intermediate
  • Moshi (kyutai-labs) Open-source full-duplex speech-text foundation model with 200 ms latency; the premier OSS S2S model to study. 🔴 Advanced
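
TTFT is easy to measure yourself: most of the providers above expose an OpenAI-compatible streaming endpoint. In the sketch below the base URL, API key, and model name are placeholders:

```python
import time
from openai import OpenAI  # pip install openai

# Placeholders: substitute your provider's endpoint, key, and model name.
client = OpenAI(base_url="https://api.example.com/v1", api_key="YOUR_KEY")

start = time.perf_counter()
stream = client.chat.completions.create(
    model="llama-3.1-8b-instant",  # placeholder model name
    messages=[{"role": "user", "content": "Say hello in one short sentence."}],
    stream=True,
)
for chunk in stream:
    # The first non-empty delta marks time-to-first-token.
    if chunk.choices and chunk.choices[0].delta.content:
        print(f"TTFT: {(time.perf_counter() - start) * 1000:.0f} ms")
        break
```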

Voice-specific prompting and tools

6. Voice activity detection and turn-taking

Pure VAD is no longer enough: modern agents combine acoustic VAD with a small semantic model that predicts end-of-utterance from words and prosody.
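
The acoustic half is usually Silero VAD. Loading it from torch.hub follows the snakers4/silero-vad README, though the utils tuple layout has shifted between releases, so verify against the repo (the input filename is a placeholder):

```python
import torch

# Per the snakers4/silero-vad README; verify the utils tuple against your release.
model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
get_speech_timestamps, _save_audio, read_audio, _vad_iterator, _collect = utils

wav = read_audio("user_turn.wav", sampling_rate=16000)  # placeholder file
speech = get_speech_timestamps(wav, model, sampling_rate=16000)
print(speech)  # [{'start': ..., 'end': ...}, ...] offsets in samples
```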

7. WebRTC fundamentals

WebRTC is the default transport for voice agents that don't run over the phone network. Understanding ICE, STUN, TURN, and SFU architecture is non-negotiable for production work.
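
To see ICE in action from Python, aiortc (a real library) can build a peer connection against a public STUN server and produce the SDP offer you'd hand to the remote peer:

```python
import asyncio
from aiortc import RTCConfiguration, RTCIceServer, RTCPeerConnection  # pip install aiortc

async def make_offer() -> str:
    # ICE needs at least a STUN server to discover your public address;
    # production setups add TURN relays for clients behind strict NATs.
    pc = RTCPeerConnection(
        RTCConfiguration(iceServers=[RTCIceServer(urls="stun:stun.l.google.com:19302")])
    )
    pc.createDataChannel("chat")         # give the offer something to negotiate
    await pc.setLocalDescription(await pc.createOffer())
    sdp = pc.localDescription.sdp        # send this to the remote peer
    await pc.close()
    return sdp

print(asyncio.run(make_offer())[:300])
```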

8. Telephony and SIP

The phone network has its own physics. Once you know which SIP trunk provider to point at LiveKit or Pipecat, you can ship.

9. Tutorials and hands-on projects

Pick one tutorial and finish it before starting another. Voice AI is unforgiving of half-built pipelines.

10. GitHub starter repos and awesome lists

Clone these instead of writing boilerplate from scratch.

11. Datasets and benchmarks

You'll rarely train from scratch, but knowing which dataset a model was trained on explains its accents, languages, and failure modes.

  • LibriSpeech ASR Corpus ~1,000 hours of English audiobooks; nearly every ASR paper benchmarks against it. 🟢 Beginner
  • Mozilla Common Voice Crowdsourced multilingual dataset (100+ languages); the easiest legal way to fine-tune ASR. 🟢 Beginner
  • Common Voice on HuggingFace One-line load_dataset() access for hands-on experiments; see the snippet after this list. 🟢 Beginner
  • Open ASR Leaderboard Live comparison of 60+ ASR models on WER and real-time factor. 🟢 Beginner
  • Artificial Analysis Speech Independent benchmarks of commercial STT and TTS providers. 🟢 Beginner
  • LJSpeech Dataset ~24 hours of single-speaker English audio; baseline corpus for Tacotron 2 and VITS. 🟢 Beginner
  • VCTK Corpus ~110 English speakers with diverse accents; widely used for multi-speaker TTS. 🟡 Intermediate
  • VoxCeleb (Oxford VGG) Million-utterance "in the wild" dataset for speaker identification and verification. 🟡 Intermediate
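
The load_dataset() one-liner flagged in the Common Voice entry above, with streaming enabled so nothing is downloaded up front. The dataset id and config name track Common Voice releases, so check the Hub for the current one (access also requires accepting the dataset's terms):

```python
from datasets import load_dataset  # pip install datasets

# Dataset id and config are assumptions tied to a specific Common Voice release;
# streaming=True iterates without downloading the full corpus.
cv = load_dataset(
    "mozilla-foundation/common_voice_11_0", "en",
    split="train", streaming=True,
)
sample = next(iter(cv))
print(sample["sentence"])       # transcript text
print(sample["audio"]["path"])  # location of the audio clip
```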

12. Beginner-accessible research papers

These are the landmark papers behind the models you'll actually use. Read the Whisper and Common Voice papers first; they're unusually approachable.

13. Evaluation and testing

You can't ship what you can't measure. Voice-agent evaluation is fundamentally probabilistic: the same test can pass on one run and fail on the next, so simulation and statistics matter more than fixed test cases.
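
In practice that means running each scenario many times and reporting a pass rate with an interval rather than a boolean. A minimal sketch, assuming a hypothetical run_scenario() that returns pass/fail:

```python
import math

def pass_rate_with_ci(run_scenario, n: int = 30, z: float = 1.96):
    """Run a flaky scenario n times; return pass rate plus a 95% Wilson interval.

    `run_scenario` is a hypothetical callable returning True/False per run.
    """
    passes = sum(bool(run_scenario()) for _ in range(n))
    p = passes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return p, (max(0.0, center - half), min(1.0, center + half))
```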

14. Production, deployment, and scaling

Real production voice infrastructure is the hardest unsolved problem in this space. Read these before quoting anyone a per-minute price.

15. Ethics, safety, and regulation

If you're shipping a voice agent in 2026, disclosure and consent are no longer optional. The FCC and EU AI Act both have teeth.

16. Blogs and newsletters

Subscribe to two or three to stay current; the field moves quickly.

19. Conferences and events

  • AI Engineer World's Fair Biggest AI-engineering conference; the Voice track has hosted major launches from ElevenLabs, Vapi, LiveKit, Pipecat, and Cartesia. 🟢 Beginner
  • AI Engineer YouTube channel All World's Fair and Summit talks are posted free; the best library of recent voice-AI talks. 🟢 Beginner
  • AI Engineer Summit Online Voice playlist Curated playlist including voice-track sessions from leading labs. 🟢 Beginner
  • AIEWF 2025 Recap (Latent Space) Written deep-dive into 2025's voice-track talks and major launches. 🟢 Beginner
  • VOICE & AI (Modev) Long-running voice technology conference with broader CX and voicebot focus. 🟢 Beginner
  • Project Voice Main U.S. event for conversational AI across voice, text, and chat. 🟢 Beginner
  • Interspeech Top academic speech-science conference; intimidating but worth knowing, since most landmark papers debut here. 🔴 Advanced

20. Hackathons and competitions


Suggested week-by-week plan

  1. Week 1 (Foundations): Read the LiveKit pipeline post and the Voice AI Illustrated Primer (sections 1, 7).
  2. Week 2 (First agent): Finish the LiveKit or Pipecat quickstart end-to-end (sections 2, 9).
  3. Week 3 (Components): Swap STT, TTS, and LLM providers; benchmark latency (sections 3, 4, 5).
  4. Week 4 (Turn-taking & telephony): Add Silero VAD and a turn detector; connect a SIP trunk (sections 6, 8).
  5. Week 5 (Production): Add evaluation, observability, and read the FCC/EU AI Act material (sections 13, 14, 15).
  6. Ongoing: Subscribe to two newsletters and join a voice AI community on LinkedIn (sections 16, 17, 18).

Pull requests welcome. Resources must have been active in the last 12 months, be accessible to developers, and be vendor-neutral or clearly labeled when authored by a commercial party. Open an issue to suggest additions or removals.
