Build your own Siri locally and on-device. No cloud required.

Original link: https://thehyperplane.substack.com/p/build-your-own-siri-locally-on-device

Tired of cloud AI? This tutorial series guides developers through building a private, on-device voice assistant that understands natural language, executes functions, and respects privacy. It focuses on practical implementation, scalability, and data security rather than generic AI hype. The series covers fine-tuning LLaMA 3.1 for local use, creating a function-calling dataset, and running inference locally with GGUF. It also explains how to wire up voice input/output with Whisper, and emphasizes MLOps principles for robust on-device performance, including dataset versioning, experiment tracking, and stress testing. The goal is to build micro-niche, AI-powered minimum viable products (MVPs) that run locally and privately, aimed at developers, privacy-conscious mobile apps, and internal teams that need secure applications.

Hacker News users are discussing a tutorial for building a local, on-device Siri alternative. Many commenters praised the accuracy and speed of the Whisper speech-to-text model, especially when optimized with CoreML or whisper.cpp; one user reported running a medium-sized model smoothly. The discussion also covered the limitations of replacing Siri on iOS, since hardware-level wake-word handling is done by a separate coprocessor. While a direct replacement is difficult, some mentioned assigning a custom Shortcut to the Action button, or using an app like Perplexity, as workarounds. Users also debated the capabilities and shortcomings of current voice assistants: some find Siri underpowered, while others noted they mostly use it for basic commands like setting timers or checking the weather, which a custom assistant could easily handle. The general view was that, if the existing permission model allowed it, the community would eventually build a Siri replacement.

Original article

The edge is back. This time, it speaks.

Let’s be honest.
Talking to ChatGPT is fun.
But do you really want to send your "lock my screen" or "write a new note" request to a giant cloud model?


…Just to have it call back an API you wrote?

🤯 What if the model just lived on your device?


What if it understood you, called your functions directly, and respected your privacy?

We live in an era where everything is solved with foundational LLM APIs, privacy is a forgotten concept, and everybody is hyped about any new model. However, nobody discusses how that model can be served, how the price will scale, and how data privacy is respected.

It’s all noise.
Experts everywhere.
And yet... no one shows you how to deliver something real.

We’re done with generic AI advice.
Done with writing for engagement metrics.
Done with pretending open-source LLMs are production-ready "out of the box."

We’re here to build.

Only micro-niche, AI-powered MVPs.
Only stuff that runs. In prod. Locally. Privately.

You’re in a meeting. You whisper:

“Turn off my volume. Search when the thermal motor was invented.”

Your laptop obeys.
No API call to the cloud.
No OpenAI logs.
Just you, a speech-to-text model, a lightweight LLM, and a bit of voice magic.

This isn’t sci-fi.
It’s actually easier than ever to build your own local voice assistant that:

  • Understands natural language

  • Executes your own app functions

  • Works offline on macOS, Linux, and even mobile

  • Keeps all data private, stored on your device

I’m building this, and I’ll teach you how to do it too.

This isn’t for chatbot tinkerers.
This is for:

  • Devs building on the edge

  • Privacy-first mobile apps

  • Teams deploying apps in sensitive environments (health, legal, internal tools)

  • R&D teams passionate about on-device AI

This 5-part hands-on series is 100% FREE.

In this hands-on course, you'll:

  • Fine-tune LLaMA 3.1 8B with LoRA for local use

  • Create a function-calling dataset (like https://huggingface.co/datasets/Salesforce/xlam-function-calling-60k)

  • Run inference locally on your laptop using GGUF

  • Connect everything to voice input/output (with Whisper or another custom speech-to-text model)

Oh, it also has a GitHub repository!

Why now is the time to run voice assistants locally, plus a complete system overview with function mapping.
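
To make "function mapping" concrete, here is a minimal sketch of how spoken intents could resolve to local callables. The function names lock_screen, get_battery_status, and search_google come from this series; the implementations below are assumptions (the pmset calls are macOS-only) and should be swapped for whatever your platform actually exposes.

```python
import subprocess
import webbrowser

# Minimal sketch of a function map: spoken intents resolve to local callables.
# Implementations are placeholders (pmset is macOS-only); adapt per platform.

def lock_screen() -> None:
    subprocess.run(["pmset", "displaysleepnow"], check=False)  # sleep the display

def get_battery_status() -> str:
    return subprocess.run(["pmset", "-g", "batt"], capture_output=True, text=True).stdout

def search_google(query: str) -> None:
    webbrowser.open(f"https://www.google.com/search?q={query}")

FUNCTION_MAP = {
    "lock_screen": lock_screen,
    "get_battery_status": get_battery_status,
    "search_google": search_google,
}

def dispatch(name: str, arguments: dict):
    """Look up the function the model asked for and call it with its arguments."""
    func = FUNCTION_MAP.get(name)
    if func is None:
        raise ValueError(f"Model requested an unknown function: {name}")
    return func(**arguments)
```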

We’ll generate a custom function-calling dataset using prompt templates, real API call formats, and verification logic.
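
For a rough idea of the target format, here is how a single training record could look, loosely modeled on the xlam-function-calling-60k layout linked above. The exact field names, and the set_volume function, are assumptions rather than the dataset the series actually builds.

```python
# One possible shape for a single training record, loosely following the
# Salesforce/xlam-function-calling-60k layout. Field names are an assumption.
record = {
    "query": "Hey, mute my laptop and tell me how much battery is left.",
    "tools": [
        {"name": "set_volume", "parameters": {"level": "integer 0-100"}},
        {"name": "get_battery_status", "parameters": {}},
    ],
    "answers": [
        {"name": "set_volume", "arguments": {"level": 0}},
        {"name": "get_battery_status", "arguments": {}},
    ],
}
```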

You’ll learn how to fine-tune LLaMA 3.1 (or TinyLlama) using Unsloth, track it with W&B, and export as a GGUF model for edge inference.

We’ll use Whisper (tiny model) to transcribe speech, send it through the LLM, parse the response, and call the actual function on-device.

Final UX polish: make it a menubar app on Mac, a background service on Linux, or integrate it into your mobile app.


🚀 Want to follow along, or build your own?

📩 Subscribe to the series


🔧 Need help building one? Book a free strategy call

Let’s make local AI useful again.
Let’s build something that works.

Before we get excited about whispering to our laptops/mobiles and running function calls on-device, we need to pause and ask:

“Where in my system can things go wrong and impact the user?”

That’s the heart of MLOps. And even if this is a “local-only, no-cloud” assistant, the principles still apply.

Here’s why:

  • Models drift — even locally. If you fine-tune, you need to track it.

  • Prompt engineering is messy — your logic might change weekly.

  • Debugging hallucinations? Good luck without logs or prompt versions.

  • Dataset validation — how do we really know that the dataset we’ve just created is good enough?

  • Evaluating the fine-tuned model

Let’s not forget that the building part of this system happens in the cloud: building the dataset, fine-tuning the model, and deploying the model to a model registry.

The irony of building an “on-device Siri” is that… it starts online.

Not cloud-based inference — but online development.
This is the part where MLOps earns its name.

When you build a voice assistant that runs locally, you still need:

  • A fine-tuning dataset with function-calling principles in mind

  • A testing suite for common interactions

  • An evaluation strategy for real-world usage

This is the first place people cut corners. They scrape a few prompts, convert them into JSON, and call it a dataset. Then they fine-tune, and wonder why the model breaks on anything slightly off-pattern.

A better approach is to version the dataset, test multiple edge cases, and label failure modes clearly. Ask:

  • How diverse are your command phrasings?

  • Do you include rare or ambiguous intents?

  • Do you cover common errors like repetitions, hesitations, or mic glitches?
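
A rough sketch of the kind of audit those questions imply, run over a JSONL dataset before any fine-tuning. The "query"/"answers" field names and the coverage threshold are assumptions.

```python
import json
from collections import defaultdict

# Rough pre-fine-tuning audit over a JSONL dataset (one record per line).
MIN_PHRASINGS_PER_FUNCTION = 20  # assumed threshold; tune to your use case

def audit_dataset(path: str) -> None:
    phrasings_per_function: dict[str, set[str]] = defaultdict(set)
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            for call in record["answers"]:
                phrasings_per_function[call["name"]].add(record["query"])

    for name, phrasings in sorted(phrasings_per_function.items()):
        flag = "" if len(phrasings) >= MIN_PHRASINGS_PER_FUNCTION else "  <-- under-covered"
        print(f"{name}: {len(phrasings)} distinct phrasings{flag}")

audit_dataset("function_calls.jsonl")
```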

Fine-tuning is deceptively simple. It works. Until it doesn’t.
The model improves on your examples — but gets worse everywhere else.

This is where experiment tracking matters. Use simple MLOps principles like:

  • Version every checkpoint

  • Test on unseen commands, not just your eval set

  • Compare against your zero-shot baseline

And most importantly, validate the hybrid system — LLM + function caller + speech parser — not just the LLM alone.
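
Here is a minimal sketch of that tracking discipline using Weights & Biases. The metric names, config values, and numbers are illustrative placeholders, not results from the series.

```python
import wandb

# Minimal experiment-tracking sketch: version the checkpoint, log eval on
# unseen commands, and compare against the zero-shot baseline.
run = wandb.init(project="local-voice-assistant", config={
    "base_model": "llama-3.1-8b",
    "lora_r": 16,
    "dataset_version": "v0.3",
})

zero_shot_accuracy = 0.41   # placeholder: measured once on the untuned base model
finetuned_accuracy = 0.87   # placeholder: accuracy on *unseen* commands, not the train eval set

wandb.log({
    "eval/unseen_command_accuracy": finetuned_accuracy,
    "eval/zero_shot_baseline": zero_shot_accuracy,
    "eval/gain_over_baseline": finetuned_accuracy - zero_shot_accuracy,
})

# Version the checkpoint itself as an artifact so every run is reproducible.
artifact = wandb.Artifact("assistant-lora", type="model")
artifact.add_dir("checkpoints/run-042")
run.log_artifact(artifact)
run.finish()
```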

You don’t ship a local agent before stress-testing it. Build a script that runs through:

  • 100+ common voice commands

  • Wrong mic inputs

  • Conflicting functions

  • Empty or partial user phrases

  • Multiple accents and speech patterns (if you use Whisper or another STT model)

Catch regressions before you put the model on-device.
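
A skeleton of such a stress-test harness might look like this. parse_command stands in for whatever entry point your LLM-plus-parser exposes; it is an assumed interface, and the command lists are illustrative.

```python
# Skeleton of a pre-deployment stress test over common and adversarial inputs.
COMMON_COMMANDS = [
    "lock my screen",
    "what's my battery at",
    "turn the volume down to zero",
    # ... extend to 100+ real phrasings collected from test users
]
ADVERSARIAL_INPUTS = [
    "",                        # empty transcription
    "uh can you um",           # partial / hesitant phrase
    "lock unlock my screen",   # conflicting intents
    "xkqz flrm",               # mic glitch / garbage STT output
]

def run_stress_test(parse_command) -> None:
    failures = []
    for text in COMMON_COMMANDS + ADVERSARIAL_INPUTS:
        try:
            call = parse_command(text)   # expected: {"name": ..., "arguments": ...} or None
        except Exception as exc:         # any crash counts as a regression
            failures.append((text, repr(exc)))
            continue
        if text in ADVERSARIAL_INPUTS and call is not None:
            failures.append((text, f"should have refused, got {call}"))
    total = len(COMMON_COMMANDS) + len(ADVERSARIAL_INPUTS)
    print(f"{len(failures)} failures out of {total} cases")
    for text, reason in failures:
        print(f"  {text!r}: {reason}")
```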

This is where people relax. Don’t.

Deploying a model offline doesn’t mean it’s safe from bugs. It means you lose visibility.

So the only way to survive this phase is to prepare for it:

  • Run your system on multiple devices, with different specs and OS versions

  • Use test users (not just friends) and ask them to break it

  • Track logs locally and offer a way for testers to export logs manually (opt-in)
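
One way to keep logs local while still giving testers an explicit, opt-in export path. The paths and names here are illustrative, not the series' actual implementation.

```python
import logging
from logging.handlers import RotatingFileHandler
from pathlib import Path
import shutil

# Local-only logging with an explicit, opt-in export. Nothing leaves the device
# unless a tester deliberately calls export_logs_for_tester().
LOG_DIR = Path.home() / ".local_assistant" / "logs"
LOG_DIR.mkdir(parents=True, exist_ok=True)

logger = logging.getLogger("assistant")
logger.setLevel(logging.INFO)
logger.addHandler(RotatingFileHandler(LOG_DIR / "assistant.log", maxBytes=1_000_000, backupCount=3))

def export_logs_for_tester(destination: str) -> None:
    """Zip up the local log directory so a tester can share it manually."""
    shutil.make_archive(destination, "zip", LOG_DIR)
```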

Here’s the system in 3 major phases:

Before we can train a model to call functions like lock_screen() or search_google("ML Vanguards"), we need to teach it how those calls look in a natural conversation. This part handles:

  • Building a dataset using prompts and LLMs

  • Using templates to simulate human-like voice requests covering the different core functions of your system

  • Automatically verifying outputs with a test engine

  • Creating a clean dataset for function-calling fine-tuning

This is the most overlooked part in most LLM tutorials: how do you teach a model to behave in your context? Not by downloading Alpaca data. You have to create your own dataset: structured, specific, and validated.

We don’t want “chatbot vibes.” We want deterministic, parseable function calls from real intent.
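
Putting those pieces together, a generation-plus-verification loop could be sketched like this. generate is an assumed wrapper around whichever larger LLM synthesizes the samples, and the template wording and tool list are illustrative.

```python
import json

# Sketch of the generation + verification loop: a prompt template asks an LLM to
# produce (utterance, function call) pairs, and a small test engine rejects
# anything that does not parse into a known function with known arguments.
PROMPT_TEMPLATE = (
    "You control these functions: {tool_specs}\n"
    "Write one natural, spoken-style user request, then the exact call.\n"
    'Respond as JSON: {{"query": "...", "answers": [{{"name": "...", "arguments": {{}}}}]}}'
)

# Allowed argument names per tool (illustrative).
KNOWN_TOOLS = {"lock_screen": set(), "search_google": {"query"}, "get_battery_status": set()}

def verify(record: dict) -> bool:
    """Reject samples that reference unknown functions or unknown arguments."""
    for call in record.get("answers", []):
        if call.get("name") not in KNOWN_TOOLS:
            return False
        if set(call.get("arguments", {})) - KNOWN_TOOLS[call["name"]]:
            return False
    return bool(record.get("query"))

def build_dataset(generate, n_samples: int) -> list[dict]:
    kept = []
    while len(kept) < n_samples:
        raw = generate(PROMPT_TEMPLATE.format(tool_specs=json.dumps(list(KNOWN_TOOLS))))
        try:
            record = json.loads(raw)
        except json.JSONDecodeError:
            continue
        if verify(record):
            kept.append(record)
    return kept
```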

Once we have the dataset, we fine-tune a small base model (like LLaMA 3.1 8B) using LoRA adapters. The goal is not general reasoning; it’s precision on our task: mapping spoken intent to exact API calls.

We use:

  • Unsloth for fast, GPU-efficient fine-tuning

  • Supervised instruction tuning (SFT) with a custom loss

  • Weights & Biases to track experiments

  • Export to GGUF for 4-bit quantized inference

This step allows us to deploy the model efficiently on consumer hardware: laptops, phones, even Raspberry Pis (there will be a BONUS chapter about this), without needing a GPU at inference time.
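
A condensed sketch of that training and export step with Unsloth and TRL follows. The checkpoint name, dataset path, and hyperparameters are placeholders, and exact argument names can differ between library versions.

```python
from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset

# Load a 4-bit base model and attach LoRA adapters (checkpoint name is a placeholder).
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B-Instruct",
    max_seq_length=2048,
    load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(
    model, r=16, lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

dataset = load_dataset("json", data_files="function_calls.jsonl", split="train")

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",          # pre-formatted prompt + expected function call
    args=TrainingArguments(
        output_dir="checkpoints",
        per_device_train_batch_size=2,
        num_train_epochs=2,
        logging_steps=10,
        report_to="wandb",              # experiment tracking with Weights & Biases
    ),
)
trainer.train()

# Export a 4-bit quantized GGUF for llama.cpp-style inference on-device.
model.save_pretrained_gguf("gguf_model", tokenizer, quantization_method="q4_k_m")
```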

The final piece ties it all together. We connect:

  • Whisper for speech-to-text

  • Our fine-tuned LLM for function parsing

  • A small toolset of real functions: lock_screen(), get_battery_status(), etc.

The result? A working agent that:

  • Listens to your voice

  • Converts it into structured function calls

  • Executes them on your machine

  • Does all of this without touching the cloud

This system can run in real time, without network access, with full control and observability.
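
A bare-bones version of that loop, assuming the openai-whisper and llama-cpp-python packages. The prompt format, file paths, and the stubbed dispatcher are assumptions here, not the series' exact code.

```python
import json
import whisper                      # openai-whisper for local speech-to-text
from llama_cpp import Llama         # llama-cpp-python for GGUF inference

def dispatch(name: str, arguments: dict) -> None:
    # Stub: in the real app this is the FUNCTION_MAP dispatcher sketched earlier.
    print(f"Would call {name}({arguments})")

stt = whisper.load_model("tiny")
llm = Llama(model_path="gguf_model/model-q4_k_m.gguf", n_ctx=2048)  # adjust to your exported file

def handle_utterance(wav_path: str) -> None:
    # Record -> transcribe -> LLM -> parse -> execute, all on-device.
    text = stt.transcribe(wav_path)["text"]
    prompt = f"User request: {text}\nRespond with a JSON function call only.\n"
    completion = llm(prompt, max_tokens=128, stop=["\n\n"])
    raw = completion["choices"][0]["text"].strip()
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        print(f"Could not parse model output: {raw!r}")
        return
    dispatch(call["name"], call.get("arguments", {}))

handle_utterance("command.wav")
```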

Building AI systems that run locally doesn’t mean abandoning rigor. In fact, it demands more of it.

You don’t have the fallback of logging everything to some remote server. You can’t ship half-baked models and patch them later with “just a new prompt.” Once it’s on-device, it’s on you.

So we start with MLOps. Not dashboards. Not tooling. Just a thinking framework:

  • How do we avoid silent failures?

  • How do we make changes traceable?

  • How do we catch issues before the user does?

This first lesson was about that thinking process, the invisible part that makes everything else possible. In the next lessons, we will see how to apply these MLOps principles to each component.

Next up: how to actually generate the function-calling dataset.

We’ll write prompts, simulate user requests, auto-verify outputs, and build the data we need to fine-tune the model. No scraping. Just structured, validated data that teaches the model how to behave.

Want the next part?
Subscribe to follow the series as it drops.

🧠 Need help designing your own local AI system?
Book a call— we help R&D teams and startups ship real, on-device products.
