What even is a small language model now?

Original link: https://jigsawstack.com/blog/what-even-is-a-small-language-model-now--ai

The definition of a "small model" has shifted dramatically. It used to mean a tiny model that could run on a CPU; now it means a model that can be deployed on a single GPU, even a consumer-grade one. The shift has been driven by advances in quantization and optimization, which let models like Llama 3 70B run efficiently on a single 4090. There are two main kinds of small models: edge-optimized models for mobile devices and GPU-friendly models for internal applications. Specialization is the key: thanks to their smaller footprint and targeted training, small models can beat general-purpose LLMs within a specific domain. While models like GPT-4 chase general-purpose intelligence, small models are optimized for particular tasks. That practicality makes them valuable to startups, developers, and large enterprises alike, enabling cost-effective deployment, privacy-focused applications, and task-specific fine-tuning. Ultimately, "small" is less about parameter count than about deployability and efficient use of resources, and that is exactly where their growing power in the AI landscape comes from.

This Hacker News thread discusses the definition of "small language models" (SLMs) in the current AI landscape. Antirez proposes a size classification based on hardware requirements, ranging from Raspberry Pi-compatible models to those requiring high-end workstations. The conversation debates the practicality of using LLMs for simple tasks like plant watering, with some arguing that traditional methods or specialized neural networks like LSTMs are more efficient. Others defend the use of SLMs for zero-shot learning, citing their ease of use and adaptability. The discussion touches on the trade-offs between model size, compute requirements, and performance. There's interest in models that can run locally on browsers via WASM, prioritizing data privacy and reduced cost. Others desire to see models capable of fitting on a gaming GPU such as a GeForce 3080. There's recognition that small models need to be fine-tuned for more specific tasks or domains to compensate for their constraints. Ultimately, the definition of "small" is relative and evolving alongside advancements in model capabilities and hardware.

Original article

If you asked someone in 2018 what a "small model" was, they'd probably say something with a few million parameters that ran on a Raspberry Pi or your phone. Fast-forward to today, and we're calling 30B parameter models "small"—because they only need one GPU to run.

So yeah, the definition of "small" has changed.

Small Used to Mean... Actually Small

Back in the early days of machine learning, a "small model" might've been a decision tree or a basic neural net that could run on a laptop CPU. Think scikit-learn, not LLMs.

Then came transformers and large language models (LLMs). As these got bigger and better, anything not requiring a cluster of A100s suddenly started to feel... small by comparison.

Today, small is more about how deployable the model is, not just its size on paper.

Types of Small Models (By 2025 Standards)

We now have two main flavors of small language models:

1. Edge-Optimized Models

These are the kind of models you can run on mobile devices or edge hardware. They're optimized for speed, low memory, and offline use.

  • Examples: Phi-3-mini (3.8B), Gemma 2B, TinyLlama (1.1B)
  • Use cases: voice assistants, translation on phones, offline summarization, chatbots embedded in apps
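
To make that concrete, here is a minimal sketch of loading one of these edge-sized checkpoints with Hugging Face transformers and running it on nothing but a laptop CPU. The TinyLlama model id and the prompt are illustrative assumptions, not something from the list above.

```python
# Minimal sketch: run a ~1B-parameter model locally; a laptop CPU is enough.
# The model id is an illustrative assumption, not something from the post.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # ~1.1B params, a few GB of RAM

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)  # fp32 on CPU by default

prompt = "Summarize in one sentence: the meeting moved to Thursday at 3pm."
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=48)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```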

2. GPU-Friendly Models

These still require a GPU, but just one GPU—not a whole rack. In this category, even 30B or 70B models can qualify as "small".

  • Examples: Meta Llama 3 70B (quantized), MPT-30B
  • Use cases: internal RAG pipelines, chatbot endpoints, summarizers, code assistants

The fact that you can now run a 70B model on a single 4090 and get decent throughput? That would've been science fiction a few years ago.
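
Under the hood, the usual trick is quantizing the weights at load time. Here's a hedged sketch of the common 4-bit route with transformers and bitsandbytes; the Qwen checkpoint is just a stand-in for any ~30B instruct model, and note that 70B at 4 bits still wants ~40GB of VRAM, so squeezing it onto a 24GB card takes the more aggressive ~2-bit formats discussed below.

```python
# Hedged sketch: load a ~30B-class model in 4-bit on a single consumer GPU.
# Model id and settings are illustrative assumptions; at 4 bits a ~30B model
# needs roughly 16-20GB of VRAM, so it fits a 24GB card with room for the KV cache.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "Qwen/Qwen2.5-32B-Instruct"  # assumption: any ~30B instruct model works the same way

quant = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",            # 4-bit NormalFloat quantization
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant,
    device_map="auto",                    # put every layer on the one visible GPU
)

inputs = tokenizer("Explain retrieval-augmented generation in two sentences.",
                   return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```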

Specialization: The Real Power Move

One big strength of small models is that they don't need to do everything. Unlike GPT-4 or Claude, which try to be general-purpose brains, small models are often narrow and optimized.

That gives them a few key advantages:

  • They stay lean — no need to carry weights for tasks they’ll never do.
  • They’re more accurate in-domain — a small legal model will outperform a general-purpose LLM on legal docs.
  • They’re easier to fine-tune — less data, faster iteration.

Small models shine when you know what you want. Think: summarizing medical records, identifying security vulnerabilities, parsing invoices—stuff that doesn't need general reasoning across the internet.
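
And because only a sliver of the weights needs to change, pointing a small model at one of those narrow jobs is cheap. Here's a hedged sketch of the usual LoRA route via the peft library; the base model, target modules, and hyperparameters are illustrative assumptions, not a recipe from this post.

```python
# Hedged sketch: attach LoRA adapters to a small base model for a narrow task.
# Base model, target modules, and hyperparameters are illustrative assumptions.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # assumption: any small causal LM

model = AutoModelForCausalLM.from_pretrained(base_id)

lora = LoraConfig(
    r=16,                                 # adapter rank: the "how much to learn" knob
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # attention projections in Llama-style models
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # typically well under 1% of the base weights train
# ...then run a normal Trainer/SFT loop on your domain data (invoices, legal, medical).
```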

30B+ Models: Still Small?

Sounds weird, but yes. The bar for what’s considered "small" keeps shifting.

With the right quantization and engineering, even a 70B model can run comfortably on a high-end consumer GPU:

  • Llama 3.1 70B can be shrunk from 140GB (FP16) to 21GB (2-bit), running on a single 24GB VRAM card.
  • Throughput? ~60 tokens/sec — totally usable for many production workloads.
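
The arithmetic behind those numbers is easy to sanity-check: weight memory is roughly parameter count times bits per weight. The sketch below ignores the KV cache and runtime overhead, and the 2.4-bit row is an assumption about what practical "2-bit" formats average per weight.

```python
# Back-of-the-envelope VRAM for model weights: parameter count x bits per weight.
# Ignores KV cache, activations, and runtime overhead, so real usage runs higher.
def weight_gb(params_billions: float, bits_per_weight: float) -> float:
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9  # decimal gigabytes

for bits in (16, 8, 4, 2.4):
    print(f"70B @ {bits:>4} bits/weight ~= {weight_gb(70, bits):6.1f} GB")

# 70B @   16 bits/weight ~=  140.0 GB   (FP16: multi-GPU territory)
# 70B @    8 bits/weight ~=   70.0 GB
# 70B @    4 bits/weight ~=   35.0 GB
# 70B @  2.4 bits/weight ~=   21.0 GB   (the ~2-bit figure above; fits a 24GB card)
```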

So now we talk about models being "small" if they’re:

  • Deployable without distributed inference
  • Runnable on one GPU (especially consumer-grade)
  • Tunable without a lab full of TPUs

It’s less about size, more about practicality.

Everyday Small Models: The Unsung Heroes

Not all small models are new. Some of the most widely used models today have been around for years, quietly powering everyday tools we rely on.

  • Google Translate: Since 2006, it's been translating billions of words daily. In 2016, Google switched to a neural machine translation system, GNMT, which uses an encoder-decoder architecture with long short-term memory (LSTM) layers and attention mechanisms. This system, with over 160 million parameters, significantly improved translation fluency and accuracy.

  • AWS Textract: This service extracts text and data from scanned documents. It's been a staple in automating document processing workflows, handling everything from invoices to medical records.

These models may not be cutting-edge by today's standards, but they've been instrumental in shaping the AI landscape and continue to serve millions daily.

Why This Matters

Small models are becoming a huge deal:

  • Startups can deploy LLMs without spending six figures on infra.
  • Developers can run local models for privacy-focused apps.
  • Enterprises can fine-tune task-specific LLMs without massive overhead.

And when a "small model" can hold its own against GPT-3.5 in benchmarks? The game has officially changed.

TL;DR

  • Small models used to mean tiny. Now they mean "runs without drama."
  • You’ve got edge models, GPU-ready models, and everything in between.
  • Specialization is where small models shine.
  • 30B and 70B models can be small—if they’re optimized well.
  • Practicality > parameter count.

In a world chasing ever-bigger models, small ones are quietly doing more with less—and that's exactly what makes them powerful.

👥 Join the JigsawStack Community

Have questions or want to show off what you’ve built? Join the JigsawStack developer community on Discord and X/Twitter. Let’s build something amazing together!
