Fine-tuning LLMs is a waste of time

Original link: https://codinginterviewsmadesimple.substack.com/p/fine-tuning-llms-is-a-huge-waste

Fine-tuning advanced large language models to inject knowledge is usually a wasteful and potentially destructive process. While it seems intuitive, it can overwrite valuable existing knowledge encoded in the network's neurons. These neurons are not blank slates; they are densely interconnected and store critical information. Updating them can erase established patterns and cause unexpected downstream effects. Instead, modular approaches such as retrieval-augmented generation (RAG), adapter modules (LoRA), and prompt engineering offer safer alternatives. RAG dynamically augments knowledge using external databases, while adapter modules inject new information through isolated subnetworks. Prompt engineering guides the LLM toward better answers. These techniques preserve the integrity of the model's existing knowledge base. Fine-tuning should be undertaken cautiously, recognizing that neurons are a precious and finite resource. Adopting modular solutions is key to building adaptable, scalable, and robust AI systems without destroying a carefully built knowledge ecosystem.

This Hacker News thread discusses an article claiming that fine-tuning large language models (LLMs) is a waste of time because it overwrites existing knowledge, and advocating alternatives such as LoRA. Several commenters pushed back on this view. One argued that while LoRA is efficient, it is fundamentally similar to full fine-tuning and that the author misunderstands how it works. Another emphasized that task-specific fine-tuning produces better results than a general-purpose model, especially in specialized domains, and that "damaging" the model's broader capabilities to optimize for a single task is acceptable. Counterpoints included the observation that complex business processes and tool use demand the best, most capable models, while simple tasks (such as report formatting) may not. Commenters also debated whether fine-tuned models can be used to inject new knowledge or are best suited to style transfer. Many agreed that fine-tuning with LoRA adapters is common practice because it delivers results faster and more reliably.

Original Article

It takes time to create work that’s clear, independent, and genuinely useful. If you’ve found value in this newsletter, consider becoming a paid subscriber. It helps me dive deeper into research, reach more people, stay free from ads/hidden agendas, and supports my crippling chocolate milk addiction. We run on a “pay what you can” model—so if you believe in the mission, there’s likely a plan that fits (over here).

Every subscription helps me stay independent, avoid clickbait, and focus on depth over noise, and I deeply appreciate everyone who chooses to support our cult.

Help me buy chocolate milk

PS – Supporting this work doesn’t have to come out of your pocket. If you read this as part of your professional development, you can use this email template to request reimbursement for your subscription.

Every month, the Chocolate Milk Cult reaches over a million Builders, Investors, Policy Makers, Leaders, and more. If you’d like to meet other members of our community, please fill out this contact form here (I will never sell your data nor will I make intros w/o your explicit permission)- https://forms.gle/Pi1pGLuS1FmzXoLr6

Recently, I was on call with an investor who wanted my help in doing due diligence on a startup. During our conversation, they casually mentioned that the startup would be relying on fine-tuning to ensure that their systems were always updated with new information. I was surprised to see the myth of fine-tuning alive and kicking, but I guess Fine Tuning has been chugging on that same immortality juice as GOAT-naldo.

Fine-tuning large language models (LLMs) is frequently sold as a quick, powerful method for injecting new knowledge. On the surface, it makes intuitive sense: feed new data into an already powerful model, tweak its weights, and improve performance on targeted tasks.

But this logic breaks down for advanced models, and badly so. At high performance, fine-tuning isn’t merely adding new data — it’s overwriting existing knowledge. Every neuron updated risks losing information that’s already intricately woven into the network. In short: neurons are valuable, finite resources. Updating them isn’t a costless act; it’s a dangerous trade-off that threatens the delicate ecosystem of an advanced model.

In today’s article, we’ll be talking about why Fine-Tuning LLMs is a giant waste of time for Knowledge Injection (which is what 90% of people think of when they think of fine-tuning).

Fine-tuning advanced LLMs isn’t knowledge injection — it’s destructive overwriting. Neurons in trained language models aren’t blank slates; they’re densely interconnected and already encode crucial, nuanced information. When you fine-tune, you risk erasing valuable existing patterns, leading to unexpected and problematic downstream effects.

Instead, use modular methods like retrieval-augmented generation, adapters, or prompt-engineering — these techniques inject new information without damaging the underlying model’s carefully built ecosystem.

I provide various consulting and advisory services. If you‘d like to explore how we can work together, reach out to me through any of my socials over here or reply to this email.

To grasp why fine-tuning advanced language models isn’t as straightforward as it sounds, let’s first consider how neural networks, particularly language models, are trained from scratch.

At their core, neural networks are immense collections of interconnected neurons, each holding numerical values (weights) that determine their behavior. Initially, these weights are set randomly — no encoded meaning, no stored knowledge, just mathematical noise.

When training starts, the network receives input (words, sentences, documents), makes predictions (next word, sentence completions), and calculates how far off these predictions are from reality. This difference is called the loss. The network then uses a process known as backpropagation to adjust each neuron’s weights incrementally, reducing this loss. Early in training, this is easy — the neurons store essentially random values, so updating them incurs minimal loss of useful information. The whole process is visualized below-
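The loop described above can be sketched in a few lines. This is a toy linear model with made-up data, and a hand-written analytic gradient stands in for backpropagation, but the shape of the process is the same: random weights, prediction, loss, incremental update.

```python
import numpy as np

# Toy version of the training loop: weights start as noise, predictions are
# compared to reality (the loss), and each step nudges the weights to reduce
# that loss. Data, dimensions, and learning rate are arbitrary for the sketch.
rng = np.random.default_rng(0)

X = rng.normal(size=(64, 8))                  # 64 "inputs", 8 features
true_w = rng.normal(size=8)
y = X @ true_w + 0.01 * rng.normal(size=64)   # the targets to predict

w = rng.normal(size=8)                        # random init: no stored knowledge

def loss(w):
    pred = X @ w                              # forward pass: make predictions
    return np.mean((pred - y) ** 2)           # how far off are we?

initial = loss(w)
lr = 0.05
for _ in range(300):
    grad = 2 * X.T @ (X @ w - y) / len(y)     # gradient of the loss (backprop
    w -= lr * grad                            # in a real net); small update

final = loss(w)
print(initial, "->", final)                   # loss shrinks as weights encode
                                              # the pattern in the data
```

Early on, every update is nearly free: the weights hold nothing but noise, so there is nothing to lose by changing them. That stops being true later, which is the whole point of this article.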

With more training, the network progressively encodes meaningful patterns: linguistic nuances, syntax rules, semantic relationships, and context-dependent meanings. The neurons evolve from background character A into important side characters like Kirishima, with some evolving to Kacchan status in the Network.

At the level of modern LLMs (which is what most suckers try to tune), most neurons are densely packed with critical insights. Fine-tuning (or running any updates on them) is likely to hit a few of your important neurons, completely changing your expected behavior.

You can see this in the research around Safety. As we saw earlier, alignment changes the distribution of biases in the outputs, creating new, unexpected biases that are significantly different from your baseline model’s. Take for example this case-

Given that no one I’ve ever met likes the Brits, one could argue that the alignment dropping them is doing its job (since it also dropped the French, I think we’ve attained AGI), but the dramatic reduction of diversity, and the changed rankings of data points are both unexpected. The most dramatic example of this is shown here- “Finally, the distribution of customer gender (Figure 6) shows that the base model generates approximately 80% male and 20% female customers, while the aligned model generates nearly 100% female customers, with a negligible number of males.”

All that to show you that alignment has all kinds of implications that we haven’t explored in depth yet, and this ignorance about it makes red-teaming that much harder (can’t hit a target you don’t understand).

This is the crux: neurons are no longer neutral — each update risks overwriting existing, valuable information, leading to unintended consequences across the network. A neuron might be important in more than one task, so updating it will lead to unexpected downstream implications.
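The overwrite problem can be demonstrated with a toy model: one shared weight vector trained on task A, then naively "fine-tuned" on task B alone. Both tasks and all data are synthetic inventions for this sketch, but the effect is the same one that bites real LLMs: the weights that encoded task A get repurposed, and task A performance collapses.

```python
import numpy as np

# Catastrophic forgetting in miniature: the same weights serve both tasks,
# so updating them for task B erases what they stored about task A.
rng = np.random.default_rng(1)

Xa, wa = rng.normal(size=(50, 6)), rng.normal(size=6)
ya = Xa @ wa                          # task A: one linear pattern
Xb, wb = rng.normal(size=(50, 6)), rng.normal(size=6)
yb = Xb @ wb                          # task B: a different, conflicting pattern

def fit(w, X, y, steps=500, lr=0.05):
    """Plain gradient descent on mean-squared error."""
    for _ in range(steps):
        w = w - lr * 2 * X.T @ (X @ w - y) / len(y)
    return w

def mse(w, X, y):
    return np.mean((X @ w - y) ** 2)

w = fit(np.zeros(6), Xa, ya)          # "pretrain" on task A
loss_a_before = mse(w, Xa, ya)        # near zero: task A is learned

w = fit(w, Xb, yb)                    # naive fine-tune on task B only
loss_a_after = mse(w, Xa, ya)         # task A knowledge has been overwritten

print(loss_a_before, "->", loss_a_after)
```

Real networks are vastly overparameterized compared to this six-weight toy, which softens the effect but does not remove it: any neuron serving more than one capability is a candidate for exactly this kind of silent damage.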

Understanding this is key to recognizing the hidden costs of fine-tuning advanced language models. Unless you have invested a lot of money in AWS and you want to make sure that their stock goes up, you’re better off spending your time on better things.

If fine-tuning is a risky solution, what’s the alternative? The answer lies in modularity and augmentation. Techniques such as retrieval-augmented generation (RAG), external memory banks, and adapter modules provide more robust ways to incorporate new information without overwriting the existing network’s knowledge base.
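The RAG idea fits in a few lines. In this sketch, naive word-overlap scoring stands in for a real embedding search, and the knowledge base, documents, and prompt format are all invented for illustration; a production system would use a vector database and an embedding model. The point is structural: the new knowledge lives outside the weights, so updating it never touches a neuron.

```python
# Minimal retrieval-augmented generation sketch: fetch relevant facts from an
# external store at query time and prepend them to the prompt, instead of
# trying to bake them into model weights.
knowledge_base = [
    "Acme Corp's refund window is 30 days from delivery.",
    "Acme Corp ships to the US, Canada, and the EU.",
    "Support hours are 9am to 5pm Eastern, Monday through Friday.",
]

def retrieve(query, docs, k=1):
    """Return the k docs sharing the most words with the query
    (a stand-in for embedding similarity search)."""
    q = set(query.lower().split())
    scored = sorted(docs, key=lambda d: -len(q & set(d.lower().split())))
    return scored[:k]

def build_prompt(query, docs):
    context = "\n".join(retrieve(query, docs))
    return f"Answer using this context:\n{context}\n\nQuestion: {query}"

prompt = build_prompt("What is the refund window?", knowledge_base)
print(prompt)   # the retrieved fact rides along with the question
```

To "teach" the model a new refund policy, you edit one string in the store; the model itself is never at risk.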

  • Retrieval-Augmented Generation (RAG) dynamically augments the model with knowledge pulled from external databases at query time, so information lives outside the weights and can be updated freely.

  • Adapter Modules and LoRA (Low-Rank Adaptation) insert new knowledge through specialized, isolated subnetworks, leaving existing neurons untouched. This is best for stuff like formatting, specific chains, etc- all of which don’t require a complete neural network update.

  • Prompt Engineering guides the model toward better answers using the knowledge it already has, without touching a single weight.

These techniques recognize neurons for what they truly are: finite, precious, and densely packed resources best left intact whenever possible. There are many others that we will cover in depth in AI Made Simple, but these 3 are techniques that most teams will be able to get started with without extensive AI expertise (there are frameworks/services for stuff like LoRA nowadays, and while very complex RAG requires setup/tuning, the basics are now very easy to get going).
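The LoRA trick is simple enough to sketch directly: freeze the pretrained weight matrix W and learn only a low-rank update B @ A alongside it, so the effective weights are W + B @ A. All dimensions and the rank below are arbitrary choices for the example.

```python
import numpy as np

# Sketch of the LoRA idea: the base weights W are frozen; only the small
# matrices A and B are trainable. B is zero-initialized so the adapter
# starts as an exact no-op.
rng = np.random.default_rng(2)

d_out, d_in, rank = 16, 32, 4

W = rng.normal(size=(d_out, d_in))         # pretrained weights: never updated
A = rng.normal(size=(rank, d_in)) * 0.01   # trainable down-projection
B = np.zeros((d_out, rank))                # trainable up-projection, zero-init

def forward(x):
    return W @ x + B @ (A @ x)             # base path + low-rank adapter path

x = rng.normal(size=d_in)
base_out = W @ x
adapted_out = forward(x)

# With B at zero the adapter changes nothing; training then updates only
# A and B: rank * (d_in + d_out) parameters instead of d_in * d_out.
print(np.allclose(base_out, adapted_out))    # True before any training
print(rank * (d_in + d_out), "vs", d_in * d_out)
```

The base model's neurons are never written to, which is exactly the property this article is arguing for; to undo or swap the new behavior, you drop or exchange the adapter.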

Fine-tuning isn’t knowledge injection — it’s knowledge overwrite. For advanced LLMs, neurons are no longer neutral placeholders; they’re highly specialized, densely interconnected repositories of valuable information. Carelessly updating them risks catastrophic, invisible damage.

If your goal is to build adaptable, scalable, and robust systems, treat fine-tuning with the caution it deserves. Embrace modular solutions (software principles don’t disappear just b/c we’re working on AI) that maintain the integrity of your network’s foundational knowledge. Otherwise, you’re simply dismantling your carefully constructed knowledge ecosystem — one neuron at a time.

Thank you for being here, and I hope you have a wonderful day.

If you have a lot of money to burn, let’s just go to Vegas instead for Market Research

Dev ❤

I put a lot of work into writing this newsletter. To do so, I rely on you for support. If a few more people choose to become paid subscribers, the Chocolate Milk Cult can continue to provide high-quality and accessible education and opportunities to anyone who needs it. If you think this mission is worth contributing to, please consider a premium subscription. You can do so for less than the cost of a Netflix Subscription (pay what you want here).

Subscribe

If you liked this article and wish to share it, please refer to the following guidelines.

That is it for this piece. I appreciate your time. As always, if you’re interested in working with me or checking out my other work, my links will be at the end of this email/post. And if you found value in this write-up, I would appreciate you sharing it with more people. It is word-of-mouth referrals like yours that help me grow. The best way to share testimonials is to share articles and tag me in your post so I can see/share it.

Share

Use the links below to check out my other content, learn more about tutoring, reach out to me about projects, or just to say hi.

Small Snippets about Tech, AI and Machine Learning over here

AI Newsletter- https://artificialintelligencemadesimple.substack.com/

My grandma’s favorite Tech Newsletter- https://codinginterviewsmadesimple.substack.com/

My (imaginary) sister’s favorite MLOps Podcast-

Check out my other articles on Medium. : https://rb.gy/zn1aiu

My YouTube: https://rb.gy/88iwdd

Reach out to me on LinkedIn. Let’s connect: https://rb.gy/m5ok2y

My Instagram: https://rb.gy/gmvuy9

My Twitter: https://twitter.com/Machine01776819
