Context Is Software, Weights Are Hardware

Original link: https://www.aravindjayendran.com/writing/context-is-not-learning

## Beyond Longer Context: Why LLMs Need Weight Updates

The dominant approach to improving LLM learning today is to grow the context window, riding advances in KV cache compression and efficient attention. This assumes that merely adding enough context length eliminates the need for weight updates, which is a flawed assumption. Context and weights both shape a transformer's internal representations (its activations), but they operate differently: context provides a *temporary* activation shift via the KV cache, functionally mimicking a single step of gradient descent, while weight updates make *permanent* changes to the model's core computation.

Impressive as it is, context is essentially "software" running on the model's "hardware" (the frozen weights). It excels within the model's pretraining distribution but hits a ceiling on tasks that require novel internal representations. Weight modification, by contrast, *redesigns* the hardware, enabling entirely new computations.

Weight-based learning is also more efficient: knowledge is compiled into the model (O(1) cost), whereas long context incurs an ongoing attention cost (O(n)). Ultimately both matter: long context provides working memory, while weight updates enable persistent knowledge accumulation and generalisation. Just as the brain relies on both fast, temporary and slow, persistent memory, LLMs need both context *and* weight-space learning to achieve truly capable, adaptive intelligence.

Hacker News discussion (submitted by maxaravind, the author): "Author here. I've been thinking about continual learning over the past weekend. Many assume we can solve long-term memory and learning in LLMs simply by scaling context length to infinity. I analysed a different perspective that challenges this assumption. Let me know what you think."

Original text

The default answer to "how do we make LLMs learn?" (continual learning, not training from scratch) is: make context windows longer. Opus 4.7 has 1M tokens. KV cache compression keeps improving. Linear attention variants are making long context computationally cheap. The implicit assumption behind all of this: if context gets long enough and cheap enough, you don't need the model to update its own weights.

This assumption is incomplete in a way that matters. To see why, we need to look at what context and weights actually do inside a transformer. They're more similar than most people realise, and the difference between them is more fundamental than "one is temporary and the other is permanent."

What context and weights actually do

Every layer in a transformer produces activations: internal representations that flow forward to the next layer and ultimately determine the output. Both context (via the KV cache) and weights shape these activations. They're doing the same job through different mechanisms.

When you fine-tune a model, you change its weights. This changes how every input gets transformed at every layer. The activations shift toward a new distribution. Show the model enough examples of legal contracts, and its internal representations reorganise to process legal language more effectively. This shift is permanent.

When you do in-context learning, the KV cache fills with key-value pairs from your context. These cached representations influence how the model processes subsequent tokens through attention. The activations shift, sometimes dramatically. Few-shot prompting works precisely because those cached examples steer the model's internal computation toward the demonstrated pattern. But clear the context, and the activations revert to their default state.

Same job. Different mechanism. One permanent, one temporary.

Two Paths, Same Shift

Fine-tuning and in-context learning both modulate activations. One is permanent, one is temporary.

[Figure: two paths to the same shift. Fine-tuning (SFT): input flows through a model whose weights are changed, so the activations shift permanently. In-context learning: input plus context flows through a model with frozen weights, filling the KV cache, so the activations shift temporarily and the shift clears with the context. Same activation shift, different mechanism; one survives the session, one doesn't. Von Oswald et al. (2023): for linear attention, the ICL shift approximates one step of gradient descent.]

This equivalence isn't just conceptual. Von Oswald et al. (2023) proved that for linear self-attention, the activation shift from in-context learning is mathematically equivalent to one step of gradient descent, the same operation used in fine-tuning. The KV cache is, in a real sense, a transient weight update. Mahankali et al. proved this is optimal for one-layer linear transformers.
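The equivalence is easy to verify in its simplest setting. Below is a minimal numpy sketch of the construction, assuming one linear-attention head (no softmax), squared loss, and a `W = 0` initialisation; all sizes and the learning rate are illustrative. Keys and values are the in-context pairs, and the attention readout matches the prediction after exactly one gradient step:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 4, 16                 # input dimension, number of in-context examples
eta = 0.1                    # learning-rate analogue

X = rng.normal(size=(n, d))  # in-context inputs -> keys
Y = rng.normal(size=(n, 1))  # in-context targets -> values
x_q = rng.normal(size=(d, 1))  # query token

# Linear self-attention: sum_i v_i * (k_i . q), scaled by eta.
attn_out = eta * sum(Y[i] * (X[i] @ x_q) for i in range(n))

# One gradient-descent step from W = 0 on L(W) = 1/2 sum_i ||W x_i - y_i||^2,
# then predict on the query: W = eta * sum_i y_i x_i^T.
W = eta * (Y.T @ X)
gd_out = W @ x_q

assert np.allclose(attn_out, gd_out)  # identical predictions
```

Real multi-layer softmax transformers are messier, but this is the core of the "KV cache as a transient weight update" claim.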

So if they're doing the same thing, why does it matter which one we use?

Software vs hardware

Think of the frozen weights as hardware: they define what computations the model can perform, what patterns it can recognise, what representations it can build. Context is software running on that hardware.

A well-designed general-purpose processor can run a huge variety of programs. x86 handles word processing, video rendering, ML inference. The instruction set is rich enough to express almost anything. Similarly, a well-pretrained LLM handles an impressive range of tasks through in-context learning: translation, reasoning, code generation, style transfer.

But software has limits that hardware doesn't. If the chip lacks a floating-point unit, your software float emulation works, but it's slow and limited in precision. If the processor lacks SIMD instructions, your matrix multiply runs, but it's orders of magnitude slower than dedicated silicon. The program can be arbitrarily long. The instruction set is fixed.

Weight modification adds new instructions to the architecture. It's not writing a longer program. It's redesigning the chip.

The strongest case for longer context

Before arguing that weights matter, we should honestly engage with why longer context is so compelling.

Modern LLMs aren't random chips running arbitrary programs. They're specifically designed for in-context learning. During pretraining on trillions of tokens, the model's weights and KV cache co-evolve. The model learns to be a powerful meta-learner: its weights are optimised to make context-based activation shifts as expressive as possible. The "instruction set" was designed with this exact use case in mind.

This means a well-pretrained model's "hardware" supports a remarkably wide range of in-context "programs." Within the space of tasks the pretraining data covered, ICL can be highly expressive. For most of these tasks, you might genuinely not need weight changes.

You could also argue that persistence (context is temporary, weights are permanent) is just an engineering problem. Prefix caching, KV serialisation, or simply recomputing from stored text all give you persistence without touching weights. And there's an interpretability advantage: context is human-readable. You can inspect what the model was told. Weight changes are opaque.
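The persistence-as-engineering point is easy to illustrate: a KV cache is just a stack of tensors, so it can be written to disk and reloaded in a later session without touching any weights. A toy sketch with purely illustrative shapes (layers x K/V x tokens x head dimension):

```python
import numpy as np

# Fake KV cache: 32 layers, K and V, 1024 cached tokens, head dim 128.
kv_cache = rng_cache = np.random.default_rng(2).normal(
    size=(32, 2, 1024, 128)
).astype(np.float16)

np.save("session_kv.npy", kv_cache)   # persist at end of session
restored = np.load("session_kv.npy")  # reload in a later session

assert np.array_equal(kv_cache, restored)  # context survives, weights untouched
```

Production prefix caching is far more involved (paging, eviction, cross-request sharing), but the principle is the same: persistence alone does not require weight updates.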

These are real arguments. The "just make context longer" position is stronger than most weight-space advocates admit. Letta's "Continual Learning in Token Space" makes this case explicitly, and it's worth reading even if you disagree.

The ceiling

But the frozen weights are, in effect, a fixed meta-learning algorithm. Pretraining optimised that algorithm for the pretraining distribution. Within that distribution, it's powerful. At the boundary, it hits a ceiling.

When the target behaviour requires internal representations that pretraining didn't develop, because the data didn't contain them, no amount of context can conjure those representations. The program can be infinitely long. If the instruction set doesn't have the right primitives, it can't compute the target function.

The Meta-Learner Ceiling

Frozen weights define a fixed meta-learning algorithm. Context can reach anywhere within its range. Weights extend beyond it.

[Figure: tasks reachable by context (ICL) sit inside the pretraining distribution: translation, QA, math, code, reasoning, style, general domain knowledge. At the boundary is the ceiling, the fixed meta-learner boundary. Beyond it lie tasks reachable only by weight modification: your specific codebase, your domain's nuances, how you think about problems, your company's internal jargon. The long tail of human specificity is, by definition, not well covered by general pretraining.]

Think about what falls outside the pretraining distribution. Not exotic things. Mundane, specific things: the architectural conventions of your particular codebase, the regulatory nuances of your specific industry, how you personally think about problems, the internal jargon of your ten-person team. General pretraining covers general knowledge well. The long tail of human specificity is, by definition, not well-covered.

This is where weight modification matters most. It doesn't just run a different program on the same chip. It adds new circuits. Internal representations that didn't exist before. Computational pathways the pretrained meta-learner never developed. The model doesn't just behave differently; it becomes capable of computations it couldn't perform before.

Fine-tuned models consistently outperform prompted models on distribution-shifted tasks, even with very long context. This is the empirical signature of the ceiling.

Even within the ceiling, weights win

Grant, for the sake of argument, that context can express everything weights can. The cost is still fundamentally different.

Inference cost. Knowledge in weights is O(1) per forward pass. It's baked into the computation. Context-based knowledge requires attending over the full KV cache on every token generation: O(n), where n is the context length. For a million-token context, every single output token pays the cost of attending over a million cached entries.
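As a back-of-the-envelope sketch of the O(n) term (one attention layer, one head, ignoring the Q/K/V projections; `d_model` is an illustrative hidden size), the per-token cost of context-based knowledge grows linearly with the cache:

```python
def attention_flops_per_token(context_len: int, d_model: int) -> int:
    """Rough multiply-add count for one generated token attending over a
    KV cache of `context_len` entries: the QK^T scores and the
    value-weighted sum each cost ~context_len * d_model operations."""
    return 2 * context_len * d_model

d = 4096  # illustrative hidden size
for n in (1_000, 100_000, 1_000_000):
    print(f"{n:>9} cached tokens -> {attention_flops_per_token(n, d):,} flops/token")
```

Knowledge baked into the weights adds nothing to this per-token term; a million-token cache pays it on every single output token.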

Compression. A LoRA adapter encoding a complex domain adaptation is kilobytes. The equivalent context, enough examples to get the same behaviour through ICL, might be millions of tokens. Same function, orders of magnitude less storage.
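The storage gap is simple arithmetic. The sketch below compares a hypothetical rank-8 LoRA on the four attention projections of a 4096-wide, 32-layer model against a million-token KV cache at the same width; every number is illustrative and fp16 is assumed throughout:

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameter count of one LoRA adapter pair: A (rank x d_in) plus
    B (d_out x rank)."""
    return rank * d_in + d_out * rank

# Rank-8 adapters on Q, K, V, O projections across 32 layers.
per_matrix = lora_params(4096, 4096, rank=8)
total = per_matrix * 4 * 32
adapter_mb = total * 2 / 1e6            # fp16 bytes -> MB

# Million-token KV cache: K and V, per layer, per token, at d_model width.
kv_values = 2 * 32 * 1_000_000 * 4096
kv_gb = kv_values * 2 / 1e9             # fp16 bytes -> GB

print(f"LoRA adapter: ~{adapter_mb:.0f} MB; 1M-token KV cache: ~{kv_gb:.0f} GB")
```

Under these assumptions the adapter is four to five orders of magnitude smaller than the cache encoding the equivalent context.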

Composability. Weight updates compound. Step 1 changes the model's computation. Step 2 operates on the new computation. Each update builds on the last, navigating to regions of function space that a single context window, no matter how long, cannot reach. This is the practical consequence of the Von Oswald result: in-context learning approximates one step of gradient descent. Real learning takes many steps, each compounding.
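The compounding claim can be checked directly: iterating the gradient step that a context window can only approximate once reaches solutions a single step cannot. A minimal numpy sketch on a toy least-squares problem (all dimensions and step sizes illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 4, 64
X = rng.normal(size=(n, d))
w_true = rng.normal(size=(d, 1))
Y = X @ w_true                          # noiseless linear targets

def gd(steps: int, eta: float = 0.005) -> np.ndarray:
    """Run `steps` of gradient descent on 1/2 ||X W - Y||^2 from W = 0."""
    W = np.zeros((d, 1))
    for _ in range(steps):
        W -= eta * X.T @ (X @ W - Y)
    return W

def err(W: np.ndarray) -> float:
    return float(np.linalg.norm(X @ W - Y))

one, many = err(gd(1)), err(gd(500))
assert many < one  # compounded updates reach what a single step cannot
```

In the linear-attention correspondence, `gd(1)` is the best a single context window can do; `gd(500)` is what sequential weight updates buy.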

These aren't engineering quibbles. When something moves from software to hardware, from interpreted to native, the efficiency difference is the point. Nobody dismisses GPUs as "just an optimisation over CPU matrix multiply." The speed difference enables qualitatively different applications.

An honest acknowledgment

We don't have a clean theorem proving that the function class reachable by weight modification strictly contains the class reachable by context modulation for well-trained models. The formal separation is an open research question.

What we have: Von Oswald's result (ICL approximates one gradient descent step, and one step can only move a bounded distance in function space), consistent empirical evidence (fine-tuned models outperform prompted models on distribution-shifted tasks across model scales), and the architectural reality that context changes the input to frozen computation while weight modification changes the computation itself. Real transformers are messier than the linear attention proofs, but the gap is observable in practice.

Proving or disproving this separation formally would be a meaningful contribution to the theory of in-context learning. The practical gap is real today, even if the theory hasn't caught up.

The answer is both

This isn't context versus weights. It's context and weights.

Evolution faced the same engineering constraint: fast ephemeral memory (hippocampus) is cheap but bounded, slow persistent memory (neocortex) is expensive but vast. The solution wasn't to make the hippocampus infinitely large. It was to develop a transfer policy (sleep consolidation) that moves the right things from fast storage to slow storage. The brain has both memory systems. They're complementary, not competing.

Context windows will keep getting longer. Good. They solve the working memory problem: holding the information you need for the immediate task. Weight-space learning solves a different problem: accumulating knowledge that persists, generalises, and becomes native to the model's computation. Both are necessary. Neither is sufficient.

If you're curious about how weight-space learning could actually work, I wrote about that in Language Models Are Few-Shot Learners — They Just Can't Remember: three existing research threads that, combined, point toward a concrete mechanism. The few-shot learners are coming. They just need both memory systems.
