“Emergent” abilities in LLMs actually develop gradually and predictably – study

原始链接: https://www.quantamagazine.org/how-quickly-do-large-language-models-learn-unexpected-skills-20240213/

This article discusses a recent study suggesting that apparent "emergence" in large language models may be an artifact of how their performance is measured, rather than an inherent property of the models themselves. The authors argue that their findings challenge the popular narrative of unexpected abilities appearing suddenly and unpredictably in these systems, though some experts disagree, maintaining that the evidence for emergence remains strong. The debate underscores the importance of developing robust methods for predicting and understanding the behavior of advanced AI.

I appreciate the discussion here and the range of viewpoints raised. In the context of large language models, the term "emergent abilities" seems to cause some confusion because it is interpreted in different ways. Some take it to mean the sudden appearance of new behaviors that were never explicitly taught, while others use it to describe complex collective behavior arising from the interaction of many relatively simple units. With that in mind, the main focus of the study mentioned in the article appears to be understanding the emergent behavior observed in large language models in connection with mathematical problem solving, especially arithmetic. The authors note that existing research shows these models can perform arithmetic computations, but the underlying mechanisms remain unclear. The study therefore aims to explore these emergent arithmetic abilities as a potential indicator of broader reasoning skills, rather than as a goal in themselves. Moreover, the findings may offer insight into how language models process and represent information, allowing us to better understand how they work and to improve their capabilities. Overall, the focus is on investigating a previously unexplored area of emergent mathematical ability, pointing to potential new avenues for research and progress in AI.

Original article

Two years ago, in a project called the Beyond the Imitation Game benchmark, or BIG-bench, 450 researchers compiled a list of 204 tasks designed to test the capabilities of large language models, which power chatbots like ChatGPT. On most tasks, performance improved predictably and smoothly as the models scaled up — the larger the model, the better it got. But on other tasks, the jump in ability wasn't smooth. Performance remained near zero for a while, then jumped. Other studies found similar leaps in ability.

The authors described this as “breakthrough” behavior; other researchers have likened it to a phase transition in physics, like when liquid water freezes into ice. In a paper published in August 2022, researchers noted that these behaviors are not only surprising but unpredictable, and that they should inform the evolving conversations around AI safety, potential and risk. They called the abilities “emergent,” a word that describes collective behaviors that only appear once a system reaches a high level of complexity.

But things may not be so simple. A new paper by a trio of researchers at Stanford University posits that the sudden appearance of these abilities is just a consequence of the way researchers measure the LLM’s performance. The abilities, they argue, are neither unpredictable nor sudden. “The transition is much more predictable than people give it credit for,” said Sanmi Koyejo, a computer scientist at Stanford and the paper’s senior author. “Strong claims of emergence have as much to do with the way we choose to measure as they do with what the models are doing.”

We’re only now seeing and studying this behavior because of how large these models have become. Large language models train by analyzing enormous datasets of text — words from online sources including books, web searches and Wikipedia — and finding links between words that often appear together. The size is measured in terms of parameters, roughly analogous to all the ways that words can be connected. The more parameters, the more connections an LLM can find. GPT-2 had 1.5 billion parameters, while GPT-3.5, the LLM that powers ChatGPT, uses 350 billion. GPT-4, which debuted in March 2023 and now underlies Microsoft Copilot, reportedly uses 1.75 trillion.

That rapid growth has brought an astonishing surge in performance and efficacy, and no one is disputing that large enough LLMs can complete tasks that smaller models can’t, including ones for which they weren’t trained. The trio at Stanford who cast emergence as a “mirage” recognize that LLMs become more effective as they scale up; in fact, the added complexity of larger models should make it possible to get better at more difficult and diverse problems. But they argue that whether this improvement looks smooth and predictable or jagged and sharp results from the choice of metric — or even a paucity of test examples — rather than the model’s inner workings.

Three-digit addition offers an example. In the 2022 BIG-bench study, researchers reported that with fewer parameters, both GPT-3 and another LLM named LaMDA failed to accurately complete addition problems. However, when GPT-3 was trained using 13 billion parameters, its ability changed as if with the flip of a switch. Suddenly, it could add — and LaMDA could, too, at 68 billion parameters. This suggests that the ability to add emerges at a certain threshold.

But the Stanford researchers point out that the LLMs were judged only on accuracy: Either they could do it perfectly, or they couldn’t. So even if an LLM predicted most of the digits correctly, it failed. That didn’t seem right. If you’re calculating 100 plus 278, then 376 seems like a much more accurate answer than, say, −9.34.

So instead, Koyejo and his collaborators tested the same task using a metric that awards partial credit. “We can ask: How well does it predict the first digit? Then the second? Then the third?” he said.
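
To make the contrast concrete, here is a minimal Python sketch of the two scoring schemes: the all-or-nothing exact-match metric and a per-digit partial-credit metric of the kind Koyejo describes. This is illustrative only, not the researchers' code, and the function names are hypothetical.

```python
# Illustrative sketch, not the study's code: two ways to score a
# model's answer to an addition problem.

def exact_match(pred: str, target: str) -> float:
    """All-or-nothing: full credit only if every digit is right."""
    return 1.0 if pred == target else 0.0

def digit_credit(pred: str, target: str) -> float:
    """Partial credit: fraction of digit positions predicted correctly."""
    correct = sum(p == t for p, t in zip(pred, target))
    return correct / max(len(pred), len(target))

# 100 + 278 = 378. A model answering "376" gets two of three digits
# right, but exact match counts it as a total failure.
print(exact_match("376", "378"))   # 0.0
print(digit_credit("376", "378"))  # 0.666...
```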

Koyejo credits the idea for the new work to his graduate student Rylan Schaeffer, who he said noticed that an LLM’s performance seems to change with how its ability is measured. Together with Brando Miranda, another Stanford graduate student, they chose new metrics showing that as parameters increased, the LLMs predicted an increasingly correct sequence of digits in addition problems. This suggests that the ability to add isn’t emergent — meaning that it undergoes a sudden, unpredictable jump — but gradual and predictable. They find that with a different measuring stick, emergence vanishes.
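
Why would a gradual improvement look like a sudden jump under exact match? A rough way to see it: if per-digit accuracy p grows smoothly with scale, and an answer counts only when all n digits are right, the exact-match score behaves roughly like p to the power n. The Python snippet below uses hypothetical numbers, with the simplifying assumption that digit errors are independent, purely to show the shape of the effect.

```python
# Hypothetical numbers, only to illustrate the shape of the effect.
# If each digit is predicted correctly with probability p (and errors
# are independent), an n-digit answer is exactly right with probability
# p**n: near zero for most of the range, then climbing steeply as p
# approaches 1.

n = 10  # number of digits that must all be correct for an exact match
for p in [0.5, 0.7, 0.9, 0.95, 0.99]:
    print(f"per-digit accuracy {p:.2f} -> exact-match ~ {p**n:.4f}")

# per-digit accuracy 0.50 -> exact-match ~ 0.0010
# per-digit accuracy 0.70 -> exact-match ~ 0.0282
# per-digit accuracy 0.90 -> exact-match ~ 0.3487
# per-digit accuracy 0.95 -> exact-match ~ 0.5987
# per-digit accuracy 0.99 -> exact-match ~ 0.9044
```

A per-digit metric reports the smooth rise in p directly, so under that measuring stick the apparent discontinuity disappears.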
