Why Are the Critical Value and Emergent Behavior of Large Language Models (LLMs) Fake?

原始链接: https://cacm.acm.org/blogcacm/why-are-the-critical-value-and-emergent-behavior-of-large-language-models-llms-fake/

The claim that large language models (LLMs) exhibit "emergent properties" is challenged on the grounds that the reported performance gains were plotted on a misleading logarithmic scale. The sharp jumps in accuracy observed in the study by Wei et al. [1] are argued to be an artifact of using a logarithmic scale for the parameter count, where a single-unit shift on the x-axis (parameter count) represents an enormous increase in parameters. Plotted on a linear scale, the same gains appear much more gradual and expected [2]. Furthermore, the arithmetic ability often cited as an emergent property is attributed to the LLM having encountered similar operations in its training data. Where such data is scarce or the problem is complex (for example, adding very large numbers), the LLM fails, which points to knowledge recall rather than genuine reasoning. The idea that LLMs will eventually perform arithmetic correctly is characterized as implemented rather than emergent behavior. Further studies [3, 4] likewise argue that there is no emergent reasoning and that the metrics used in evaluation are flawed.


Original text

Why there are no emergent properties in Large Language Models.

We heard a lot about emergent properties of Large Language Models (LLMs) last year. I will share my thoughts, and those of some other scientists, on why there are no emergent properties, and especially why the assumed critical value that these so-called emergent properties are based upon is not substantial.

The excitement about emergent properties started with a paper by [1], where the authors show that scaling LLMs beyond a specific size (which they claim is critical) makes the system exhibit unexpected behavior, unexpected in the sense that the model was not thought capable of it, such as 'doing' arithmetic, for instance. In support of their claim, the graphs the authors provide display a sharp jump in the performance of the LLM in terms of accuracy. The problem in their demonstration is the following: they use logarithmic charts where the x-axis represents the number of weights (i.e., the parameters of the LLM's neural network) and is divided into equally spaced units 10^1, 10^2, 10^3, ..., 10^10, 10^11. The sharp jump occurs between 10^10 and 10^11 on the chart. But this single-unit shift between 10^10 and 10^11 is in fact a multiplication of 10 billion by 10, which means an increase (i.e., a shift) of 90 billion parameters! This kind of representation should have been done on a linear scale to avoid any misunderstanding of the rate of change of the system's behavior. If we redraw the graph from [1] on a linear scale, the rate of change appears almost constant [2]. The system then appears to evolve normally, as expected, and 10^10 no longer appears as a critical, alarming boundary.

Besides, expanding the system by 90 billion weights means supporting it with much more data than increasing it from 1,000 to 10,000 parameters (an increase of 9K) or from 10,000 to 100,000 (an increase of 90K), steps that do not require adding nearly as much data to the training repository as adding 90 billion parameters does. For example, the LLM was able to appear to do addition of two numbers (considered an emergent behavior) by giving the result of the addition because it had seen similar addition operations, and/or sentences containing addition operations and their results, once its training data grew enormously. A counter-example is to give it very large, complicated numbers (e.g., 126541478975317 + 97631257998631): it will not give the correct result, because there is little chance that these numbers exist in its training data, however huge, since such numbers are very specific and encountering them is extremely rare or even impossible, despite a very large corpus.
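To make the scale argument concrete, here is a minimal sketch with an invented, purely illustrative accuracy curve (none of these numbers come from [1]): accuracy is set to grow linearly with the raw parameter count, yet on a logarithmic x-axis the same points hug zero until 10^10 and then appear to jump in the final unit step, while on a linear x-axis they form a straight line with a constant rate of change.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical model sizes, one point per decade from 10^1 to 10^11 parameters.
params = np.logspace(1, 11, num=11)

# Invented accuracy curve (illustration only): accuracy grows linearly
# with the raw parameter count, i.e., at a perfectly constant rate.
accuracy = params / params.max()

fig, (ax_log, ax_lin) = plt.subplots(1, 2, figsize=(10, 4))

# Log-scale x-axis: every decade gets the same width, so the curve hugs zero
# until 10^10 and then seems to "jump" in the final unit step.
ax_log.semilogx(params, accuracy, marker="o")
ax_log.set_title("Log-scale x-axis: apparent jump")
ax_log.set_xlabel("Parameters (log scale)")
ax_log.set_ylabel("Accuracy (illustrative)")

# Linear x-axis: the very same points lie on a straight line, because the
# last step is drawn 90 billion units wide instead of one unit wide.
ax_lin.plot(params, accuracy, marker="o")
ax_lin.set_title("Linear x-axis: constant rate of change")
ax_lin.set_xlabel("Parameters (linear scale)")
ax_lin.set_ylabel("Accuracy (illustrative)")

plt.tight_layout()
plt.show()

# Absolute size of each "one unit" step on the log axis:
for lo, hi in zip(params[:-1], params[1:]):
    print(f"{lo:.0e} -> {hi:.0e}: +{hi - lo:,.0f} parameters")
```

The printout makes the asymmetry explicit: the step from 10^3 to 10^4 adds 9,000 parameters, while the step from 10^10 to 10^11 adds 90,000,000,000, even though both occupy one unit on the log axis.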

One could easily assume that, in the near future, the problem of adding two huge numbers will be handled by LLMs, for instance by lexically catching the occurrence of two numbers, transferring them to a 'cognitive' agency software module connected to the LLM that performs basic logical operations, and relaying the result of the operation back to the LLM. However, this would be called "implemented behavior," not "emergent behavior."
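As a minimal sketch of what such "implemented behavior" could look like, the following hypothetical wrapper (the function names and the regex-based routing are assumptions for illustration, not a description of any existing LLM product) lexically detects an addition expression in a prompt, computes it exactly outside the model, and hands the result back:

```python
import re

# Hypothetical pattern for catching "<number> + <number>" in a prompt.
ADDITION_PATTERN = re.compile(r"(\d+)\s*\+\s*(\d+)")

def exact_addition_module(a: int, b: int) -> int:
    """Stand-in for the 'cognitive' agency module: exact arithmetic, no LLM involved."""
    return a + b

def answer(prompt: str, llm=None) -> str:
    """Route addition questions to the exact module; defer everything else to the LLM."""
    match = ADDITION_PATTERN.search(prompt)
    if match:
        a, b = int(match.group(1)), int(match.group(2))
        return str(exact_addition_module(a, b))
    # Fallback: hand the prompt to the language model (placeholder here).
    return llm(prompt) if llm is not None else "No arithmetic expression found."

# The large-number example from above is now answered exactly, but by
# implemented routing, not by anything emerging inside the model.
print(answer("What is 126541478975317 + 97631257998631?"))  # prints 224172736973948
```

The correct answer comes from the routing and the exact arithmetic module, which is precisely why such a capability would be implemented rather than emergent.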

Last but not least, I add the following remarks from [3, 4] in support of my argument. In a series of more than 1,000 experiments, the authors of [3] found no evidence of emergent reasoning abilities in LLMs, and the authors of [4] argue that the metrics used to evaluate LLMs are the source of the emergence assumption problem.

Finally, a small aside about AI-based programming as "emergent behavior": I am not saying that if we let an AI system keep analyzing and generating new programs, it will never one day write a magnificent piece of code; in a slight and ironic comparison to the Infinite Monkey Theorem, one day it will.

References

1-Wei, J., Tay, Y., Bommasani, R., Raffel, C., Zoph, B., Borgeaud, S., … & Fedus, W. (2022). Emergent abilities of large language models. arXiv preprint arXiv:2206.07682.

2-Carter, D. (2023). There are no “emergent abilities” in LLMs. Better Programming https://betterprogramming.pub/there-are-no-emergent-abilities-in-llms-2bb42e17ce7e (Retrieved 23 January 2024)

3-Lu, S., Bigoulaeva, I., Sachdeva, R., Madabushi, H. T., & Gurevych, I. (2023). Are Emergent Abilities in Large Language Models just In-Context Learning?. arXiv preprint arXiv:2309.01809.

4-Schaeffer, R., Miranda, B., & Koyejo, S. (2023). Are emergent abilities of Large Language Models a mirage?. arXiv preprint arXiv:2304.15004.

Mario Antoine Aoun is an ACM Professional member who has been a Reviewer for ACM Computing Reviews since 2006. He has more than 25 years of computer programming experience and holds a Ph.D. in Cognitive Informatics from the Université du Québec à Montréal. His main research interest is memory modelling based on chaos theory and spiking neurons.

©2024 ACM 0001-0782/24/1

This post was published in Communications (Vol. 67, 8) at https://dl.acm.org/doi/10.1145/3674118.
