Meta has distinguished itself positively by releasing three generations of Llama, a semi-open LLM with weights available if you ask nicely (and provide your full legal name, date of birth, and full organization name with all corporate identifiers). So no, it’s not open source. Anyway, on Saturday (!), April the 5th, Meta released Llama 4.
Source: https://x.com/burkov/status/1909088837554291049
LM Arena
As has become standard practice, Meta tested the model anonymously on LM Arena before release. The model ended up second on the leaderboard, which is great, and this is where the controversy starts.
LM Arena is the most popular online benchmark, and it releases some conversations along with the associated user preferences. These two facts mean that companies are both able and willing to overfit the benchmark. If you look at the leaderboard, about half of the models there are marked as “experimental”, “preview”, or something similar. This may well mean that what you get in normal use is not what was tested on LM Arena. People usually don’t pay much attention to this when a model delivers. Llama 4 is exceptional in that it does not deliver.
By the way, that is not to say that the “experimental” Llama 4 is good. It’s just yappy, explains QKV in transformers through family reunions, and also hallucinates.
Mistakes happen
On release, Meta published a chart of various models’ performance vs. price, with Llama on top. They forgot, however, to put Gemini 2.5 on the chart. They probably ran out of vertical space and Gemini 2.5 just didn’t fit, being about 20 points above Llama. Honest mistakes like this happen, honestly, even to honest people, you know.
Source: https://x.com/AIatMeta/status/1908618302676697317
Early nail in the coffin
On April 8, LM Arena tweeted this:
Meta’s interpretation of our policy did not match what we expect from model providers. Meta should have made it clearer that “Llama-4-Maverick-03-26-Experimental” was a customized model to optimize for human preference. As a result of that we are updating our leaderboard policies to reinforce our commitment to fair, reproducible evaluations so this confusion doesn’t occur in the future.
In other words, Meta cheated so badly that it had to be officially slapped. The “real” Maverick, 400B parameters in size, ended up just behind Athene and Hunyuan, models nobody has ever heard of, and just in front of GPT-4o mini, which is, according to itself, a model with 1.3B or 1.5B or 1.7B parameters, depending on when you ask.
Source: https://x.com/kalomaze/status/1911846553435967646
Mixture of Experts
Llama 4 comes in three versions: Scout, Maverick, and the not-yet-released Behemoth. Why these names instead of simply Llama 4 109B / 400B / 2T? Apparently Meta is not too proud of these parameter counts, or perhaps of the performance relative to the size, and just straight up lies on Hugging Face:
We are launching two efficient models in the Llama 4 series, Llama 4 Scout, a 17 billion parameter model with 16 experts, and Llama 4 Maverick, a 17 billion parameter model with 128 experts.
Imagine the tremendous surprise that unsuspecting visitors experience when they go to the Llama-4-Maverick-17B-128E-Instruct repo and see 55 model part files, most of them either 21.5 GB or 10.7 GB. Bizarre!
In reality, Maverick has about 400B parameters in total, with 17B active for any given token.
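You don’t even need the config to see it; the shard sizes give it away. A quick back-of-the-envelope check, assuming bfloat16 weights at 2 bytes per parameter:

```python
# Back-of-the-envelope check: how much disk do 400B vs. 17B parameters
# take in bfloat16 (2 bytes per parameter)?
GB = 1024**3

def bf16_size_gib(n_params: float) -> float:
    """Approximate on-disk size of n_params bfloat16 weights, in GiB."""
    return n_params * 2 / GB

print(f"400B params ≈ {bf16_size_gib(400e9):.0f} GiB")   # ~745 GiB
print(f" 17B params ≈ {bf16_size_gib(17e9):.0f} GiB")    # ~32 GiB

# 55 shards of roughly 10–21 GB each add up to the 400B ballpark,
# nowhere near a 17B model.
```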
This comes from the fact that Llama 4, unlike the previous versions, is a Mixture of Experts (MoE) model. MoE models use only a portion of their parameters for each token during inference. As far as we understand, MoE is both theoretically and practically a better trade-off than dense models at this scale. The best open-source model, DeepSeek V3/R1, is MoE too. DeepSeek, however, does not claim that its 685B model is a 37B model.
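For readers who haven’t looked inside an MoE model, here is a minimal, toy-sized sketch of top-k expert routing (illustrative dimensions and layer structure, not Llama 4’s actual architecture). Every expert’s weights exist and have to be stored, but each token is routed through only a couple of them, which is where the “total vs. active” distinction comes from.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMoELayer(nn.Module):
    """Minimal top-k mixture-of-experts feed-forward layer (toy sizes)."""

    def __init__(self, d_model=64, d_ff=256, n_experts=16, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts)  # picks experts per token
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):  # x: (n_tokens, d_model)
        scores = self.router(x)                          # (n_tokens, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)   # route each token to top_k experts
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e                    # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, k:k + 1] * expert(x[mask])
        return out

layer = ToyMoELayer()
total = sum(p.numel() for p in layer.parameters())
active = sum(p.numel() for p in layer.router.parameters()) + \
         layer.top_k * sum(p.numel() for p in layer.experts[0].parameters())
print(f"total params: {total:,}, active per token: ~{active:,}")
```

All 16 experts sit in memory; only 2 of them do work for any given token. Scale the toy numbers up and you get Maverick’s 400B total / 17B active split.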
Memory vs MoE and long context
In practice, MoE models run faster at inference, but all the parameters still have to sit in GPU memory. You know what else has to be in memory? Context.
Llama 4 Scout advertises an impressive context length of 10M tokens, theoretically surpassing Google Gemini, which had been leading in this department. However, there are two problems. One is that the context has to fit in memory together with the model, and API providers won’t enable such a long context for practical reasons.
Another is that just because a model has a long context doesn’t mean it can use it effectively. An early benchmark indeed suggests that Llama 4 is remarkably bad at processing long contexts.
Source: https://fiction.live/stories/Fiction-liveBench-Mar-25-2025/oQdzQvKHw8JyXbN87
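Back to the memory problem for a moment. Here is a rough sketch of weight plus KV-cache memory at long context; the layer count, KV-head count, and head dimension below are placeholder guesses for illustration, not Llama 4’s published config:

```python
# Rough memory estimate: model weights + KV cache at a given context length.
# All hyperparameters below are illustrative placeholders, not Llama 4's
# actual configuration.
GB = 1024**3

def weights_gib(n_params, bytes_per_param=2):  # bf16 weights
    return n_params * bytes_per_param / GB

def kv_cache_gib(ctx_len, n_layers=48, n_kv_heads=8, head_dim=128,
                 bytes_per_value=2):
    # 2x for keys and values, per layer, per token
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_value / GB

print(f"weights (400B, bf16):   {weights_gib(400e9):7.0f} GiB")
print(f"KV cache @ 128K tokens: {kv_cache_gib(128_000):7.0f} GiB")
print(f"KV cache @ 10M tokens:  {kv_cache_gib(10_000_000):7.0f} GiB")
```

Even with made-up but plausible numbers, the cache at 10M tokens dwarfs the model itself, which is consistent with providers not offering anything close to the advertised context.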
More benchmarks
On the Artificial Analysis benchmark, Scout achieved the same score as GPT-4o mini. A 109B model vs. a 1.5B model (allegedly). This is ABYSMAL. It’s embarrassing.
Source: https://x.com/ArtificialAnlys/status/1908890796415414430
Artificial Analysis later adjusted the prompt format and Llama 4 models scored better.
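Prompt formatting is easy to get wrong with a freshly released model. A minimal sketch of the safer route, letting the tokenizer’s bundled chat template build the prompt instead of hand-rolling special tokens (the meta-llama org prefix in the model ID is assumed):

```python
from transformers import AutoTokenizer

# Let the model's own chat template do the formatting rather than guessing
# at special tokens; the template ships with the tokenizer config.
tok = AutoTokenizer.from_pretrained("meta-llama/Llama-4-Maverick-17B-128E-Instruct")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain QKV without the family reunion."},
]

prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)  # shows the exact special tokens the model expects
```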
On the Aider coding benchmark, Maverick scores about the same as Qwen 2.5 Coder 32B, a relatively old dense model less than a tenth of Maverick’s size.
Source: https://x.com/paulgauthier/status/1908976568879476843
People
Someone on Reddit posted this (bolding ours):
Despite repeated training efforts, the internal model’s performance still falls short of open-source SOTA benchmarks, lagging significantly behind. Company leadership suggested blending test sets from various benchmarks during the post-training process, aiming to meet the targets across various metrics and produce a “presentable” result. Failure to achieve this goal by the end-of-April deadline would lead to dire consequences. Following yesterday’s release of Llama 4, many users on X and Reddit have already reported extremely poor real-world test results.
As someone currently in academia, I find this approach utterly unacceptable. Consequently, I have submitted my resignation and explicitly requested that my name be excluded from the technical report of Llama 4. Notably, the VP of AI at Meta also resigned for similar reasons.
We don’t know if it’s true. Maybe it’s bait. However, the vice president of AI at Meta, Joelle Pineau, indeed resigned. She announced her departure on April 1, stating that her last day with the company would be May 30, 2025. Pineau had been with Meta for nearly eight years, leading the Fundamental AI Research (FAIR) group since 2023.
Source: https://x.com/DavidSHolz/status/1908603882118537709
There will probably be Llama 5. Let’s hope it will be better than Llama 4.