Meta has distinguished itself positively by releasing three generations of Llama, a semi-open LLM with weights available if you ask nicely (and provide your full legal name, date of birth, and full organization name with all corporate identifiers). So no, it’s not open source. Anyway, on Saturday (!), April the 5th, Meta released Llama 4.
Source: https://x.com/burkov/status/1909088837554291049
LM Arena
As has become standard practice, Meta tested the model anonymously on LM Arena before release. The model ended up second on the leaderboard, which is great, and this is where the controversy starts.
LM Arena is the most popular online benchmark, and it releases some conversations along with the associated user preferences. These two facts mean that companies are both able and willing to overfit the benchmark. If you look at the leaderboard, about half of the models there are marked as “experimental”, “preview”, or something similar. This may well mean that what you get in normal use is not what was tested on LM Arena. People usually don’t pay much attention to this when a model delivers. Llama 4 is exceptional in that it does not deliver.
By the way, that is not to say that the “experimental” Llama 4 is good. It’s just yappy, explains QKV in transformers through family reunions, and also hallucinates.
Mistakes happen
On release, Meta published a chart of various models’ performance vs. price, with Llama on top. They forgot, however, to put Gemini 2.5 on the chart. They probably ran out of vertical space and Gemini 2.5 just didn’t fit, being about 20 points above Llama. Honest mistakes like this happen, honestly, even to honest people, you know.
Source: https://x.com/AIatMeta/status/1908618302676697317
Early nail in the coffin
On April 8, LM Arena tweeted this:
Meta’s interpretation of our policy did not match what we expect from model providers. Meta should have made it clearer that “Llama-4-Maverick-03-26-Experimental” was a customized model to optimize for human preference. As a result of that we are updating our leaderboard policies to reinforce our commitment to fair, reproducible evaluations so this confusion doesn’t occur in the future.
In other words, Meta cheated so badly that it had to be officially slapped. The “real” Maverick, 400B parameters in size, ended up just behind Athene and Hunyuan, models nobody has ever heard of, and just in front of GPT-4o mini, which is, according to itself, a model with 1.3B or 1.5B or 1.7B parameters, depending on when you ask.
Source: https://x.com/kalomaze/status/1911846553435967646
Mixture of Experts
Llama 4 comes in three versions: Scout, Maverick, and the not-yet-released Behemoth. Why these names instead of simply Llama 4 109B / 400B / 2T? Apparently Meta is not too proud of these parameter counts, or perhaps of the performance relative to the size, and just straight up lies on Hugging Face:
We are launching two efficient models in the Llama 4 series, Llama 4 Scout, a 17 billion parameter model with 16 experts, and Llama 4 Maverick, a 17 billion parameter model with 128 experts.
Imagine the tremendous surprise that unsuspecting visitors experience when they go to the Llama-4-Maverick-17B-128E-Instruct repo and see 55 model part files, most of them either 21.5 GB or 10.7 GB. Bizarre!
In reality, Maverick has about 400B parameters in total, with 17B active for any given token.
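You don’t even need the config to see it; the shard sizes give it away. A quick back-of-the-envelope check, assuming bfloat16 weights at 2 bytes per parameter:

```python
# Back-of-the-envelope check: how much disk do 400B vs. 17B parameters
# take in bfloat16 (2 bytes per parameter)?
GB = 1024**3

def bf16_size_gib(n_params: float) -> float:
    """Approximate on-disk size of n_params bfloat16 weights, in GiB."""
    return n_params * 2 / GB

print(f"400B params ≈ {bf16_size_gib(400e9):.0f} GiB")   # ~745 GiB
print(f" 17B params ≈ {bf16_size_gib(17e9):.0f} GiB")    # ~32 GiB

# 55 shards of roughly 10–21 GB each add up to the 400B ballpark,
# nowhere near a 17B model.
```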
This comes from the fact that Llama 4, unlike the previous versions, is a Mixture of Experts (MoE) model. MoE models use only a portion of their parameters for each token during inference. As far as we understand, MoE is both theoretically and practically a better trade-off than dense models at this scale. The best open-source model, DeepSeek V3/R1, is MoE too. DeepSeek, however, does not claim that its 685B model is a 37B model.
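For readers who haven’t looked inside an MoE model, here is a minimal, toy-sized sketch of top-k expert routing (illustrative dimensions and layer structure, not Llama 4’s actual architecture). Every expert’s weights exist and have to be stored, but each token is routed through only a couple of them, which is where the “total vs. active” distinction comes from.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMoELayer(nn.Module):
    """Minimal top-k mixture-of-experts feed-forward layer (toy sizes)."""

    def __init__(self, d_model=64, d_ff=256, n_experts=16, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts)  # picks experts per token
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):  # x: (n_tokens, d_model)
        scores = self.router(x)                          # (n_tokens, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)   # route each token to top_k experts
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e                    # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, k:k + 1] * expert(x[mask])
        return out

layer = ToyMoELayer()
total = sum(p.numel() for p in layer.parameters())
active = sum(p.numel() for p in layer.router.parameters()) + \
         layer.top_k * sum(p.numel() for p in layer.experts[0].parameters())
print(f"total params: {total:,}, active per token: ~{active:,}")
```

All 16 experts sit in memory; only 2 of them do work for any given token. Scale the toy numbers up and you get Maverick’s 400B total / 17B active split.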
Memory vs MoE and long context
In practice, MoE models run faster at inference, but all the parameters still have to sit in GPU memory. You know what else has to be in memory? Context.
Llama 4 Scout advertises an impressive context length of 10M tokens, theoretically surpassing Google Gemini, which had been leading in this department. However, there are two problems. One is that the context has to fit in memory together with the model, and API providers won’t enable such a long context for practical reasons.
Another is that just because a model has a long context doesn’t mean it can use it effectively. An early benchmark indeed suggests that Llama 4 is remarkably bad at processing long contexts.
Source: https://fiction.live/stories/Fiction-liveBench-Mar-25-2025/oQdzQvKHw8JyXbN87
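Back to the memory problem for a moment. Here is a rough sketch of weight plus KV-cache memory at long context; the layer count, KV-head count, and head dimension below are placeholder guesses for illustration, not Llama 4’s published config:

```python
# Rough memory estimate: model weights + KV cache at a given context length.
# All hyperparameters below are illustrative placeholders, not Llama 4's
# actual configuration.
GB = 1024**3

def weights_gib(n_params, bytes_per_param=2):  # bf16 weights
    return n_params * bytes_per_param / GB

def kv_cache_gib(ctx_len, n_layers=48, n_kv_heads=8, head_dim=128,
                 bytes_per_value=2):
    # 2x for keys and values, per layer, per token
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_value / GB

print(f"weights (400B, bf16):   {weights_gib(400e9):7.0f} GiB")
print(f"KV cache @ 128K tokens: {kv_cache_gib(128_000):7.0f} GiB")
print(f"KV cache @ 10M tokens:  {kv_cache_gib(10_000_000):7.0f} GiB")
```

Even with made-up but plausible numbers, the cache at 10M tokens dwarfs the model itself, which is consistent with providers not offering anything close to the advertised context.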
More benchmarks
On the Artificial Analysis benchmark, Scout achieved the same score as GPT-4o mini. A 109B model vs. a 1.5B model (allegedly). This is ABYSMAL. It’s embarrassing.
Source: https://x.com/ArtificialAnlys/status/1908890796415414430
Artificial Analysis later adjusted the prompt format and Llama 4 models scored better.
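Prompt formatting is easy to get wrong with a freshly released model. A minimal sketch of the safer route, letting the tokenizer’s bundled chat template build the prompt instead of hand-rolling special tokens (the meta-llama org prefix in the model ID is assumed):

```python
from transformers import AutoTokenizer

# Let the model's own chat template do the formatting rather than guessing
# at special tokens; the template ships with the tokenizer config.
tok = AutoTokenizer.from_pretrained("meta-llama/Llama-4-Maverick-17B-128E-Instruct")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain QKV without the family reunion."},
]

prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)  # shows the exact special tokens the model expects
```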
On the Aider coding benchmark, Maverick scores about the same as Qwen 2.5 Coder 32B, a relatively old dense model less than a tenth of Maverick’s size.
Source: https://x.com/paulgauthier/status/1908976568879476843
People
Someone on Reddit posted this (bolding ours):
Despite repeated training efforts, the internal model’s performance still falls short of open-source SOTA benchmarks, lagging significantly behind. Company leadership suggested blending test sets from various benchmarks during the post-training process, aiming to meet the targets across various metrics and produce a “presentable” result. Failure to achieve this goal by the end-of-April deadline would lead to dire consequences. Following yesterday’s release of Llama 4, many users on X and Reddit have already reported extremely poor real-world test results.
As someone currently in academia, I find this approach utterly unacceptable. Consequently, I have submitted my resignation and explicitly requested that my name be excluded from the technical report of Llama 4. Notably, the VP of AI at Meta also resigned for similar reasons.
We don’t know if it’s true. Maybe it’s bait. However, the vice president of AI at Meta, Joelle Pineau, indeed resigned. She announced her departure on April 1, stating that her last day with the company would be May 30, 2025. Pineau had been with Meta for nearly eight years, leading the Fundamental AI Research (FAIR) group since 2023.
Source: https://x.com/DavidSHolz/status/1908603882118537709
There will probably be Llama 5. Let’s hope it will be better than Llama 4.