Meta got caught gaming AI benchmarks

Mond_ · 2025-04-08T15:59:56 1744127996

The Llama 4 launch looks like a real debacle for Meta. The model doesn't look great. All the coverage I've seen has been negative.

This is about what I expected, but it makes you wonder what they're going to do next. At this point it looks like they are falling behind the other open models, and they made an ambitious bet on MoEs that hasn't paid off.

Did Zuck push for the release? I'm sure they knew it wasn't ready yet.

JKCalhoun · 2025-04-08T15:50:46 1744127446

Is LMArena junk now?

I thought there was an aspect where you run two models on the same user-supplied query. Surely this can't be gamed?

> “optimized for conversationality”

I don't understand what that means, or how it gives the model an LMArena advantage.

light_hue_1 · 2025-04-08T16:01:40 1744128100

LMArena was always junk. I work in this space and while the media takes it seriously most scientists don't.

Random people ask random stuff and then it measures how good they feel. This is only a worthwhile evaluation if you're Google or Meta or OpenAI and you need to make a chatbot that keeps people coming back. It doesn't measure anything else useful.

etamponi · 2025-04-08T15:28:29 1744126109

Meta got caught _first_.

davidcbc · 2025-04-08T15:53:09 1744127589

Not even first. OpenAI got caught a while back.

Mond_ · 2025-04-08T15:56:43 1744127803

Do you have a source for this? That's interesting (if true).

davidcbc · 2025-04-08T16:00:53 1744128053

They got the dataset from Epoch AI for one of the benchmarks and pinky swore that they wouldn't train on it

https://techcrunch.com/2025/01/19/ai-benchmarking-organizati...

deckar01 · 2025-04-08T14:37:36 1744123056

The top of that leaderboard is filled with closed weight experimental models.

bn-l · 2025-04-08T14:05:53 1744121153

I believe this was designed to flatter the prompter more / be more ingratiating. Which is a worry if true (what it says about the people doing the comparing).

add-sub-mul-div · 2025-04-08T15:58:28 1744127908

There's no end to the possible vectors of human manipulation with this "open-weight" black box.
