(comments)

Original link: https://news.ycombinator.com/item?id=43620452

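A Hacker News thread discusses the allegation that Meta "gamed AI benchmarks" with its Llama models. Commenters express disappointment with the model's performance, saying it does not stand out among other open alternatives, and question whether the release was premature. Some doubt LMArena's value as a reliable evaluation tool, arguing that its subjectivity makes it useful only to companies focused on user engagement. The discussion also extends to OpenAI, which commenters say was likewise caught out over a benchmark dataset it had promised not to train on. One user suggests that being "optimized for conversationality" may mean flattering the prompter, which raises concerns about the motives behind benchmark comparisons. The thread closes by noting that "open-weight" black-box models open up endless avenues for manipulation.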


Original text
Meta got caught gaming AI benchmarks (theverge.com)
49 points by pseudolus 4 hours ago | 10 comments










The Llama 4 launch looks like a real debacle for Meta. The model doesn't look great. All the coverage I've seen has been negative.

This is about what I expected, but it makes you wonder what they're going to do next. At this point it looks like they are falling behind the other open models, and made an ambitious bet on MoEs, without this paying off.

Did Zuck push for the release? I'm sure they knew it wasn't ready yet.



Is LMArena junk now?

I thought there was an aspect where you run two models on the same user-supplied query. Surely this can't be gamed?

> “optimized for conversationality”

I don't understand what that means - how it gives it an LMArena advantage.



LMArena was always junk. I work in this space and while the media takes it seriously most scientists don't.

Random people ask random stuff and then it measures how good they feel. This is only a worthwhile evaluation if you're Google or Meta or OpenAI and you need to make a chatbot that keeps people coming back. It doesn't measure anything else useful.



Meta got caught _first_.


Not even first; OpenAI got caught a while back.


Do you have a source for this? That's interesting (if true).


They got the dataset from Epoch AI for one of the benchmarks and pinky swore that they wouldn't train on it

https://techcrunch.com/2025/01/19/ai-benchmarking-organizati...



The top of that leaderboard is filled with closed weight experimental models.


I believe this was designed to flatter the prompter more / be more ingratiating. Which is a worry if true (what it says about the people doing the comparing).


There's no end to the possible vectors of human manipulation with this "open-weight" black box.





