Selection rather than prediction

原始链接: https://voratiq.com/blog/selection-rather-than-prediction/

## Coding Agent Selection: Beyond the Leaderboard

The question "which coding agent is best?" is misleading. Performance varies by language, task type, and even over time, making any single "best" choice unreliable. Rather than *predicting* the best agent, a more effective approach is to *select* from a pool of candidates, a "best-of-N" strategy.

This means running multiple agents in parallel on the same task and having a human reviewer pick the best implementation. The process not only delivers higher-quality code but also generates valuable evaluation data based on what actually gets merged.

An analysis of 18 agents across 211 tasks reveals performance tiers, with a clear gap between the top tier and the rest. Even within the top tier, however, rankings are noisy and confidence intervals overlap. Running a cohort substantially improves the win rate: the top agent alone wins 24% of the time, a cohort of three reaches 51%, and a cohort of seven reaches 91%.

The takeaway: running a small cohort of the best-performing agents, prioritizing the first few, greatly improves the odds of success, more than justifies the cost of the extra tokens, and cuts down on expensive human engineering time.


Original article

Coding agents are getting quite good, and the question everyone asks is: which one should I use?

However, agent performance varies considerably by language, task type, and time. When you commit to a single agent, you're predicting it will be best for whatever task you throw at it.

That bet might be informed by evals, experience, or word of mouth. But the variance is high enough that you'll often be wrong.

Selection sidesteps the prediction problem. Generate many candidate implementations, and choose from the pool of solutions. This converts the prediction problem into an optimization problem.

So, we think the question to ask instead is: how many agents should I use, and which ones?

This is often called "best-of-N": run N parallel attempts (here, across different models), then select the best output.

Agents Compete, Humans Arbitrate

We've been running this workflow for a few months now. Here's what it looks like:

[Figure: flowchart of the "Agents Compete, Humans Arbitrate" workflow]

We write a spec for the task and fan it out to multiple agents in parallel. Each agent works in its own isolated worktree and runs the repo's evals. A human reviewer then looks at the diffs, picks the best implementation, and applies that patch. The agent whose diff gets applied is the winner.

This is best-of-N with a human judge.
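
To make the mechanics concrete, here is a minimal sketch of the fan-out step, assuming each agent can be driven from a CLI. The `run-agent` command, the worktree layout, and the types are placeholders for illustration, not our actual tooling.

```typescript
// Hypothetical fan-out: spawn each agent in its own git worktree against the
// same spec, then collect each agent's diff for human review.
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const run = promisify(execFile);

interface CandidateDiff {
  agent: string;
  diff: string;
}

async function fanOut(specPath: string, agents: string[]): Promise<CandidateDiff[]> {
  return Promise.all(
    agents.map(async (agent) => {
      // One isolated worktree per agent so attempts don't interfere.
      const worktree = `../worktrees/${agent}`;
      await run("git", ["worktree", "add", "--detach", worktree]);

      // Placeholder CLI: point the agent at the spec inside its worktree.
      await run("run-agent", [agent, "--spec", specPath], { cwd: worktree });

      // Capture whatever the agent changed as a reviewable patch.
      const { stdout: diff } = await run("git", ["diff"], { cwd: worktree });
      return { agent, diff };
    }),
  );
}
```

The review step stays manual: read each candidate's diff, apply the best one, and record which agent produced it.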

This turns everyday work into a useful eval signal: which agent, given a real task in a real codebase, produced the code we actually merged?

The data here comes from real day-to-day work (rather than a benchmark) and spans 211 tasks across 18 agents. We track ongoing results on our leaderboard. The plots here are a snapshot.

Most tasks are full-stack TypeScript product work, usually atomically scoped features, bug fixes, or refactors that take minutes to about an hour.

Rankings Are Noisy

Once each run has a winner, we can treat it as a multi-way match and fit ratings from the outcomes.

We fit a Bradley-Terry model to the winner/loser pairs and map strengths to an Elo-style rating.
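
For readers who want the mechanics, here is a minimal sketch of that fit, using a standard MM-style update and an assumed 1500-point anchor; it illustrates the method rather than reproducing our fitting code.

```typescript
// Fit Bradley-Terry strengths from winner/loser pairs via an MM-style update,
// then map them to an Elo-style scale (a 400-point gap ~ 10x odds of winning).
type Pair = { winner: string; loser: string };

function bradleyTerryElo(pairs: Pair[], iterations = 200): Map<string, number> {
  const agents = [...new Set(pairs.flatMap((p) => [p.winner, p.loser]))];
  const strength = new Map(agents.map((a): [string, number] => [a, 1]));

  for (let iter = 0; iter < iterations; iter++) {
    const next = new Map<string, number>();
    for (const i of agents) {
      // One virtual win and loss against a fixed strength-1 opponent keeps
      // estimates finite for agents with all-win or all-loss records.
      let wins = 1;
      let denom = 2 / (strength.get(i)! + 1);
      for (const p of pairs) {
        if (p.winner !== i && p.loser !== i) continue;
        if (p.winner === i) wins += 1;
        const opponent = p.winner === i ? p.loser : p.winner;
        denom += 1 / (strength.get(i)! + strength.get(opponent)!);
      }
      next.set(i, wins / denom);
    }
    for (const a of agents) strength.set(a, next.get(a)!);
  }

  // Anchor the scale at 1500 (an arbitrary choice for this sketch).
  return new Map(
    agents.map((a): [string, number] => [a, 1500 + 400 * Math.log10(strength.get(a)!)]),
  );
}
```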

[Figure: horizontal interval plot of the 18 agents ordered by Elo-style rating (x-axis, roughly 1000 to 1800), each with a 90% confidence interval; gpt-5-2-high and gpt-5-2-codex-high sit at the top.]

Each point is an agent's rating, where higher is better. Whiskers are 90% bootstrap confidence intervals, and color indicates model family.
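
The bootstrap is a standard resampling recipe; the sketch below shows one way it could look, resampling whole runs and reusing the `bradleyTerryElo()` sketch above. The resample count and percentile handling are illustrative.

```typescript
// 90% bootstrap interval per agent: resample runs with replacement, refit the
// ratings each time, and take the 5th/95th percentiles.
function bootstrapIntervals(
  runs: Pair[][], // one entry per task: the winner/loser pairs from that run
  resamples = 1000,
): Map<string, { lo: number; hi: number }> {
  const samples = new Map<string, number[]>();

  for (let b = 0; b < resamples; b++) {
    // Resample whole runs (not individual pairs) so pairs from the same task
    // stay together.
    const picked = Array.from(
      { length: runs.length },
      () => runs[Math.floor(Math.random() * runs.length)],
    ).flat();
    for (const [agent, rating] of bradleyTerryElo(picked)) {
      if (!samples.has(agent)) samples.set(agent, []);
      samples.get(agent)!.push(rating);
    }
  }

  const intervals = new Map<string, { lo: number; hi: number }>();
  for (const [agent, values] of samples) {
    values.sort((a, b) => a - b);
    intervals.set(agent, {
      lo: values[Math.floor(0.05 * (values.length - 1))],
      hi: values[Math.ceil(0.95 * (values.length - 1))],
    });
  }
  return intervals;
}
```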

Across this data, the agents separate naturally into tiers. And in particular, there's a gap between the top tier and the rest.

Within that top tier, the confidence intervals overlap heavily. The top two agents in this snapshot, gpt-5-2-high and gpt-5-2-codex-high, have about forty points of overlap. This means first and second are not separable with confidence.

The ranking exists, but it's noisy. If you had to pick one agent based on the leaderboard, expect variance from task to task even with a top-rated model.

Selection Advantage Is Large

So, how much do you gain by running a cohort instead of a single agent?

We measure the value of a cohort by its win rate on runs where at least one member participated. For example, if you ran the top three agents (the top-3 cohort), how often would someone in that group produce the winning output?
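
As a concrete reading of that definition, here is a small sketch of the metric; the `Run` shape and field names are illustrative rather than our actual data model.

```typescript
// Cohort win rate: among runs where at least one cohort member participated,
// the fraction in which the merged diff came from a cohort member.
interface Run {
  participants: string[]; // agents that produced a candidate for this task
  winner: string; // agent whose diff was merged
}

function cohortWinRate(runs: Run[], cohort: string[]): number {
  const members = new Set(cohort);
  const eligible = runs.filter((r) => r.participants.some((a) => members.has(a)));
  if (eligible.length === 0) return 0;
  const wins = eligible.filter((r) => members.has(r.winner)).length;
  return wins / eligible.length;
}
```

Sweeping cohort size in rating order, e.g. `cohortWinRate(runs, topAgents.slice(0, 3))` for the top-3 cohort, traces out the curve below.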

[Figure: scatter plot of win rate (y-axis, 0% to 100%) against cohort size (x-axis, 1 to 18); the top-18 cohort wins 100% of the time.]

The x-axis is cohort size, ordered by rating. The y-axis is win rate on runs where at least one agent from that cohort participated. A win means the merged diff came from someone in that cohort.

Across our data, the top agent alone (top-1) wins 24% of the time. Add the next two agents (top-3) and someone in that group wins 51%. Expand to seven agents (top-7), and the win rate reaches 91%.

The first few agents matter most. After seven, the curve flattens, and agents 8 through 18 improve your odds by only a small amount.

You don't need to run every agent, but running only one leaves better code on the table most of the time.

Conclusion

Coding agents cluster into performance tiers, and within the top tier the margins are thin and noisy. A leaderboard is still useful, but mostly as a way to choose a cohort instead of a single default.

The best agent for a given task is hard to predict in advance, but if you run a few from the top tier, one of them will usually get it right.

In our data, going from one agent to three roughly doubles your win rate. Going from three to seven captures most of the remaining gains.

Again, this is across our data, which reflects performance on day-to-day product work in our codebase, according to our preferences. If your workflow and domain differ meaningfully from ours, your results may differ.

Tokens are cheap, and human engineering time is expensive. It may be worth running a few more agents if it means a better foundation, less cleanup, and fewer bugs for future work.
