Even (very) noisy LLM evaluators are useful for improving AI agents

Original link: https://www.tensorzero.com/blog/even-very-noisy-llm-evaluators-are-useful-for-improving-ai-agents/

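LLM evaluators are often noisy and correlate poorly with real-world outcomes, so they are unreliable for judging individual outputs (e.g. in production guardrails). For offline tasks such as model selection or prompt optimization, however, this noise is not a fatal flaw.

The key insight is that **noise cancels out**. When comparing two agents, an evaluator's errors on individual outputs cancel out given a large enough sample. As long as the evaluator has no systematic bias favoring the worse agent, its average score reliably identifies the better-performing variant.

Empirical tests on tasks including Gridworld, Wordle, and data extraction confirm this: although output-level correlation is low, agent-level correlation is markedly stronger. In every environment tested, even noisy evaluators identified the better agent, with pairwise win rates consistently above random guessing.

**Conclusion:** Practitioners should distinguish output-level reliability (needed for production guardrails) from agent-level reliability (needed during development). As long as the evaluation dataset is large enough for the signal to emerge from the noise, even a "noisy" evaluator is an effective tool for offline model selection and incremental optimization.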

Original article

· Alan Mishler

Summary

  • LLM evaluators are often noisy and weakly correlated with real-world outcomes.

  • Noisy evaluators have limited value for production decisions that hinge on judging a single output (e.g. guardrails).

  • However, even (very) noisy evaluators can reliably tell you which agent is better on average, meaning they can still help you pick the best variant to deploy and improve it over time.

It’s surprisingly hard to develop reliable LLM evaluators: they’re often noisy and poorly correlated with the metrics or outcomes practitioners actually care about. Sometimes the target is directly measurable but evaluators still disagree with experts (e.g. on correctness or faithfulness to a source document). Other times the target is only accessible through a proxy (e.g. whether code that passes tests satisfies user needs). And sometimes the target is hard to observe at all (e.g. whether a customer was actually happy with an interaction).

Why is it so hard to develop reliable LLM evaluators?

Rule-based and classical NLP metrics are often brittle and miss the semantic dimensions that matter [1, 2]. Learned reward models are vulnerable to distribution shift [3] and reward hacking [4]. Studies of LLM-as-a-judge setups have repeatedly documented systematic biases and limitations: judges are heavily swayed by surface-level style [5], prefer longer responses to shorter ones of similar quality [6], are inconsistent across repeated evaluations and minor prompt variations [7], often align poorly with human judgments [8], and may correlate weakly with the downstream outcomes they’re meant to predict [9].

An evaluator’s quality can be measured at two granularities:

  • Output-level correlation measures how well its score on individual outputs matches real-world outcomes. It governs production workflows (e.g. guardrails), where decisions hinge on individual outputs and noisy evaluators are unreliable. We’ll call an evaluator noisy with respect to a metric or outcome of interest if its output-level correlation is low.
  • Agent-level correlation measures how well its average over many outputs matches an agent’s real-world quality. It governs offline variant selection (e.g. picking the best prompt or model), and, unlike output-level correlation, it generally climbs with sample size as per-output noise averages out.

Even very noisy evaluators can be reliable for offline selection: enough to ship better agents today and keep improving them over time.

Why noisy evaluators can still rank agents

The key insight is that even a very noisy evaluator can yield scores that are higher on average for agents that truly are higher quality: the noise washes out over many samples.

To formalize this, suppose we have two agents we want to compare, A and B. Let $\mu_A$ and $\mu_B$ denote their true mean scores.

Now suppose we have an evaluator whose scores can be regarded as noisy versions of the true scores. Here are three hypothetical samples of true scores and evaluator scores for increasingly noisy evaluators:

Hypothetical samples of true scores (x-axis) and evaluator scores (y-axis) for three evaluators with increasing noise. The dashed line marks y = x (a perfect evaluator).

The leftmost evaluator is accurate enough to judge individual outputs in production. The rightmost isn’t: its verdict on any single output is too noisy to trust.

However, if we’re using an evaluator offline to choose between A and B, then we don’t need every individual value to be accurate. We just need the evaluator to tell us which agent is better overall. All three evaluators will do that, given sufficiently large evaluation samples.

Suppose Agent A has true-score mean $\mu_A = 0.6$ and Agent B has true-score mean $\mu_B = 0.3$.

Samples drawn from the true-score distributions of Figure 2 (Agent A with mean 0.6, Agent B with mean 0.3; 30 samples per agent), with evaluator scores on the y-axis. Horizontal dashed lines mark each agent's empirical mean evaluator score ($\hat{\mu}_A$ and $\hat{\mu}_B$).

Sampling details

For each agent, we model the distribution of true scores as a Beta distribution parameterized by its mean $\mu \in (0, 1)$ and a concentration parameter $\kappa > 0$:

$$S \sim \text{Beta}\left(\kappa\mu,\; \kappa(1 - \mu)\right).$$

Even though the samples are noisier as we move from left to right, they still tend to produce the correct ordering ($\widehat{\mu}_A > \widehat{\mu}_B$). How reliably they do so depends on three factors:

  • How separated the agents are. A bigger gap between $\mu_A$ and $\mu_B$ makes the correct ordering easier to recover.
  • How noisy the evaluator is. A less noisy evaluator narrows the spread (lowers the variances) of $\widehat{\mu}_A$ and $\widehat{\mu}_B$.
  • How many evaluator samples we have. Empirical means concentrate around their expected values as the sample size $N$ grows, so larger evaluation datasets give more reliable comparisons, no matter how well separated the agents are or how noisy the evaluator.

In general, even noisy evaluators can reliably distinguish stronger from weaker agents, given a sufficiently large evaluation dataset.
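The three factors above can be explored with a small simulation. This is a minimal sketch, not the article's own code: it assumes true scores follow the Beta model from the sampling details, and it assumes a hypothetical evaluator that adds zero-mean Gaussian noise to the true score. The function names, the value `kappa=5.0`, and the noise model are all illustrative assumptions.

```python
import random
from statistics import fmean


def sample_true_score(mu, kappa, rng):
    """Draw a true score from Beta(kappa * mu, kappa * (1 - mu)), which has mean mu."""
    return rng.betavariate(kappa * mu, kappa * (1.0 - mu))


def evaluator_score(true_score, noise_sd, rng):
    """Hypothetical noise model (an assumption): true score plus zero-mean Gaussian noise."""
    return true_score + rng.gauss(0.0, noise_sd)


def empirical_mean(mu, noise_sd, n, kappa, rng):
    """Mean evaluator score over n sampled outputs from one agent."""
    scores = [
        evaluator_score(sample_true_score(mu, kappa, rng), noise_sd, rng)
        for _ in range(n)
    ]
    return fmean(scores)


def prob_correct_ordering(mu_a, mu_b, noise_sd, n, trials=1000, kappa=5.0, seed=0):
    """Fraction of simulated evaluations where agent A's mean evaluator score
    exceeds agent B's. Assumes mu_a > mu_b, so this is the chance of a correct ordering."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(trials):
        mean_a = empirical_mean(mu_a, noise_sd, n, kappa, rng)
        mean_b = empirical_mean(mu_b, noise_sd, n, kappa, rng)
        correct += mean_a > mean_b
    return correct / trials


if __name__ == "__main__":
    # Same agents and same (very noisy) evaluator, increasing sample size.
    for n in (5, 30, 200):
        print(n, prob_correct_ordering(0.6, 0.3, noise_sd=1.0, n=n))
```

Varying `noise_sd`, `n`, or the gap between `mu_a` and `mu_b` one at a time shows how each factor shifts the chance of recovering the correct ordering under these assumptions.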

How big does an evaluation dataset need to be?

The sample size required to reliably distinguish two agents scales inversely with the square of the performance gap between them: halving the gap roughly quadruples the number of samples you need. This squared scaling comes from how the sampling distribution of a mean tightens with $N$: the variance of a sample mean shrinks as $1/N$, so its standard error shrinks as $1/\sqrt{N}$. Resolving a gap half as large requires a standard error half as large, and therefore four times as many samples.
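As a rough worked version of this scaling, suppose each agent is evaluated on $N$ independent outputs, the evaluator's per-output score variances are $\sigma_A^2$ and $\sigma_B^2$, and $\Delta$ is the gap between the agents' expected evaluator scores. These symbols, the threshold $z$ (how many standard errors of separation we ask for), and the implicit normal approximation are illustrative assumptions, not taken from the original analysis.

```latex
\operatorname{SE}\!\left(\widehat{\mu}_A - \widehat{\mu}_B\right)
  = \sqrt{\frac{\sigma_A^2 + \sigma_B^2}{N}},
\qquad
\frac{\Delta}{\operatorname{SE}\!\left(\widehat{\mu}_A - \widehat{\mu}_B\right)} \ge z
\;\Longleftrightarrow\;
N \ge \frac{z^2\left(\sigma_A^2 + \sigma_B^2\right)}{\Delta^2}.
```

Replacing $\Delta$ with $\Delta/2$ multiplies the right-hand bound by four, and a noisier evaluator (larger $\sigma_A^2 + \sigma_B^2$) raises it proportionally.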

The argument above works as long as the evaluator is not biased in a way that causes it to systematically favor the worse variant.
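To illustrate why this caveat matters, here is a minimal sketch under strong simplifying assumptions: true scores are binary successes, and the evaluator's bias is modeled as a constant offset added to every score for Agent B. Real evaluator biases (e.g. toward longer responses) are more complex, and the function names and parameter values here are hypothetical.

```python
import random
from statistics import fmean


def mean_evaluator_score(mu, bias, noise_sd, n, rng):
    """Mean evaluator score over n outputs: binary true score + constant bias + Gaussian noise."""
    scores = []
    for _ in range(n):
        true_score = 1.0 if rng.random() < mu else 0.0  # success with probability mu
        scores.append(true_score + bias + rng.gauss(0.0, noise_sd))
    return fmean(scores)


def evaluator_prefers_a(bias_b, n, mu_a=0.6, mu_b=0.3, noise_sd=1.0, seed=0):
    """True if the evaluator's mean score ranks Agent A (the truly better agent) above Agent B.

    bias_b is a hypothetical systematic bonus the evaluator gives to Agent B's outputs.
    """
    rng = random.Random(seed)
    mean_a = mean_evaluator_score(mu_a, 0.0, noise_sd, n, rng)
    mean_b = mean_evaluator_score(mu_b, bias_b, noise_sd, n, rng)
    return mean_a > mean_b


if __name__ == "__main__":
    print("unbiased:", evaluator_prefers_a(bias_b=0.0, n=5000))
    print("biased toward B:", evaluator_prefers_a(bias_b=0.5, n=5000))
```

In this toy setup, zero-mean noise averages out as `n` grows, but a bias larger than the true gap does not: more samples only make the evaluator more confidently wrong.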

Formal argument and failure modes

This section formalizes the claim that the empirical evaluator means can recover the true ordering of agents given enough samples, and discusses when this can fail.

Let $x$ be an agent output, and let $S(x)$ be the true score for $x$, meaning the thing we’d ideally like to measure. Each agent gives rise to a distribution over trajectories $x$, which can differ in arbitrary ways. Maybe A tends to produce long, detailed responses, while B tends to be short and crisp. Or maybe A is more prone to hallucination than B. The average scores $\mu_A$ and $\mu_B$ are the expected true scores under each agent's distribution:

$$\mu_A = \mathbb{E}_A[S(x)], \qquad \mu_B = \mathbb{E}_B[S(x)].$$

How this works in real benchmarks

To see this phenomenon in action with real evaluation data, we ran LLM-generated evaluators on five tasks: Gridworld, Wordle, Data Extraction (NER), Data Extraction (NDA), and Business Management. We evaluated 25 agent variants (different prompts and models) per task and 50 test traces per variant.

Each environment comes with a target metric that is computed programmatically and serves as the ground truth for any given trace: success or failure for Gridworld and Wordle, exact match against gold annotations for Data Extraction (NER), F1 score against gold annotations for Data Extraction (NDA), and number of subtasks completed for Business Management.

For each task we compute both correlations introduced above: the output-level correlation between an evaluator’s score on a single trace and that trace’s ground truth (holding the variant fixed), and the agent-level correlation between an evaluator’s mean over a variant’s traces and the variant’s ground-truth mean. The agent-level correlation exceeds the output-level correlation in every environment, often by a wide margin.

Pearson correlation between evaluator score and ground truth, at two granularities, across five environments. The output-level correlation is consistently weaker than the agent-level correlation — in every environment, the evaluator is more reliable for ranking agent variants than for judging individual outputs.

For example, Wordle’s output-level correlation is 0.41 — the evaluator is only modestly better than random at predicting which of two Wordle traces is better, holding the variant fixed. Its agent-level correlation is 0.96 — averaging across many traces per variant compresses the per-output noise into a much stronger signal of agent quality.

To further quantify the evaluator’s alignment with agent quality, we ask: when the evaluator compares two variants, how often does it pick the better one? That’s the evaluator’s pairwise win rate: the fraction of variant pairs where the evaluator’s mean ordering agrees with the ground-truth ordering. A win rate of 1.0 means perfect ranking; 0.5 is no better than random. We computed the win rate across all $\binom{25}{2} = 300$ variant pairs in each environment.

Pairwise win rate of the evaluator on each environment: out of all variant pairs, the fraction where the evaluator-mean ordering matches the ground-truth-mean ordering, both computed over all available traces per variant. Equivalent to the area under the curve (AUC) of an "is variant A better than B" classifier; random selection scores 0.5 (dashed line), perfect selection 1.0.

Every evaluator we tested clears 0.5 (random) by a comfortable margin. Gridworld is nearly perfect: the evaluator picks the better of two variants 97% of the time. Wordle and Data Extraction (NER) are also very high, at 0.87 and 0.82, respectively. The same Wordle evaluator that you wouldn’t trust to gate individual outputs in production can reliably tell you which Wordle agent to ship. NDA and Business Management both have a win rate of 0.64 — less reliable than the others, but still meaningfully better than coin-flipping. Used as a selection signal, all of these evaluators will move you in the right direction on average, despite their very low utility for judging individual traces.
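The pairwise win rate described above can be sketched as follows. This is an illustrative implementation, not the code used for the benchmarks. The function name is hypothetical, and the tie handling (skipping pairs tied on ground truth, counting evaluator ties as half a win) is an assumption, since the article does not specify it.

```python
from itertools import combinations


def pairwise_win_rate(evaluator_means, ground_truth_means):
    """Fraction of variant pairs whose evaluator-mean ordering matches the ground-truth-mean ordering.

    Both arguments are sequences of per-variant means, aligned by index.
    """
    if len(evaluator_means) != len(ground_truth_means):
        raise ValueError("inputs must have one entry per variant")
    wins = 0.0
    total = 0
    for i, j in combinations(range(len(evaluator_means)), 2):
        gt_diff = ground_truth_means[i] - ground_truth_means[j]
        if gt_diff == 0:
            continue  # assumption: pairs with no ground-truth winner are skipped
        ev_diff = evaluator_means[i] - evaluator_means[j]
        total += 1
        if ev_diff == 0:
            wins += 0.5  # assumption: an evaluator tie counts as a coin flip
        elif (ev_diff > 0) == (gt_diff > 0):
            wins += 1.0
    if total == 0:
        raise ValueError("no comparable variant pairs")
    return wins / total


if __name__ == "__main__":
    # Three toy variants: the evaluator gets two of the three pairs right.
    print(pairwise_win_rate([0.9, 0.5, 0.6], [0.8, 0.4, 0.2]))
```

With 25 variants, the loop visits the 300 pairs mentioned in the text.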

Benchmark setup and methodology

Agent and evaluator generation. Each task’s 25 agent variants differ in their system prompt and the underlying model they use. Both the agents and evaluators were LLM-generated: the LLM was told that each evaluator should be as highly correlated as possible with the task’s ground-truth metric, and was given access to a training set when generating the evaluators. We used small models and didn’t aim to optimize either the agents or the evaluators — the point isn’t to show how well an LLM can perform at these tasks, but to illustrate how even noisy evaluators can be useful for agent selection.

Pearson vs Spearman. The numbers above are Pearson correlations, which capture linear association — a good fit for the output-level use case (e.g. guardrails), where the magnitudes of evaluator scores feed directly into downstream decisions. Spearman correlations capture rank association and are a more natural fit for the variant selection use case, where what matters is whether the evaluator’s ranking matches the ground-truth ranking. The two metrics agree qualitatively in every environment we measured (both increase from output-level to agent-level, both consistent with the central claim) and are mathematically equivalent in the special case that both variables are binary.

| Task | Output-level: Pearson / Spearman correlation | Agent-level: Pearson / Spearman correlation |
| --- | --- | --- |
| Gridworld | 0.81 / 0.81 | 0.97 / 0.98 |
| Wordle | 0.41 / 0.38 | 0.96 / 0.88 |
| Data Extraction (NER) | 0.08 / 0.27 | 0.75 / 0.79 |
| Data Extraction (NDA) | 0.28 / 0.24 | 0.43 / 0.38 |
| Business Management | 0.22 / 0.19 | 0.50 / 0.45 |

The largest divergence between the two metrics is on the Data Extraction (NER) task’s output-level cell, where ground truth is binary (exact match) and the Pearson value (0.08) is low because the evaluator’s discrete partial-credit scores compress more than they correlate linearly with the binary outcome.
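The relationship between the two metrics can be checked with a small sketch. It computes Spearman correlation as the Pearson correlation of average ranks (the standard tie-aware definition) and compares the two on binary and non-binary toy data. The helper names and the toy data are illustrative and are not taken from the benchmarks.

```python
import math
from statistics import fmean


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    mx, my = fmean(xs), fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1  # average 1-based rank of the tied block
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


if __name__ == "__main__":
    binary_x = [0, 1, 1, 0, 1, 0, 0, 1]
    binary_y = [0, 1, 0, 0, 1, 1, 0, 1]
    print(pearson(binary_x, binary_y), spearman(binary_x, binary_y))
    # A monotone but non-linear relationship, where the two metrics differ.
    print(pearson([1, 2, 3, 4], [1, 2, 3, 100]), spearman([1, 2, 3, 4], [1, 2, 3, 100]))
```

The binary case agrees because ranking a two-valued variable is an increasing affine relabeling of its values, and Pearson correlation is unchanged by such relabelings.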

Within-variant decomposition by de-meaning. The “output-level” bars in the chart above are not raw per-(variant, trace) correlations; those would mix two distinct effects: a between-variant effect (the evaluator agrees with ground truth on which variants are good on average) and a within-variant effect (the evaluator agrees with ground truth on which traces within a single variant are good). To isolate the within-variant effect, for each variant we subtract that variant’s mean from both the evaluator score and the ground-truth score, then compute the correlation across all (variant, trace) cells on the de-meaned values. What’s left is the part of the evaluator-vs-ground-truth signal that is not explained by knowing which variant produced the output — i.e., the signal an evaluator would need to reliably grade individual outputs in production.
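The de-meaning procedure can be sketched as follows, alongside the agent-level correlation for contrast. This is a minimal illustration using Pearson correlation; the function names and the dictionary layout (variant name mapped to a list of per-trace scores, aligned between evaluator and ground truth) are assumptions made for the example, not the article's actual code.

```python
import math
from statistics import fmean


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    mx, my = fmean(xs), fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def output_level_correlation(eval_scores, truth_scores):
    """Within-variant correlation: subtract each variant's mean from both scores,
    then correlate the residuals pooled across all (variant, trace) cells."""
    eval_resid, truth_resid = [], []
    for variant, ev in eval_scores.items():
        gt = truth_scores[variant]
        ev_mean, gt_mean = fmean(ev), fmean(gt)
        eval_resid.extend(e - ev_mean for e in ev)
        truth_resid.extend(g - gt_mean for g in gt)
    return pearson(eval_resid, truth_resid)


def agent_level_correlation(eval_scores, truth_scores):
    """Between-variant correlation: correlate per-variant means."""
    variants = list(eval_scores)
    eval_means = [fmean(eval_scores[v]) for v in variants]
    truth_means = [fmean(truth_scores[v]) for v in variants]
    return pearson(eval_means, truth_means)


if __name__ == "__main__":
    # Toy data: the evaluator tracks variant quality but not individual traces.
    truth = {"a": [0, 0, 1], "b": [0, 1, 1], "c": [1, 1, 1]}
    evals = {"a": [0.3, 0.1, 0.2], "b": [0.4, 0.5, 0.3], "c": [0.6, 0.7, 0.5]}
    print(output_level_correlation(evals, truth))
    print(agent_level_correlation(evals, truth))
```

The toy data is constructed so that the evaluator carries no within-variant signal while its per-variant means line up with the ground-truth means, which is the pattern the decomposition is designed to expose.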

The Takeaway

Noisy evaluators can’t reliably judge individual agent outputs, but they can reliably distinguish overall agent performance, because the per-output noise averages out across many samples.

Per-output unreliability is what limits noisy evaluators for typical production tasks (e.g. guardrails), all of which hinge on trusting the verdict on any specific output. Reliability at the aggregate level is what makes them useful offline: on every environment we tested, the evaluator picked the better of two variants substantially more often than not. Used as a selection signal, even noisy evaluators can help you ship better-performing agents today and improve them over time.
