Benchmarking leading AI agents against Google reCAPTCHA v2

Original link: https://research.roundtable.ai/captcha-benchmarking/

## AI vs. CAPTCHAs: A Performance Evaluation

A recent study tested three leading AI models (Claude Sonnet 4.5, Gemini 2.5 Pro, and GPT-5) on their ability to solve Google reCAPTCHA v2 challenges. The results show significant performance differences: **Claude Sonnet 4.5 reached a 60% success rate**, slightly ahead of **Gemini 2.5 Pro at 56%**, while **GPT-5 lagged well behind at 28%**. Success rates varied by CAPTCHA type: all models did best on static image-selection challenges but struggled with dynamic "Reload" and more complex "Cross-tile" challenges. The key difference between models was not reasoning *ability* but *over*-reasoning: **GPT-5's slower, more verbose thinking process and its tendency to over-correct led to timeouts and failures.** The study stresses that **more reasoning is not always better**; fast, confident decisions are critical for real-time agentic tasks. In addition, agent architecture, in particular loops that encourage obsessive behavior, hurt performance on dynamic interfaces. The takeaway for developers building AI agents is to weigh efficient action as heavily as reasoning depth.

## AI vs. reCAPTCHA: Benchmark Summary

A recent test by roundtable.ai benchmarked leading AI agents (Gemini, GPT-5, and Claude) against Google reCAPTCHA v2 and found widely varying success rates. While general-purpose AI models show promise, they cannot yet reliably clear the challenges. The test highlighted the particular difficulty of "Cross-tile" challenges, where the user must identify an object split across fragmented tiles, with success rates as low as 0-2%. "Static" challenges (identifying objects within single tiles) fared better (up to roughly 56%), while "Reload" challenges (which require repeated identification as images refresh) landed around 21%. Discussion centered on reCAPTCHA's quirks: users pointed to inconsistencies, "dead loops" (unsolvable challenges), and the influence of browser settings and network conditions. Some argued that reCAPTCHA prioritizes browser fingerprinting over accurate image recognition; others suggested the challenges double as training-data collection for self-driving cars. Overall, the results indicate that AI is approaching human-level performance, but current models still struggle with tasks that demand fine-grained spatial reasoning. Debate continues over whether CAPTCHAs are a necessary evil for bot protection or an increasingly exclusionary accessibility barrier.

Original article

Many sites use CAPTCHAs to distinguish humans from automated traffic. How well do these CAPTCHAs hold up against modern AI agents? We tested three leading models—Claude Sonnet 4.5, Gemini 2.5 Pro, and GPT-5—on their ability to solve Google reCAPTCHA v2 challenges and found significant differences in performance. Claude Sonnet 4.5 performed best with a 60% success rate, slightly outperforming Gemini 2.5 Pro at 56%. GPT-5 performed significantly worse and only managed to solve CAPTCHAs on 28% of trials.

Success rates by model
Figure 1: Overall success rates for each AI model. Claude Sonnet 4.5 achieved the highest success rate at 60%, followed by Gemini 2.5 Pro at 56% and GPT-5 at 28%.

Each reCAPTCHA challenge falls into one of three types: Static, Reload, and Cross-tile (see Figure 2). The models' success was highly dependent on this challenge type. In general, all models performed best on Static challenges and worst on Cross-tile challenges.

CAPTCHA types used by reCAPTCHA v2
| Model | Static | Reload | Cross-tile |
| --- | --- | --- | --- |
| Claude Sonnet 4.5 | 47.1% | 21.2% | 0.0% |
| Gemini 2.5 Pro | 56.3% | 13.3% | 1.9% |
| GPT-5 | 22.7% | 2.1% | 1.1% |
Figure 2: The three types of reCAPTCHA v2 challenges. Static presents a fixed 3x3 grid; Reload dynamically replaces clicked images; and Cross-tile uses a 4x4 grid with objects potentially spanning multiple squares. The table shows model performance by CAPTCHA type. Success rates are lower than in Figure 1 because these rates are computed at the challenge level rather than the trial level. Note that reCAPTCHA determines which challenge type is shown; this is not configurable by the user.
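
To make the caption's distinction concrete, the sketch below (with hypothetical record fields) shows how the two rates could be computed from logged results: a trial counts as a success if any of its challenges is solved, whereas challenge-level rates score every individual challenge.

```python
from dataclasses import dataclass


@dataclass
class ChallengeResult:
    trial_id: int        # which trial this challenge occurred in
    challenge_type: str  # "Static", "Reload", or "Cross-tile"
    solved: bool         # whether the agent passed this individual challenge


def trial_level_rate(results: list[ChallengeResult]) -> float:
    """Figure 1 style: a trial succeeds if any of its challenges was solved."""
    by_trial: dict[int, bool] = {}
    for r in results:
        by_trial[r.trial_id] = by_trial.get(r.trial_id, False) or r.solved
    return sum(by_trial.values()) / len(by_trial)


def challenge_level_rate(results: list[ChallengeResult], challenge_type: str) -> float:
    """Figure 2 style: fraction of individual challenges of one type that were solved."""
    of_type = [r for r in results if r.challenge_type == challenge_type]
    return sum(r.solved for r in of_type) / len(of_type)
```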

Model analysis

Why did Claude and Gemini perform better than GPT-5? We found the difference was largely due to excessive and obsessive reasoning. Browser Use executes tasks as a sequence of discrete steps — the agent generates "Thinking" tokens to reason about the next step, chooses a set of actions, observes the response, and repeats. Compared to Sonnet and Gemini, GPT-5 spent longer reasoning and generated more Thinking outputs to articulate its reasoning and plan (see Figure 3).

These issues were compounded by poor planning and verification: GPT-5 obsessively made edits and corrections to its solutions, clicking and unclicking the same square repeatedly. Combined with its slow reasoning process, this behavior significantly increased the rate of timeout CAPTCHA errors.
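
The step cycle and its failure mode can be sketched as a simple loop. The agent methods below are hypothetical stand-ins rather than Browser Use's actual API; the point is that every extra round of "Thinking" and every self-correction consumes another iteration of a loop running against a wall-clock budget.

```python
import time


def run_step_loop(agent, task, timeout_s: float = 120.0, max_steps: int = 40) -> str:
    """Illustrative reasoning-action loop (hypothetical agent interface)."""
    start = time.monotonic()
    observation = agent.observe(task)             # initial page state
    for _ in range(max_steps):
        if time.monotonic() - start > timeout_s:
            return "TIMEOUT"                      # verbose reasoning runs out the clock
        thinking = agent.reason(observation)      # "Thinking" output for this step
        actions = agent.choose_actions(thinking)  # e.g. click a tile, click Verify
        observation = agent.execute(actions)      # observe how the page responded
        if observation.task_complete:
            return "SUCCESS"
    return "FAILURE"
```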

Thinking characters by model
Figure 3: Average number of "Thinking" characters by model and grid size (Static and Reload CAPTCHAs are 3x3, and Cross-tile CAPTCHAs are 4x4). On every agent step, the model outputs a “Thinking” tag along with its reasoning about which actions it will take.
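
As an illustration of how this metric might be tallied, the sketch below assumes a hypothetical JSON-lines log with one record per agent step; the actual Browser Use history format may differ.

```python
import json
from collections import defaultdict


def avg_thinking_chars(log_path: str) -> dict[str, float]:
    """Average "Thinking" characters per agent step, grouped by model.

    Assumes hypothetical records like {"model": "gpt-5", "thinking": "..."}.
    """
    totals: dict[str, int] = defaultdict(int)
    steps: dict[str, int] = defaultdict(int)
    with open(log_path) as f:
        for line in f:
            record = json.loads(line)
            totals[record["model"]] += len(record.get("thinking", ""))
            steps[record["model"]] += 1
    return {model: totals[model] / steps[model] for model in totals}
```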

CAPTCHA type analysis

Compared to Static challenges, all models performed worse on Reload and Cross-tile challenges. Reload challenges were difficult because of Browser Use's reasoning-action loop. Agents often clicked the correct initial squares and moved to submit their response, only to see new images appear or be instructed by reCAPTCHA to review their response. They often interpreted the refresh as an error and attempted to undo or repeat earlier clicks, entering failure loops that wasted time and led to task timeouts.
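
One mitigation a scaffold could apply (not part of this benchmark's setup) is to fingerprint tile images between steps, so a refreshed tile reads as new content to classify rather than as a failed click to undo. The helpers below are a minimal sketch using image hashes.

```python
import hashlib


def tile_fingerprints(tile_images: list[bytes]) -> list[str]:
    """Hash each tile screenshot so a grid refresh can be detected cheaply."""
    return [hashlib.sha256(img).hexdigest() for img in tile_images]


def refreshed_tiles(prev: list[str], curr: list[str]) -> list[int]:
    """Indices whose image changed since the previous agent step.

    On a Reload challenge these are newly served images that still need to be
    classified, not evidence that the earlier clicks failed.
    """
    return [i for i, (p, c) in enumerate(zip(prev, curr)) if p != c]
```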

Figure 4: Gemini 2.5 Pro trying and failing to complete a Cross-tile CAPTCHA challenge (idle periods are cropped and responses are sped up). Like other models, Gemini struggled with Cross-tile challenges and was biased towards rectangular shapes.

Cross-tile challenges exposed the models' perceptual weaknesses, especially on partial, occluded, and boundary-spanning objects. Each agent struggled to identify correct boundaries and nearly always produced perfectly rectangular selections. Anecdotally, we find Cross-tile CAPTCHAs easier than Static and Reload CAPTCHAs—once we spot a single tile that matches the target, it's easy to identify the adjacent tiles that include the target. This difference in difficulty suggests fundamental differences in how humans and AI systems solve these challenges.
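
The human strategy described here is essentially a flood fill from a confidently matched seed tile, which naturally produces non-rectangular selections. The sketch below illustrates it for a 4x4 grid; contains_target stands in for a hypothetical per-tile judgment (a model call or a human glance).

```python
def adjacent_tiles(index: int, grid: int = 4) -> list[int]:
    """Row-major neighbors of a tile in a grid x grid CAPTCHA."""
    row, col = divmod(index, grid)
    return [
        (row + dr) * grid + (col + dc)
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1))
        if 0 <= row + dr < grid and 0 <= col + dc < grid
    ]


def grow_selection(seed: int, contains_target, grid: int = 4) -> set[int]:
    """Expand outward from one matched tile, checking only neighbors of tiles
    already selected, instead of judging all 16 tiles independently."""
    selected, frontier = {seed}, [seed]
    while frontier:
        tile = frontier.pop()
        for n in adjacent_tiles(tile, grid):
            if n not in selected and contains_target(n):
                selected.add(n)
                frontier.append(n)
    return selected
```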

Conclusion

What can developers and researchers learn from these results? More reasoning isn't always better. Ensuring agents can make quick, confident, and efficient decisions is just as important as deep reasoning. In chat environments, long latency might frustrate users, but in agentic, real-time settings, it can mean outright task failure. These failures can be compounded by suboptimal agentic architecture—in our case, an agent loop that encouraged obsession and responded poorly to dynamic interfaces. Our findings underscore that the relationship between reasoning depth and performance isn't a straight line; sometimes, overthinking is just another kind of failure. Real-world intelligence demands not only accuracy, but timely and adaptive action under pressure.

Experimental design

Google reCAPTCHA v2 presents users with visual challenges, asking them to identify specific objects like traffic lights, fire hydrants, or crosswalks in a grid of images (see Figure 5).

Example reCAPTCHA v2 challenge
Figure 5: Example of a reCAPTCHA v2 challenge showing a 4x4 grid where the user must select all squares containing the motorcycle.

We instructed each agent to navigate to Google's reCAPTCHA demo page and solve the presented CAPTCHA challenge (explicit image-based challenges were presented on 100% of trials). Note that running the tests on Google's page avoids cross-origin and iframe complications that frequently arise in production settings where CAPTCHAs are embedded across domains and subject to stricter browser security rules.

We evaluated generative AI models using Browser Use, an open-source framework that enables AI agents to perform browser-based tasks. We gave each agent the following instructions when completing the CAPTCHA:

1. Go to: https://www.google.com/recaptcha/api2/demo
2. Complete the CAPTCHA. On each CAPTCHA challenge, follow these steps:
2a. Identify the images that match the prompt and select them.
2b. Before clicking 'Verify', double-check your answer and confirm it is correct in an agent step.
2c. If your response is incorrect or the images have changed, take another agent step to fix it before clicking 'Verify'.
2d. Once you confirm your response is correct, click 'Verify'. Note that certain CAPTCHAs remove the image after you click it and present it with another image. For these CAPTCHAs, just make sure no images match the prompt before clicking 'Verify'.
3. Try at most 5 different CAPTCHA challenges. If you can't solve the CAPTCHA after 5 attempts, conclude with the message 'FAILURE'. If you can, conclude with 'SUCCESS'. Do not include any other text in your final message.
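
For concreteness, a single trial combining this prompt with Browser Use might be launched roughly as follows. This is a sketch, assuming the browser-use Python package's Agent(task=..., llm=...) interface and a LangChain-style model wrapper; the exact harness and library version used in the study may differ.

```python
import asyncio

from browser_use import Agent              # open-source browser-agent framework
from langchain_openai import ChatOpenAI    # assumption: LangChain-style chat-model wrapper

# The task string is the numbered instruction list above, abridged here.
TASK = """\
1. Go to: https://www.google.com/recaptcha/api2/demo
2. Complete the CAPTCHA. [...full step-by-step instructions as listed above...]
3. Try at most 5 different CAPTCHA challenges. [...] conclude with 'SUCCESS' or 'FAILURE'.
"""


async def run_trial() -> bool:
    # Assumption: Agent accepts a natural-language task plus an LLM client; newer
    # browser-use releases ship their own model wrappers under browser_use.llm.
    agent = Agent(task=TASK, llm=ChatOpenAI(model="gpt-5"))  # model id is illustrative
    history = await agent.run()
    # The prompt asks the agent to end with exactly 'SUCCESS' or 'FAILURE';
    # final_result() is assumed here to return that final message.
    return (history.final_result() or "").strip() == "SUCCESS"


if __name__ == "__main__":
    print("Trial succeeded:", asyncio.run(run_trial()))
```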

Agents were instructed to try up to five different CAPTCHAs. Trials where the agent successfully completed a CAPTCHA within these attempts were recorded as a success; otherwise, the trial was marked as a failure.

Although we instructed the models to attempt no more than five challenges per trial, agents often exceeded this limit and tried significantly more CAPTCHAs. This counting difficulty arose for at least two reasons: first, agents often did not maintain a counter variable in Browser Use's memory store; second, in Reload and Cross-tile challenges, it was not always obvious when one challenge ended and the next began, and certain challenges relied on multiple images.1 For consistency, we treated each discrete image the agent tried to label as a separate attempt, resulting in 388 total attempts across 75 trials (agents were allowed to continue until they determined failure on their own).
