Step 3.7 Flash

Original link: https://static.stepfun.com/blog/step-3.7-flash/

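Step 3.7 Flash is an agentic foundation model that reaches a high level of visual performance through test-time scaling rather than relying on parameter count alone. By calling dedicated tools, the model compensates for its smaller size and matches the performance of models five times larger.

Key features include:

* **Visual search:** Integrating external search improves recognition, giving performance comparable to that of much larger models.
* **Python integration:** A unified code interface (zooming, cropping, pixel-level operations) for complex, high-resolution reasoning tasks.
* **Graphical user interface (GUI) operation:** Robust long-horizon control of smartphone apps, outperforming larger models on the Android Daily benchmark.

A major breakthrough of the model is its **emergent compositional generalization**. Step 3.7 Flash can autonomously combine visual and non-visual tools (for example, writing code first and then using the GUI to verify its output) without explicit training. This ability to iterate and self-correct across domains marks a step forward in agentic reasoning, allowing the model to carry out complex real-world tasks beyond standard text interaction.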

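The recently released **Step-3.7 Flash** model has drawn wide attention on Hacker News. User reports indicate strong benchmark results, and the Q4_K_S GGUF build reaches an impressive 35 tokens per second (tps) on M1 Mac hardware.

Early adopters praised the model's capabilities, especially visual recognition and reasoning that go beyond simple OCR. Many consider it comparable to or better than mainstream models such as Qwen 3.6.

Despite the strong technical reviews, some users pointed to barriers to use. Criticism centered on the "half-finished" English translation of the official website and an experience that is unfriendly to non-Chinese speakers. The brand name "StepFun" also prompted a brief, lighthearted discussion, though most participants did not mind it. Overall, the community sees Step-3.7 as a powerful, high-performance option for local LLM users, provided they can get past the current language barrier.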

Original text

We establish Step 3.7 Flash as an agentic foundation model with vision input support, shifting perception and recognition from parametric capacity to test-time scaling with visual tools. As the first of these, we strengthen its ability to invoke the Visual Search tool, thereby compensating for the parametric knowledge deficiencies caused by Step 3.7 Flash's limited model size. As shown in the table below, on visual recognition tasks, Step 3.7 Flash with Visual Search achieves performance on par with models five times its size.

Visual Recognition with Visual Search

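Model tiers in the source header: Flash Level, Pro Level

| Benchmarks | Step 3.7 Flash | Kimi K2.6 | GLM 5V Turbo | GPT 5.5 |
|---|---|---|---|---|
| SimpleVQA | 79.16% | 78.24%* | 78.20% | 79.11%* |
| WorldVQA | 58.10% | 55.98%* | 47.81%* | 54.58%* |
| BC-VL | 58.96% | 57.12%* | 51.90%* | 65.68%* |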

For a broader set of challenging vision tasks that demand fine-grained perception over high-resolution images or visual reasoning capabilities—such as V*, HR-Bench, and VisualProbe—we grant the model an enriched action space to interact with images, including cropping, zooming in and out, and drawing pixels or bounding boxes. These tools are implemented as a unified code interface, commonly referred to in the field as the Python tool. With Python, Step 3.7 Flash achieves exceptionally strong performance on these benchmarks.
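The image actions named above (cropping, zooming, drawing bounding boxes) can be pictured as a few small functions exposed through one code interface. The following is a minimal sketch only, not the model's actual tool: the function names `crop`, `zoom`, and `draw_box`, and the representation of an image as a list of pixel rows, are assumptions made for illustration.

```python
from typing import List

# Hypothetical representation: a grayscale image as rows of pixel values.
Image = List[List[int]]


def crop(img: Image, left: int, top: int, right: int, bottom: int) -> Image:
    """Return the region covering columns [left, right) and rows [top, bottom)."""
    return [row[left:right] for row in img[top:bottom]]


def zoom(img: Image, factor: int) -> Image:
    """Enlarge an image by an integer factor using nearest-neighbour repetition."""
    out: Image = []
    for row in img:
        wide = [px for px in row for _ in range(factor)]  # repeat each pixel horizontally
        out.extend(list(wide) for _ in range(factor))     # repeat each row vertically
    return out


def draw_box(img: Image, left: int, top: int, right: int, bottom: int,
             value: int = 255) -> Image:
    """Return a copy with a 1-pixel outline around columns [left, right), rows [top, bottom)."""
    out = [list(row) for row in img]
    for x in range(left, right):
        out[top][x] = value
        out[bottom - 1][x] = value
    for y in range(top, bottom):
        out[y][left] = value
        out[y][right - 1] = value
    return out


# Toy 8x6 image whose pixel value encodes its position (x + 10 * y).
img = [[x + 10 * y for x in range(8)] for y in range(6)]
region = crop(img, 2, 1, 6, 4)      # pick out a 4x3 area of interest
detail = zoom(region, 2)            # enlarge it to 8x6 for a closer look
marked = draw_box(img, 2, 1, 6, 4)  # outline the same area on a copy of the original
```

In a real setup these operations would more likely come from an imaging library; the point of the sketch is that the actions compose, so a model can crop a region, enlarge it, and mark it within a single code call.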

Visual Perception with Python Tool

Model tiers in the source header: Flash Level, Pro Level

| Benchmarks | Step 3.7 Flash | Kimi K2.6 | GLM 5V Turbo | Gemini 3 Flash |
|---|---|---|---|---|
| V* | 95.29% | 96.90% | 89.00% | 96.30% |
| HR-Bench 4K | 89.13% | 91.25%* | 84.62% | 94.50% |
| HR-Bench 8K | 86.34% | 90.13%* | 83.12% | 94.80% |
| VisualProbe | 65.05% | 64.47%* | 53.01% | 69.90% |

One particularly interesting finding is the emergent ability of compositional generalization across visual and other tools. During testing, Step 3.7 Flash seamlessly combined visual tools with non-visual ones to accomplish complex tasks, despite never having been explicitly guided toward such compositional tool use during training.

Figure: Visual Reasoning with Python Tool

Figure: Compositional Usage across Visual and Non-visual Tools

Operating graphical user interfaces (GUIs) is another foundational visual capability for an agentic model: many real-world tasks live beyond the chatbox and the CLI and require the agent to see, click, and verify. We extend Step 3.7 Flash with GUI operation, in particular for the Phone-use stack, so that it can complete long-horizon tasks across multiple apps. On the Android Daily benchmark, Step 3.7 Flash improves substantially over last year's Step-GUI in stability, robustness, and long-horizon completion, and it scores ahead of other, larger models.

Figure: Score of Android Daily Benchmark

The same compositional pattern we observed across visual tools also surfaces here: in the following case, after writing a piece of frontend code, the model autonomously turned to the GUI to test the page it had just produced — inspecting the rendered output, exercising interactive elements, and iterating on its own code based on what it saw. Again, this code-and-GUI compositional behavior was never explicitly demonstrated or rewarded during training, yet emerges robustly in test-time use.

Figure: GUI Operation
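The code-and-GUI behavior described above can be read as a simple loop: write code, look at the rendered result, and revise based on what was seen. The sketch below illustrates that pattern only and does not describe how Step 3.7 Flash works internally. Every name in it (`Observation`, `build_and_verify`, `toy_writer`, `toy_inspector`) is hypothetical, and the writer and inspector are toy stand-ins for the model and for real GUI inspection.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Observation:
    """What a (hypothetical) GUI check reports about the rendered page."""
    problems: List[str]


def build_and_verify(
    write_code: Callable[[List[str]], str],
    inspect_gui: Callable[[str], Observation],
    max_rounds: int = 3,
) -> Tuple[str, int, List[str]]:
    """Alternate between writing code and checking it through the GUI.

    Returns (final_code, rounds_used, remaining_problems).
    """
    feedback: List[str] = []
    code = ""
    for round_no in range(1, max_rounds + 1):
        code = write_code(feedback)            # non-visual step: produce or revise code
        feedback = inspect_gui(code).problems  # visual step: examine the rendered result
        if not feedback:
            return code, round_no, []
    return code, max_rounds, feedback


def toy_writer(feedback: List[str]) -> str:
    # Toy stand-in for the model: adds a button label once the GUI check complains.
    return "<button>Save</button>" if feedback else "<button></button>"


def toy_inspector(code: str) -> Observation:
    # Toy stand-in for GUI inspection: flags a button with no visible label.
    return Observation(["button has no label"] if "<button></button>" in code else [])


code, rounds, problems = build_and_verify(toy_writer, toy_inspector)
print(code, rounds, problems)
```

With these toy stand-ins, the first draft fails the GUI check and the second passes, so the loop stops after two rounds. The `max_rounds` cap is a design choice for the sketch, so that a check that never passes cannot loop forever.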
