We establish Step 3.7 Flash as an agentic foundation model with vision input support, shifting perception and recognition from parametric capacity to test-time scaling with visual tools. As a first step, we strengthen its ability to invoke the Visual Search tool, which compensates for the gaps in parametric knowledge caused by Step 3.7 Flash's limited model size. As shown in the table below, on visual recognition tasks, Step 3.7 Flash with Visual Search performs on par with models five times its size: it posts the highest score on SimpleVQA and WorldVQA and trails only GPT 5.5 on BC-VL.
Visual Recognition with Visual Search
| | Flash Level | Pro Level | | |
|---|---|---|---|---|
| Benchmarks | Step 3.7 Flash | Kimi K2.6 | GLM 5V Turbo | GPT 5.5 |
| SimpleVQA | 79.16% | 78.24%* | 78.20% | 79.11%* |
| WorldVQA | 58.10% | 55.98%* | 47.81%* | 54.58%* |
| BC-VL | 58.96% | 57.12%* | 51.90%* | 65.68%* |
For a broader set of challenging vision tasks that demand fine-grained perception over high-resolution images or visual reasoning (such as V*, HR-Bench, and VisualProbe), we give the model an enriched action space for interacting with images, including cropping, zooming in and out, and drawing pixels or bounding boxes. These tools are exposed through a unified code interface, commonly referred to in the field as the Python tool. With the Python tool, Step 3.7 Flash outperforms GLM 5V Turbo on all four benchmarks below and edges out Kimi K2.6 on VisualProbe, though Kimi K2.6 leads on V* and Gemini 3 Flash leads on HR-Bench and VisualProbe.
Visual Perception with Python Tool
| | Flash Level | Pro Level | | |
|---|---|---|---|---|
| Benchmarks | Step 3.7 Flash | Kimi K2.6 | GLM 5V Turbo | Gemini 3 Flash |
| V* | 95.29% | 96.90% | 89.00% | 96.30% |
| HR-Bench 4K | 89.13% | 91.25%* | 84.62% | 94.50% |
| HR-Bench 8K | 86.34% | 90.13%* | 83.12% | 94.80% |
| VisualProbe | 65.05% | 64.47%* | 53.01% | 69.90% |
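To make the image action space described above more concrete, here is a minimal sketch of what crop, zoom, and bounding-box actions behind a single code entry point might look like. It is an illustration only, not the actual tool implementation: the names `crop`, `zoom`, `draw_box`, `run_action`, and `TOOLS` are hypothetical, and the image is modelled as a plain nested list of grayscale values so the example runs without any imaging library. A real setup would more likely operate on decoded image files.

```python
from typing import Callable, Dict, List, Tuple

# Assumption: a grayscale image is a list of rows of 0-255 pixel values.
Image = List[List[int]]
# Assumption: boxes are (left, top, right, bottom), right/bottom exclusive.
Box = Tuple[int, int, int, int]


def crop(img: Image, box: Box) -> Image:
    """Return the sub-image covered by the box."""
    left, top, right, bottom = box
    return [row[left:right] for row in img[top:bottom]]


def zoom(img: Image, factor: int) -> Image:
    """Nearest-neighbour upscale: repeat every pixel `factor` times per axis."""
    if factor < 1:
        raise ValueError("factor must be a positive integer")
    return [
        [px for px in row for _ in range(factor)]
        for row in img
        for _ in range(factor)
    ]


def draw_box(img: Image, box: Box, value: int = 255) -> Image:
    """Return a copy of the image with the outline of the box drawn on it."""
    left, top, right, bottom = box
    out = [row[:] for row in img]  # copy so the input image is not modified
    for x in range(left, right):
        out[top][x] = value
        out[bottom - 1][x] = value
    for y in range(top, bottom):
        out[y][left] = value
        out[y][right - 1] = value
    return out


# Hypothetical registry: one entry point that dispatches named actions.
TOOLS: Dict[str, Callable[..., Image]] = {
    "crop": crop,
    "zoom": zoom,
    "draw_box": draw_box,
}


def run_action(img: Image, name: str, **kwargs) -> Image:
    """Apply one named action to the image and return the resulting image."""
    if name not in TOOLS:
        raise KeyError(f"unknown action: {name}")
    return TOOLS[name](img, **kwargs)


# Example: inspect a small region of an 8x8 test image at 2x magnification.
image = [[x + 10 * y for x in range(8)] for y in range(8)]
region = run_action(image, "crop", box=(2, 2, 6, 6))
enlarged = run_action(region, "zoom", factor=2)
marked = run_action(image, "draw_box", box=(2, 2, 6, 6))
```

Because every action takes an image and returns a new image, actions can be chained, for example cropping a region and then zooming into it, which is the kind of step-by-step inspection that high-resolution benchmarks call for.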
One particularly interesting finding is the emergent ability of compositional generalization across visual and other tools. During testing, Step 3.7 Flash seamlessly combined visual tools with non-visual ones to accomplish complex tasks, despite never having been explicitly guided toward such compositional tool use during training.
Visual Reasoning with Python Tool
Compositional Usage across Visual and Non-visual Tools
Operating graphical user interfaces (GUIs) is another foundational visual capability for an agentic model: many real-world tasks live beyond the chat box and the CLI, and require the agent to see, click, and verify. We extend Step 3.7 Flash with GUI operation, in particular for the Phone-use stack, so that it can complete long-horizon tasks across multiple apps. On the Android Daily benchmark, Step 3.7 Flash substantially improves on last year's Step-GUI in stability, robustness, and long-horizon completion, and outperforms other, larger models.
Score of Android Daily Benchmark
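The "see, click, and verify" cycle mentioned above can be sketched as a small control loop. This is a toy illustration under stated assumptions, not the model's actual GUI stack: `FakePhone` is a hypothetical stand-in for a device whose "screenshot" is just a screen name, and the fixed `plan` replaces the decisions a model would make from real screenshots.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class FakePhone:
    """Hypothetical device stub: tapping a target may move to another screen."""
    screen: str = "home"
    transitions: Dict[Tuple[str, str], str] = field(
        default_factory=lambda: {
            ("home", "open_notes"): "notes",
            ("notes", "new_note"): "editor",
        }
    )

    def screenshot(self) -> str:
        # Assumption: the observation is reduced to the current screen name.
        return self.screen

    def tap(self, target: str) -> None:
        # Unknown targets leave the screen unchanged, like a missed tap.
        self.screen = self.transitions.get((self.screen, target), self.screen)


def run_task(
    phone: FakePhone,
    plan: List[Tuple[str, str]],
    max_retries: int = 2,
) -> bool:
    """Execute (target, expected_screen) steps, verifying after every tap."""
    for target, expected in plan:
        for _ in range(max_retries + 1):
            phone.tap(target)                   # act
            if phone.screenshot() == expected:  # see and verify
                break
        else:
            return False  # the step never reached the expected screen
    return True


phone = FakePhone()
completed = run_task(phone, [("open_notes", "notes"), ("new_note", "editor")])
```

Verifying after each action, rather than only at the end, is what lets a long-horizon task recover from or stop at a failed step instead of continuing from a wrong screen.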
The same compositional pattern we observed across visual tools also surfaces here: in the following case, after writing a piece of frontend code, the model autonomously turned to the GUI to test the page it had just produced — inspecting the rendered output, exercising interactive elements, and iterating on its own code based on what it saw. Again, this code-and-GUI compositional behavior was never explicitly demonstrated or rewarded during training, yet emerges robustly in test-time use.