We present lf-lean, a verified translation of all 1,276 statements of the Logical Foundations textbook from Rocq to Lean, produced by frontier AI with ~2 person-days of human effort versus an estimated ~2.75 person-years manually (a 350× speed-up). We achieve this through task-level specification generators: because many software transformations are semantics-preserving, correctness can be defined once for an entire task class and checked automatically across all instances and codebases. This scales human oversight from \(\mathcal{O}(n)\) to \(\mathcal{O}(1)\) regardless of program complexity. Placed on METR’s time horizon graph, our result suggests verified software engineering is advancing faster than expected, and lf-lean gives us an encouraging measurement of where verified software engineering capability stands.
Introduction
As AIs automate increasingly complex software tasks, a fundamental tension emerges: how do we know the code they produce is correct? The standard approach of reviewing AI-generated code and its tests doesn’t scale. Human review effort grows proportionally with code volume, while AI code generation capacity grows exponentially. If this trend continues, we’re headed toward a world where code deployment is bottlenecked on human review capacity, or just deployed without review.
This post introduces a different approach: task-level specification generators. Instead of reviewing each piece of generated code individually, we review a correctness specification once for an entire class of tasks, and the correctness specification covers all instances of the task across any codebase. The key insight is that for many important software transformations (translation, optimization, refactoring), we can define what “correct” means generically, without reference to any specific codebase. Moreover, as codebase complexity increases, a lot of feature implementation is a semantics-preserving transformation of existing features, so the correctness specification for the new feature is a simple corollary of the correctness specification for the existing features. Thus, building task-level specification generators helps scale oversight in two ways: (1) you can easily work on many more codebases simultaneously, and (2) what works on simple codebases extends to arbitrarily complex ones.
We prototype task-level specification generation with rocq-dove, building a scalable environment for verified translation from Rocq to Lean.
Given any Rocq source code, rocq-dove automatically generates a correctness specification (a theorem that any valid translation must satisfy) and grades model-generated translations and proofs against it.
Because the specification is derived from the source code itself, rocq-dove scales to arbitrary codebases without additional human effort.
We use rocq-dove to generate an environment for a verified translation of all 1,276 statements in Logical Foundations [1].
Using our scaffold, frontier AI models autonomously produced verified translations of 97% of the statements.
We solved 6 extreme-difficulty statements manually (~2 person-days), after which models completed the remaining 33 statements.
For context, we estimate the full task would have taken 3 person-months for translation plus 2.5 person-years for verification if done entirely by humans.
Two key world-view updates:
- Verified software engineering can scale without marginal human oversight. rocq-dove automatically generated correctness specifications for the 1,276 translations in lf-lean. No human judgment was needed to define “correct” for each case. Combined with rich feedback from the proof assistant, this creates a dense reward signal that lets AI iterate toward correct solutions autonomously.
- AI capabilities on software verification are advancing faster than expected. We developed lf-lean, the first sufficiently complex evaluation to demonstrate this. Compared to METR’s time horizon trend [2] on unverified software engineering, verified software engineering appears to be catching up.
The implication is that software verification, long considered too expensive for practical software development, may become the more scalable approach in an era of AI-assisted programming. When humans can’t review all the code AI produces, having machines verify correctness automatically becomes not just nice to have, but necessary.
In what follows, we explain task-level specification generators, the conceptual framework that lets correctness checking scale without marginal human effort.
We then describe rocq-dove, our prototype environment for verified translation from Rocq to Lean, and how we applied it to produce lf-lean.
We present details of how we achieved a 350× speedup over manual verification, and make the case for scaling verified software engineering.
Task-Level Specification Generators
The scaling problem with AI coding is that humans can only review so much. As models generate more code more quickly, human oversight becomes the bottleneck.
One response is to make review faster—better tooling, smarter sampling, and more efficient workflows. This can mitigate the immediate issue, but doesn’t change the fundamental scaling problem: a human must understand each codebase and write codebase-specific verification. Oversight effort remains linear: \(\mathcal{O}(n)\) where \(n\) is the number of tasks. This is akin to writing a formal specification for every codebase.
The other response is to make review amortizable. For certain task classes \(\mathcal{T}\) (e.g., translation, optimization, parallelization, refactoring), correctness can be defined generically, without reference to individual task details. A human reviews the specification generator for \(\mathcal{T}\) once; the system then automatically verifies any task instance. Human effort becomes \(\mathcal{O}(1)\) with respect to codebase count. Concretely, a specification generator is a function that takes a particular instance of a task (or a description thereof) and outputs a specification for that specific instance.
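As a minimal sketch, a specification generator can be thought of as a higher-order function: reviewed once per task class, then applied mechanically to every instance. All types and names below are hypothetical illustrations; rocq-dove's real interface differs.

```python
from dataclasses import dataclass

# Hypothetical types for illustration; rocq-dove's real interface differs.
@dataclass
class TaskInstance:
    source: str          # e.g. Rocq source code for one statement

@dataclass
class Spec:
    theorem: str         # correctness theorem any solution must satisfy

def make_spec_generator(task_class: str):
    """Reviewed by a human once per task class; O(1) effort thereafter."""
    def generate(instance: TaskInstance) -> Spec:
        # In rocq-dove, this step derives an isomorphism theorem
        # directly from the source; here it is a stand-in string.
        return Spec(theorem=f"{task_class}_correct({instance.source!r})")
    return generate

# The same generator covers every instance across any codebase.
translation_spec = make_spec_generator("translation")
spec = translation_spec(TaskInstance(source="Definition id := fun x => x."))
```

The point of the factoring is that review effort attaches to `make_spec_generator`, not to any individual `TaskInstance`.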
When building a new toolchain, the ideal problem is useful, simple, and representative. We started with translation, which hits all three: there is real demand for porting codebases between languages [3], it is the simplest semantics-preserving transformation, and it is paradigmatic of the generalized challenge of software verification, namely establishing semantic equivalence. Semantic equivalence often looks the same regardless of the exact specification or implementation, and verified translation captures this core difficulty, making it an effective task for training on software verification.
rocq-dove: A Task-Level Specification Generator for Translating from Rocq to Lean
We developed a prototype task-level specification generator for the task of verified software translation from Rocq to Lean.
Unlike an unverified software translation which yields only a translated version of the code, a verified translation produces a (translation, proof) pair. The proof then serves as a guarantee that the translation is semantically equivalent to the source. Even if the translation is correct, the pair is rejected if the proof does not verify its correctness.
In this section, we describe the environment setup used for every codebase and the details of how we define typewise isomorphism.
Environment
We describe the general environment for verified translation from Rocq to Lean.
- Specification Generation: rocq-dove derives a theorem statement \(\texttt{Thm}_{\texttt{Rocq}}\) directly from the Rocq source \(\texttt{Src}_{\texttt{Rocq}}\). The theorem states that the translation is typewise isomorphic to the source; we describe nuances of the definition in the Typewise Isomorphism section.
- Translation: The AI agent translates \(\texttt{Src}_{\texttt{Rocq}}\) into \(\texttt{Target}_{\texttt{Lean}}\) using \(\texttt{Thm}\) as a guide.
- Round-Trip Compilation: \(\texttt{Target}_{\texttt{Lean}}\) is compiled back into Rocq using rocq-lean-import [4], yielding \(\texttt{RoundTrip}_{\texttt{Rocq}}\). This step is necessary because the correctness specification \(\texttt{Thm}\) is stated as an equivalence between two Rocq terms.
- Proof Generation: The AI agent constructs a proof \(\texttt{Proof}_{\texttt{Rocq}}\) of \(\texttt{Thm}\) demonstrating the equivalence \(\texttt{Src}_{\texttt{Rocq}} \cong \texttt{RoundTrip}_{\texttt{Rocq}}\).
- Grading: The rocq-dove grader checks that the proofs have three properties: (1) they are valid, (2) they adhere to the specified theorem statements, and (3) they avoid relying on additional unsupported assumptions. If checking fails, the agent iterates on the translation or the proof until it passes.
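The steps above can be sketched as a generate-and-grade loop. Every function below is a hypothetical stub (the real pipeline shells out to Rocq, Lean, and rocq-lean-import); the point is the control flow: the agent iterates until the grader accepts a (translation, proof) pair.

```python
# Hypothetical stubs standing in for the real toolchain.
def generate_spec(src_rocq: str) -> str:          # step 1: rocq-dove
    return f"Thm[{src_rocq}]"

def translate(src_rocq: str, thm: str) -> str:    # step 2: AI agent
    return f"Lean[{src_rocq}]"

def round_trip(target_lean: str) -> str:          # step 3: rocq-lean-import
    return f"Rocq[{target_lean}]"

def prove(thm: str, src: str, rt: str) -> str:    # step 4: AI agent
    return f"Proof[{thm}]"

def grade(thm: str, proof: str) -> bool:          # step 5: rocq-dove grader
    return proof == f"Proof[{thm}]"

def verified_translate(src_rocq: str, max_iters: int = 5):
    thm = generate_spec(src_rocq)
    for _ in range(max_iters):
        target = translate(src_rocq, thm)
        proof = prove(thm, src_rocq, round_trip(target))
        if grade(thm, proof):
            return target, proof   # accepted (translation, proof) pair
    raise RuntimeError("no verified translation found")

target, proof = verified_translate("Definition two := 2.")
```

Note that only the grader decides acceptance; the agent's output is never trusted directly.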
Typewise Isomorphism
Since Rocq and Lean differ slightly in behavior, we require a notion of “equivalence” that’s neither too strict nor too loose. We cannot use strict equality, as writing the same source code for a datatype definition in different files results in datatypes that are not technically “equal”. As such, we define typewise isomorphism as follows:
Dealing with Types and Values
For datatypes, the standard relaxation of equality is isomorphism: \(A\) and \(B\) are isomorphic (\(A \cong B\)) when we have functions \(\text{to} : A \to B\) and \(\text{from} : B \to A\) such that \(\text{to} \circ \text{from} = \text{id}\) and \(\text{from} \circ \text{to} = \text{id}\). We can then define equivalence (“valid translation”) for values via transport:
\[a \cong b := \text{iso.to}(a) = b\]
where \(a : A\), \(b : B\), and \(\text{iso} : A \cong B\).
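In Lean 4 notation, this definition might be sketched as follows (a standalone sketch for illustration, not the rocq-dove development itself):

```lean
-- An isomorphism between types A and B.
-- (`from` is a reserved keyword in Lean 4, hence `from_`.)
structure Iso (A B : Type) where
  to      : A → B
  from_   : B → A
  to_from : ∀ b, to (from_ b) = b
  from_to : ∀ a, from_ (to a) = a

-- Equivalence of values via transport along a chosen isomorphism:
-- a ≅ b  iff  iso.to a = b.
def valEquiv {A B : Type} (iso : Iso A B) (a : A) (b : B) : Prop :=
  iso.to a = b
```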
The Problem with Propositions
Unfortunately, this isn’t enough. Since isomorphism is defined as a reversible mapping, all true propositions are isomorphic. For example, \(1 = 1\) and \(2 = 2\) are both true; in a proof-irrelevant setting like Lean, simply mapping any proof of one to any proof of the other yields an isomorphism. We would not want to say that “True” is a valid translation of “my C++ compiler is correct” just because both statements are provable.
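To see the problem concretely, here is a Lean 4 sketch: any two provable propositions can be mapped back and forth, so plain isomorphism collapses them.

```lean
-- Under proof irrelevance, any two provable propositions are
-- "isomorphic": map any proof of one to a fixed proof of the other.
theorem trivialEquiv (P Q : Prop) (hp : P) (hq : Q) : P ↔ Q :=
  ⟨fun _ => hq, fun _ => hp⟩

-- `1 = 1` and `2 = 2` become interchangeable, which is why plain
-- isomorphism is too permissive as a notion of valid translation.
example : (1 = 1) ↔ (2 = 2) := trivialEquiv (1 = 1) (2 = 2) rfl rfl
```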
Dealing with Type-Level Functions
We need to instead enforce that the types line up structurally, which we can do by relationally lifting isomorphism along type-level functions. For example, consider translation of a type-level function \(F : \texttt{Type} \to \texttt{Type} \to \texttt{Type}\). Instead of saying that a valid translation of the type \(F(A, B)\) is any type \(X\) with \(F(A, B) \cong X\), we say that a valid translation of \(F(A, B)\) is any \(F'(A', B')\) satisfying \(A \cong A'\), \(B \cong B'\), and for all \(C\), \(C'\), \(D\), and \(D'\),
\[(C \cong C') \to (D \cong D') \to (F(C, D) \cong F'(C', D'))\]
Dealing with Polymorphic Functions
Sometimes, we must apply the appropriate kind of equivalence based on what the object in question is.
For example, consider the function if_then_else_ : ∀ {ρ}, bool → ρ → ρ → ρ that returns its first or second argument of type ρ depending on whether its boolean argument is true or false.
Then the relation if_then_else_ ≅ if_then_else_' would involve = if ρ is int, but would involve ≅ if ρ is Type instead.
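The lifted condition for a binary type-level function from the previous subsection can be written out in Lean 4 as follows (a self-contained sketch; the actual rocq-dove definition handles universes and more general arities):

```lean
structure Iso (A B : Type) where
  to      : A → B
  from_   : B → A
  to_from : ∀ b, to (from_ b) = b
  from_to : ∀ a, from_ (to a) = a

-- A valid translation F' of F must send isomorphic arguments to
-- isomorphic results, not merely satisfy F A B ≅ X pointwise.
def LiftedIso (F F' : Type → Type → Type) : Prop :=
  ∀ (C C' D D' : Type),
    Nonempty (Iso C C') → Nonempty (Iso D D') →
    Nonempty (Iso (F C D) (F' C' D'))
```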
Prop vs. SProp and Universes
Most projects in Rocq prove theorems whose types are Props, meaning that we cannot prove that all proofs of these theorems are equal.
However, because Lean only contains proof-irrelevant propositions, rocq-lean-import maps Lean’s Prop to Rocq’s SProp, which is like Prop but with computational proof irrelevance.
Luckily, for software verification projects, axiomatizing Prop = SProp is consistent and sufficient to deal with such issues.
There are also subtleties around universe levels and cumulativity which we elide in this post.
Making lf-lean
While there has been tremendous progress in AI theorem-proving focused on mathematics [6, 7, 8], our focus is on reasoning about the correctness of software, e.g., programs involving data structures, interpreters, and operational semantics. Software verification differs from formal mathematics in important ways: abstractions are more porous, requiring thorough reasoning about codebases with significantly more dependencies. The proof methods required are also different: brute-force case analysis is more common, and there is a much tighter tie between theorem statement, proof generation methodology, and proof. Yet precisely because verification has historically been so expensive, very few real-world projects have been verified, and correspondingly few benchmarks exist for evaluating AI capabilities in this domain.
The translation task of rocq-dove addresses this gap. Because it can generate correctness specifications from any Rocq source, it turns existing codebases into evaluation and training environments for AI software verification without additional human effort. We applied it to the 1,276 Rocq theorem statements from Logical Foundations [1], an interactive textbook covering functional programming, type theory, and formal program reasoning. While textbook problems are not real-world software engineering, the textbook’s pedagogical progression, from simple definitions to deeply interconnected proofs, makes it a natural starting point for measuring how AI handles increasing complexity.
The statements in Logical Foundations are not standalone problems. The textbook comprises 17 modules forming a deep dependency graph:
- Over 90% of statements depend on at least one earlier statement.
- Dependency chains reach up to 61 total dependencies.
To our knowledge, this makes lf-lean the first AI software verification result that includes software with dependencies.
This dependency structure creates a practical problem. A natural but brittle approach would be monolithic translation: asking the model to translate and verify a target statement plus its entire dependency tree in one pass. Empirically, this fails:
- We estimate monolithic translation would yield less than 20% success, as the required context quickly exceeds what any model can reliably hold and manipulate.
- Rollouts where a model solved more dependencies were exponentially rarer, with roughly a 3× reduction in successful problem count for every additional dependency solved.
- We observed 0% performance when a model attempted more than 9 dependencies at once.
The fix mirrors how large software systems stay manageable: modular decomposition. Instead of forcing the model to re-derive the whole dependency tree, we treat previously verified dependencies as trusted interfaces. We provide the model with dependency files that have already been translated and verified. Then the model only needs to translate and verify the marginal statement (the new component) against its interface. The proof assistant enforces that interfaces between components match exactly. If each component verifies, the whole dependency graph is correct by construction.
This is just standard software engineering practice applied to verification: simple components, low coupling between them, and cohesive proof obligations that can be discharged independently.
Beyond evaluation, we hope the translation itself proves directly useful: with the rising demand for Lean, it should be valuable to PL students. Moreover, since we worked on statement translation, not proof translation, the Lean statements serve as an additional benchmark for evaluating AIs.
Verified Translation at Scale
Without any human intervention, AI agents autonomously generated verified translations of 97% (1,237 / 1,276) of the theorem statements. The remaining 39 statements were blocked by just 6 “extreme” difficulty statements (0.5% of the total). These 6 required manual solving, taking approximately 15 hours of human effort. After being provided with those six solutions, models autonomously completed the remaining 33 statements, achieving 100% coverage of the benchmark.
To put this in perspective: we estimate that manually translating Logical Foundations would have taken 3 person-months, and formally proving the correctness of those translations would have taken an additional 2.5 person-years. Instead, in this workflow, the human effort required was ~15 hours—an over 350× speedup on verification effort alone. We annotated each statement by difficulty based on inference compute required; LoC is the strongest predictor, with a sharp cliff at 17+ LoC where hard and extreme problems dominate (see Appendix E).
To situate these results in the broader trajectory of AI software engineering, we adapt METR’s Time Horizon framework, which measures task complexity by how long it would take a human to complete the task.
lf-lean gives us our first measurement of where verified software engineering capability actually is, and the early signal is surprisingly encouraging. In Figure 1, circles represent AI model releases evaluated on unverified software engineering tasks (METR’s time horizon methodology), while diamonds represent software verification benchmarks where AIs must both implement and formally prove correctness.
We include AlphaVerus [13], FVAPPS [14], VerinaBench [15], and lf-lean, which are benchmarks where AIs succeed at not just program implementation, but also software verification.
Note that the plot indicates the time horizon for just the program implementation component, rather than both implementation and verification, as we want a direct comparison of the human labor that can be automated. To approximate their respective time horizons, we conducted human baseline tests on the hardest solved problems of each of the formal verification projects. We note that the comparison for verified engineering is still a slight overestimate of feasibility, given that humans are slower at programming in languages amenable to verification.
We used OLS on the three preexisting verification benchmarks to fit the y-intercept of METR’s time horizon exponential curve to get the lower curve. We used OLS to fit an exponential to the four verification benchmarks (including lf-lean) to get the improved curve.
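Concretely, fitting an exponential time-horizon curve reduces to ordinary least squares on log-horizons. A minimal sketch with made-up data points (the real fit uses the benchmark dates and horizons from Figure 1):

```python
import math

def fit_exponential(ts, hs):
    """OLS fit of h(t) = exp(a + b*t), via linear regression on log h."""
    logs = [math.log(h) for h in hs]
    n = len(ts)
    tbar, ybar = sum(ts) / n, sum(logs) / n
    b = (sum((t - tbar) * (y - ybar) for t, y in zip(ts, logs))
         / sum((t - tbar) ** 2 for t in ts))
    a = ybar - b * tbar
    return a, b

# Made-up (release year, time horizon in hours) points for illustration:
# horizons quadruple each year, so the doubling time is half a year.
a, b = fit_exponential([2023.0, 2024.0, 2025.0], [0.1, 0.4, 1.6])
doubling_time = math.log(2) / b   # years per doubling; here 0.5
```

Fitting only the intercept (as we do for the lower curve) amounts to fixing `b` at METR's published slope and choosing `a` to minimize the same squared log-error.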
Conclusion
Conventional wisdom is that software verification is infeasible. But infeasible has always meant too expensive, not impossible. The cost is measured in LoC: verification artifacts (specifications, proofs, and auxiliary infrastructure) typically dwarf the implementation itself by almost an order of magnitude. Across landmark projects from seL4 [9] to CompCert [10], the median productivity is roughly 2,558 verified LoC per person-year (Appendix B), and our own translation required approximately 215k lines of isomorphism proofs to verify the translation of 6k lines of Rocq statements into 25k lines of Lean (Appendix A). Verification has never failed because it is conceptually hard; it has failed because it multiplies the volume of code that humans must write and maintain.
We are now entering a regime with a new calculus of feasibility.
LoC generated are becoming effectively free, while human review is the bottleneck.
In this regime, verified software engineering may actually be more scalable than unverified software engineering, because the oversight itself is automated.
Models are not yet good enough at verification to fully realize this vision.
But lf-lean demonstrates that the gap is closing.
The oversight benefit is realizable by scaling RL on software verification. Most RL environments for code generation rely on programmatic rewards or model-grading. Programmatically written unit tests provide sparse feedback, because test coverage isn’t meaningfully correlated with the number of bugs found (Appendix C). And model-graded rewards are not robust against goodharting or spurious correlations resulting in catastrophic out-of-distribution generalization.
Given the appetite for scaling software complexity, the only feedback on correctness that scales with it is feedback that itself leverages model capability.
In rocq-dove, we leverage model-generated proofs, guided by the rich feedback of proof assistants.
Moreover, by building task-level specification generators, we significantly amortize the investment in the task specification, reusing the same environment across arbitrarily many codebases.
We are optimistic that this type of RL environment—dense signal, unbounded task complexity, scalable reuse—can drive rapid capability gains in verified software engineering.
Appendix
A. lf-lean by the Numbers
Logical Foundations contains approximately 6k lines of Rocq statements. On average, each statement has 6.4 dependencies with a total of 24.8 LoC (61 dependencies with a total of 382 LoC at maximum). We translated this to approximately 25k lines of Lean code (≈ 3 person-months of work) and about 215k lines of Rocq isomorphism proofs (≈ 2.5 person-years of work). Our models produced 1.5M lines of Lean and 7M lines of Rocq proofs in the process of doing the translation.
Our environment, rocq-dove, ensures that all isomorphism proof tasks are essentially independent and parallelizable, having interdependencies only through the constraints imposed on the Lean translation.
As a result, while the time horizon on verified software engineering ranges from 5 hours (METR) to 3 person-months (lf-lean), the time horizon on the verification proofs themselves is best measured in years.
B. Human Productivity in Software Verification
To estimate how long the lf-lean project would have taken a human team, we need reliable baselines for two distinct activities: writing translation code and writing formal proofs of correctness.
Translation productivity. For the code-writing component (translating Rocq definitions and theorem statements into Lean), we draw on unverified software engineering productivity estimates. Published industry figures typically place productive output at 50–100 kLoC/person-year for a developer working in a familiar language and codebase (a range that likely already reflects AI-assisted workflows). We calibrated against internal time trials on lf-lean statements and cross-referenced with these published figures, adopting the upper end of the range (100 kLoC/year) to produce a conservative lower bound on human time. Since our Lean translation comprises approximately 25 kLoC, this yields an estimate of roughly 3 person-months for translation alone. We note this likely underestimates the true effort: Lean is less widely known than mainstream languages, the translation requires understanding both Rocq and Lean type theory, and the translator must make non-trivial design decisions about how to represent Rocq idioms in Lean.
Verification productivity. For the proof-writing component, no single industry benchmark exists, so we surveyed representative large-scale mechanized verification projects to establish a historical baseline. Projects were included if they:
- reported non-trivial mechanized verification artifacts (e.g., in Rocq, Isabelle/HOL, F*, or Dafny);
- provided sufficient quantitative data in peer-reviewed papers, technical reports, or authoritative project documentation; and
- corresponded to a clearly identifiable verification effort rather than tooling-only or purely theoretical work.
Multiple entries for the same codebase (e.g., original vs. extended CompCert) were treated as distinct data points when reported separately in the literature.
Across these projects, we measured verified LoC per person-year, that is, how many LoC of implementation a team could formally verify per person-year of combined effort on code, specifications, and proofs. The resulting median is approximately 2,558 LoC/person-year, reflecting the PhD-level expertise and intensive manual effort traditionally required. Canonical examples include seL4 [9], the first formally verified OS microkernel, and CompCert [10], a verified optimizing C compiler (see Figure 5). We note that the surveyed projects vary in domain and proof assistant, introducing additional uncertainty into this median estimate.
Applying these baselines to lf-lean. The lf-lean release includes approximately 215 kLoC of Rocq isomorphism proofs covering the 1,276 translated statements. At the historical median verification rate, verifying the ~6 kLoC of Rocq statements would require roughly 2.5 person-years of effort (6,000 ÷ 2,558 ≈ 2.3). Combined with the ~3 person-months for translation, we estimate the full verified translation would have taken approximately 2.75 person-years. In contrast, our actual human effort was approximately 15 hours (2 person-days), spent on 6 extreme-difficulty statements that the models could not solve autonomously. This represents a speedup of over 350× on the verification effort and roughly 30× on the translation effort alone.
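The arithmetic behind these estimates can be reproduced directly (all figures are from this appendix; the 8-hour, 250-day working-year conversion is our assumption):

```python
impl_loc = 6_000          # Rocq statement LoC to be verified
verify_rate = 2_558       # verified implementation LoC per person-year (median)
lean_loc = 25_000         # Lean translation LoC
translate_rate = 100_000  # LoC per person-year (upper-end estimate)

verify_years = impl_loc / verify_rate        # ~2.3 person-years
translate_years = lean_loc / translate_rate  # 0.25 person-years ~ 3 months
total_years = verify_years + translate_years # ~2.6 person-years

actual_hours = 15
actual_years = actual_hours / (8 * 250)      # assumed 2,000-hour work year
speedup = total_years / actual_years         # ~350x overall
```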
C. Unit Testing Is Insufficient for Rigorous Analysis
Unit tests are generally insufficient for frontier software engineering tasks, where bug count typically grows non-linearly with codebase size and complexity. Bugs often result from subtle, high-interaction causes that are challenging for unit tests to cover. Expanding test suite size is insufficient: studies have found only a low to moderate correlation between bugs found and code coverage with unit tests [16], and, after controlling for codebase size and complexity, no correlation between code coverage and bugs found [17].
More anecdotally, Google’s OSS-Fuzz [18] tool continuously tests popular and well-maintained open source libraries and has still found thousands of bugs and vulnerabilities.
D. Compositional Verification
In a monolithic approach, verifying a program \(P\) requires fitting the entire codebase into a single context and reasoning about it all at once. As program size grows, this becomes intractable: verification effort scales with total complexity and the required context exceeds any model’s window. We call this monolithic verification.
For programs with modular structure, correctness can be established piecewise. Decompose \(P\) into components \(c_1, c_2, \ldots, c_k\) with explicit interfaces. Verify each \(c_i\) independently; the proof assistant checks that interfaces between dependent components match exactly. If all components verify, the whole is correct by construction. This bounds verification context to \(\mathcal{O}(\max_i |c_i|)\) rather than \(\mathcal{O}(|P|)\). We call this compositional verification. We move from verifying whole programs to verifying components against interfaces.
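This compositional loop can be sketched as follows (hypothetical helper names; in rocq-dove the checking is done by the Rocq proof assistant): statements are processed in topological order, and each is verified only against its already-verified dependencies, bounding the context to the largest single component.

```python
def verify_all(statements, deps, verify_one):
    """statements: ids in topological order; deps: id -> dependency ids;
    verify_one(stmt, interfaces) -> verified artifact or None."""
    verified = {}
    for s in statements:
        # Trusted interfaces: dependencies already checked, so they
        # need not be re-verified when verifying this component.
        interfaces = {d: verified[d] for d in deps[s]}
        artifact = verify_one(s, interfaces)
        if artifact is None:
            raise RuntimeError(f"component {s} failed to verify")
        verified[s] = artifact
    return verified  # whole graph correct by construction

# Toy usage: three statements with a small dependency chain.
result = verify_all(
    ["add", "add_comm", "add_assoc"],
    {"add": [], "add_comm": ["add"], "add_assoc": ["add"]},
    lambda s, ifaces: f"proof_of_{s}",
)
```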
Classical heuristics for well-designed software map directly onto proof tractability. Let \(\text{LoC}_{\text{proof}}(c)\) denote lines of proof required for component \(c\):
- Simplicity → low \(\text{LoC}_{\text{proof}}(c_i)\) for each component
- Low coupling → proofs for \(c_i\) and \(c_j\) proceed independently
- High cohesion → unrelated functionality has disjoint proof obligations
In lf-lean, 90% of statements depend on at least one prior statement.
Rather than requiring the model to verify the full dependency tree simultaneously, we provide previously verified dependencies as trusted interfaces.
The task reduces to checking that the new component matches the existing interface, compositionally building toward the full solution.
E. Measuring Difficulty and Difficulty Annotations
While dependency count captures one axis of difficulty, Lines of Code (LoC) provides a more universal and fine-grained measure. LoC has many nice properties: it is objective, easy to measure, and strongly correlated with token count on model-generated code. Higher LoC problems consume the context window and introduce more opportunities for mistakes in the solution.
In our modular setup, we refine this to marginal LoC: the new code the model must understand and translate to solve a given statement, given that its dependencies are already verified.
We did not run clean evaluations specifically for difficulty annotations. Instead, our easy, medium, hard, and extreme ratings are based on how much inference compute was necessary to solve each statement. Concretely, we considered (1) how many best-of-k runs were needed to obtain a successful verified translation, and (2) how many different evaluation setups we had to try before hitting these numbers. Extreme problems were solved by a human.
Acknowledgements
We are grateful to Jake Mendel for encouraging us to work on hard problems.
This work builds on impressive projects and ideas. Gaëtan Gilbert made rocq-lean-import, and Benjamin Pierce wrote the Software Foundations textbooks to educate generations of PL students. Kaiyu Yang and Quinn Dougherty generously provided hard problems from VerinaBench and FVAPPS, respectively, for our time horizon baseline. Thomas Kwa helped us understand the METR result methodology.
Author Contributions
Jason led the project, and developed the specification generator and grader in rocq-dove.
Vishesh and Jeffrey co-developed the AI agent harness; Vishesh contributed to the interactive visualizer, and Jeffrey contributed to the blog post.
Harshikaa built infrastructure for analyzing agent performance and contributed to the AI agent harness.
Vaidehi designed diagrams and collected data for the appendix baselines.
Robert developed the human-generated solutions for extreme-difficulty statements, and developed the time horizon study.
Twm, Shiki, and Jacob extracted the initial dataset of Rocq problems from the Software Foundations textbook.
Lawrence advised on blog post writing and presentation of the results.
Rajashree co-led the project, and developed the task-level specification generator framework and wrote the blog post.
References
1. Pierce, B. C. et al. Software Foundations, Volume 1: Logical Foundations. softwarefoundations.cis.upenn.edu, 2024.
2. METR. Measuring AI Ability to Complete Long Tasks. METR Blog, 2025.
3. Fleishman, G. It's COBOL all the way down. Increment, 2018.
4. Gilbert, G. rocq-lean-import: Import Lean exported files into Rocq. github.com/rocq-community/rocq-lean-import, 2025.
5. Ullrich, S. et al. lean4export: Plain-text declaration export for Lean 4. github.com/leanprover/lean4export, 2025.
6. AlphaProof and AlphaGeometry Teams, Google DeepMind. AI achieves silver-medal standard solving International Mathematical Olympiad problems. Google DeepMind Blog, 2024.
7. Achim, T. et al. Aristotle: IMO-level Automated Theorem Proving. arXiv:2510.01346, 2025.
8. AxiomMath. AxiomProver at Putnam 2025. github.com/AxiomMath/putnam2025, 2025.
9. Klein, G. et al. seL4: Formal Verification of an OS Kernel. SOSP, 2009.
10. Leroy, X. Formal verification of a realistic compiler. Communications of the ACM, 2009.
11. Gu, R. et al. CertiKOS: An Extensible Architecture for Building Certified Concurrent OS Kernels. OSDI, 2016.
12. Zinzindohoué, J. K. et al. HACL*: A Verified Modern Cryptographic Library. CCS, 2017.
13. Aggarwal, P., Parno, B., & Welleck, S. AlphaVerus: Bootstrapping Formally Verified Code Generation through Self-Improving Translation and Treefinement. ICLR, 2025.
14. Dougherty, Q. & Mehta, R. Proving the Coding Interview: A Benchmark for Formally Verified Code Generation. LLM4Code Workshop, ICSE, 2025.
15. Sun, Y. et al. Verina: Benchmarking Verifiable Code Generation. arXiv:2505.23135, 2025.
16. Inozemtseva, L. & Holmes, R. Coverage is not strongly correlated with test suite effectiveness. ICSE, 2014.
17. Kochhar, P. S. et al. Practitioners' Expectations on Automated Fault Localization. ISSTA, 2017.
18. Ding, Z. Y. & Le Goues, C. An Empirical Study of OSS-Fuzz Bugs. MSR, 2021.
Please cite as:
“lf-lean: The frontier of verified software engineering.” Theorem Blog, February 2026. https://theorem.dev/blog/lf-lean/
BibTeX:
@article{lf-lean,
author = {Jason Gross and Vishesh Saraswat and Jeffrey Chang and Harshikaa Agrawal and Vaidehi Agarwalla and Robert Zhang and Twm Stone and Jacob Green and Shiki Vaahan and Lawrence Chan and Rajashree Agrawal},
title = {\texttt{lf-lean}: The frontier of verified software engineering},
journal = {Theorem Blog},
year = {2026},
month = {feb},
url = {https://theorem.dev/blog/lf-lean/}
}