Toward automated verification of unreviewed AI-generated code

原始链接: https://peterlavigne.com/writing/verifying-ai-generated-code

## From code review to verifying AI-generated code

This experiment explores using AI-generated code in production *without* line-by-line human review. The author shifts the focus from "reviewing" code to "verifying" its correctness, however that correctness is achieved.

The core approach: have an AI solve a simplified FizzBuzz problem, then subject the solution to strict automated checks: property-based testing (ensuring the code behaves as expected across a wide range of inputs), mutation testing (identifying weaknesses in the test suite), and a side-effect check. Standard Python linting and type-checking are also enforced.

The results suggest these constraints can establish enough trust in the generated code that readability may matter less, treating the output more like a compiled binary. For now, setting up the constraints takes more effort than simply reading the code, but the author argues that improvements in AI agents and tooling will eventually make this approach practical.

The experiment highlights a shift toward automated verification, acknowledging related work on formal verification and "software factories" aimed at zero human code review. The code and tests are available to experiment with.

Hacker News discussion of "Toward automated verification of unreviewed AI-generated code" (peterlavigne.com): 7 points by peterlavigne, 1 hour ago, 1 comment.

jghn: I think GenAI will lead to an increase in mutation testing, property-based testing, and fuzzing. But it's worth remembering that there are reasons these techniques haven't already caught on. One issue is that they can be very computationally expensive, especially mutation testing.

Original article

I've been wondering what it would take for me to use unreviewed AI-generated code in a production setting.

To that end, I ran an experiment that has changed my mindset from "I must always review AI-generated code" to "I must always verify AI-generated code." By "review" I mean reading the code line by line. By "verify" I mean confirming the code is correct, whether through review, machine-enforceable constraints, or both.

I had a coding agent generate a solution to a simplified FizzBuzz problem. Then, I had it iteratively check its solution against several predefined constraints:

(1) The code must pass property-based tests (see Appendix B for a primer). This constrains the solution space to ensure the requirements are met. This includes tests verifying that no exceptions are raised and tests verifying that latency is sufficiently low.

(2) The code must pass mutation testing (see Appendix C for a primer). Mutation testing is typically used to expand your test suite. However, if we assume our tests are correct, we can instead use it to restrict the code. This constrains the solution space to ensure that only the requirements are met.

(3) The code must have no side effects.

(4) Since I'm using Python, I also enforce type-checking and linting, but a different programming language might not need those checks.
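The "no side effects" constraint in (3) can itself be machine-enforced, for instance by asserting that a call writes nothing to stdout. A minimal sketch of that idea, using a stand-in `fizzbuzz` implementation (not the repo's actual code or check):

```python
import contextlib
import io

def fizzbuzz(n: int) -> str:
    # Stand-in implementation for illustration only.
    if n % 15 == 0:
        return "FizzBuzz"
    if n % 3 == 0:
        return "Fizz"
    if n % 5 == 0:
        return "Buzz"
    return str(n)

def test_no_stdout_side_effects() -> None:
    # Capture anything written to stdout during the call.
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        result = fizzbuzz(15)
    assert result == "FizzBuzz"
    assert buffer.getvalue() == ""  # the call printed nothing

test_no_stdout_side_effects()
```

A fuller check would also cover file writes, network calls, and mutation of arguments; stdout capture is just the simplest mechanically enforceable slice of "no side effects."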

These checks seem sufficient for me to trust the generated code without looking at it. The remaining space of invalid-but-passing programs exists, but it's small and hard to land in by accident.

I was concerned that the generated code would be unmaintainable. However, I'm starting to think that maintainability and readability aren't relevant in this context. We should treat the output like compiled code.

The overhead of setting up these constraints currently outweighs the cost of just reading the code. But it establishes a baseline that can be chipped away at as agents and tooling improve.

The repo fizzbuzz-without-human-review implements these checks in Python, allowing you to try this for yourself.


Appendix B: Primer on property-based testing

Software tests commonly check specific inputs against specific outputs:

def test_returns_fizzbuzz_for_multiples_of_3_and_5() -> None:
    assert fizzbuzz(15) == "FizzBuzz"
    assert fizzbuzz(30) == "FizzBuzz"

Property-based tests run against a wider range of values. The property-based test below (using Hypothesis) runs fizzbuzz with 100 semi-random multiples of both 3 and 5, favoring "interesting" cases like zero or extremely large numbers.

from hypothesis import given, strategies as st

@given(n=st.integers(min_value=1).map(lambda n: n * 3 * 5))
def test_returns_fizzbuzz_for_multiples_of_3_and_5(n: int) -> None:
    assert fizzbuzz(n) == "FizzBuzz"

Compared to testing specific input, this approach gives us more confidence that a given "property" of the system holds, at the cost of being slower, nondeterministic, and more complex.
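The no-exceptions and latency properties mentioned in the main text can be expressed the same way. A stdlib-only sketch of the idea, using seeded random sampling in place of Hypothesis, a stand-in `fizzbuzz`, and a hypothetical 1 ms latency budget:

```python
import random
import time

def fizzbuzz(n: int) -> str:
    # Stand-in implementation for illustration only.
    if n % 15 == 0:
        return "FizzBuzz"
    if n % 3 == 0:
        return "Fizz"
    if n % 5 == 0:
        return "Buzz"
    return str(n)

def test_never_raises_and_is_fast() -> None:
    rng = random.Random(0)  # seeded for reproducibility
    for _ in range(100):
        n = rng.randint(1, 10**9)
        start = time.perf_counter()
        result = fizzbuzz(n)  # property: no exception for any valid input
        elapsed = time.perf_counter() - start
        assert isinstance(result, str)
        assert elapsed < 0.001  # hypothetical 1 ms latency budget

test_never_raises_and_is_fast()
```

Hypothesis expresses the same properties more powerfully, since it also shrinks failing inputs to a minimal counterexample.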

For additional information, the Hypothesis docs are a good starting point.


Appendix C: Primer on mutation testing

Mutation testing tools like mutmut change your code in small ways, like swapping operators or tweaking constants, then re-run your test suite. If your tests fail, the "mutant" code is "killed" (good), and if your tests pass, the mutant "survives" (bad).

As an example, consider the following code:

def double(n: int) -> int:
    print(f"DEBUG n={n}")
    return n * 2

def test_doubles_input():
    assert double(3) == 6

Mutating print(f"DEBUG n={n}") to print(None) leaves test_doubles_input passing, so the mutant survives. You would fix it by removing the side effect or adding a test for it.

Appendix D: Acknowledgements

Thanks to Taha Vasowalla and other reviewers for their feedback on an early draft of this post.
