Meta's new LLM-based test generator

Original link: https://read.engineerscodex.com/p/metas-new-llm-based-test-generator

Meta's "Automated Unit Test Improvement using Large Language Models at Meta" proposes an innovative approach to improving unit tests, strengthening software reliability quickly and effectively. By employing LLMs that operate under specific guidelines, the system not only generates reliable, effective test cases but also verifies non-regression and measurable improvement, saving time and resources previously spent on manual verification. Processing filters covering buildability, test execution, flakiness prevention, and coverage improvement ensure that only the highest-value tests get through, avoiding duplicated effort or wasted resources and keeping out faulty test cases that could harm software performance and reliability. Overall, Meta's implementation demonstrates the potential of using LLMs to automate the testing process, offering insight into aspects of the software that were previously unexplored. The practical applications described also offer actionable takeaways for developers and engineering managers seeking to improve the efficiency, security, and functionality of their products and services. Finally, it helps address two key software challenges, flakiness and missing tests, making test suites more comprehensive and robust and raising overall confidence in the software.

In terms of practical application, AI tools can improve the efficiency and effectiveness of software engineering in several ways:

1. Code generation: AI tools can automatically generate code for common, repetitive logic, reducing the manual effort needed to implement a given feature. This can significantly cut development time and improve consistency across an application.

2. Testing: AI-generated tests can surface edge cases early in the process, improving overall code quality. These tests can also catch potential defects earlier in development, improving product reliability and lowering post-release maintenance costs.

3. Performance analysis: AI algorithms can pinpoint bottlenecks in complex systems, enabling optimized code paths that reduce computation time and streamline resource usage.

However, AI tools cannot fully replace traditional software engineering practices such as careful design decisions, architectural choices, a thorough understanding of constraints, and adherence to best practices. For the best results, software engineers must combine AI-driven insights with their own critical reasoning.

There are also considerations to keep in mind when using AI in software engineering:

1. Data privacy: With data-regulation laws emerging worldwide, deploying complex AI algorithms can raise sensitive privacy concerns among consumers and regulators. Engineers and organizations must ensure appropriate protocols for data collection, storage, and transfer.

2. Intellectual-property infringement: While open-source code repositories can make the engineering process more cost-effective, copyright violations in code, particularly involving AI-derived content, can create liability risks that negatively affect a company's valuation. Engineers must put appropriate risk-management strategies in place to mitigate this exposure.

Overall, adopting powerful AI techniques is not necessarily simple, but it offers an exciting opportunity to transform the software engineering process by providing fast, accurate feedback loops for key aspects of the software lifecycle: functionality, performance, and security.

Original article

Engineer’s Codex is a publication about real-world software engineering.

Meta recently released a paper called “Automated Unit Test Improvement using Large Language Models at Meta”. It’s a good look at how Big Tech is using AI internally to make development faster and software less buggy. For example, Google is using AI to speed up code reviews.

A major win of this paper is that while it integrates LLMs into a developer's workflow, it also recommends fully-formed software improvements that are verified to be both correct and an improvement to current code coverage. Compare this to GitHub Copilot, where suggestions still have to be manually verified by a human - and we all know that debugging code is twice as hard as writing it.

Meta claims that "this is the first paper to report on LLM-generated code that has been developed independent of human intervention (other than final review sign off), and landed into large scale industrial production systems with guaranteed assurances for improvement over the existing code base."

Furthermore, there are solid principles that developers can take away in order to use AI effectively themselves.

Table of Contents (total read time: 7 minutes):

  • Key Points (1 minute read)

  • Stats (1 minute read)

  • Actionable Takeaways: if you’re short on time, just read this! (3 minute read)

  • How TestGen-LLM Works (2 minute read)

SWE Quiz is for ambitious developers who want to make sure their software fundamentals are rock solid. Reveal gaps in your knowledge and make sure you really know what you’re doing both at work and while interviewing.

Take the tests for authentication, caching, databases, API Design, and more.

Check out SWE Quiz

TestGen-LLM uses an approach called ‘Assured LLM-based Software Engineering’ (Assured LLMSE), using private, internal LLMs that are probably fine-tuned with Meta’s codebase. This means that it uses LLMs to generate code improvements that are backed by verifiable guarantees of improvement and non-regression.

TestGen-LLM uses an ensemble approach to generate code improvements. This means that it uses multiple LLMs, prompts, and hyper-parameters to generate a set of candidate improvements, and then selects the best one. This approach can help to improve the quality of the generated improvements.
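
As a rough illustration of what that ensemble fan-out could look like, here is a minimal sketch. The function names, prompt templates, and temperature values below are hypothetical placeholders, not Meta's actual API:

```python
# Hypothetical sketch of ensemble-style candidate generation: fan out over
# several prompts and temperatures, then let downstream filters decide which
# candidates survive. Not Meta's actual implementation.
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Candidate:
    prompt_name: str
    temperature: float
    test_code: str


def generate_candidates(
    call_llm: Callable[[str, float], str],    # (prompt, temperature) -> generated test code
    existing_test_class: str,
    prompt_templates: Dict[str, str],         # e.g. "extend_test_class", "add_corner_cases"
    temperatures: Tuple[float, ...] = (0.0, 0.4, 0.8),
) -> List[Candidate]:
    """Collect every candidate from every (prompt, temperature) combination."""
    candidates: List[Candidate] = []
    for name, template in prompt_templates.items():
        prompt = template.format(test_class=existing_test_class)
        for temp in temperatures:
            candidates.append(Candidate(name, temp, call_llm(prompt, temp)))
    return candidates
```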

TestGen-LLM is specifically designed to improve existing human-written tests rather than generate code from scratch.

TestGen-LLM has been integrated into Meta's software engineering workflows. This means that it can be used to automatically improve tests as part of the development process. It would be cool to see some screenshots of how exactly it’s integrated, but the paper doesn’t provide any.

These stats are either direct quotes or paraphrased quotes from the paper.

  • In the paper’s evaluation on the Reels and Stories products for Instagram, 75% of TestGen-LLM’s generated test cases built correctly, 57% passed reliably, and 25% increased coverage.

  • TestGen-LLM was able to improve 10% of all classes to which it was applied, and 73% of its test improvements were accepted by developers and landed into production.

  • In a “test-a-thon” between engineers, where various Meta engineers created tests in order to increase Instagram’s test coverage, “the median number of lines of code added by a TestGen-LLM test was 2.5.”

  • However, one test case “hit the jackpot” and covered 1,326 lines. This is a really important stat, which I elaborate on below.

  • All improved cases generated during the “test-a-thon” did “cover at least one additional valid corner case, such as an early return and/or special processing for special values such as null and empty list.”

TestGen-LLM is a good example of how LLMs can be used to improve dev productivity and software reliability in a time-efficient manner. There are a few takeaways I got from reading this paper that give us a look both to how Big Tech is implementing LLMs internally and to how any developer or engineering manager reading this can use LLMs in a more productive manner. (Note: many of these are my own personal opinions that I’ve taken away from the paper.)

Small context windows and scattered dependencies make LLMs nearly unusable for non-boilerplate solutions in large codebases. Aside from any privacy concerns, it’s not feasible to paste multiple files of code into an LLM when there could be 20+ dependencies from across a codebase in a C++ header file (as an example). Even if you do paste in multiple files, there is a time and cognitive cost to actually using and trying the code an LLM outputs, whether in a chat window or in the code editor via GitHub Copilot.

The price of extra cognitive load cannot be overstated. Hacker News commenters find the inaccuracies of GPT-based tooling exhausting, and the tools themselves unreliable. This is where verifying that outputs are both valid and non-regressive becomes extremely important.

This means that for a long-term productivity boost in large codebases, improvements will probably come in incremental, specialized use cases, like test generation and automatic suggestions during code reviews. These are also low risk ways to save cumulative developer time. Basically, “GPT wrappers” will continue to be useful 🙂.

The real value of LLMs here is displayed through the edge cases. The paradox of writing good code is that nobody ever gets credit for fixing problems that never happened.

Will Wilson writes:

“The fundamental problem of software testing… is that software has to handle many situations that the developer has never thought of or will never anticipate. This limits the value of testing, because if you had the foresight to write a test for a particular case, then you probably had the foresight to make the code handle that case too. This makes conventional testing great for catching regressions, but really terrible at catching all the “unknown unknowns” that life, the universe, and your endlessly creative users will throw at you.”

Most of the test cases created by Meta’s TestGen-LLM only covered an extra 2.5 lines. However, one test case covered 1,326 lines! That single test case is worth far more than a typical one and dramatically increases the value of TestGen-LLM as a whole. LLMs can vigorously “think outside the box,” and the value of catching unexpected edge cases is very high here.

In fact, it’s so high that Antithesis, the startup founded by the creators of FoundationDB, is built entirely on the premise that software testing edge cases are best found by AI. For reference, FoundationDB was acquired by Apple and is the basis for the billions of databases behind Apple iCloud.

Base model LLMs aren’t “plug-n-play” and shouldn’t reasonably be expected to be. Sure, they might output pristine React and Tailwind CSS code, but that’s a narrow use case in most production codebases. They need a fair amount of processing and filtering for code generation tasks that require correctness. Part of this processing means grounding LLMs with examples. Google and Meta both make suggestions based on existing code, where the results are much, much better than raw generation. LLMs used in production should take ideas from how Meta processes and filters LLM outputs, and most outputs should be expected to be discarded.
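
A minimal sketch of what “grounding with examples” can look like in practice, assuming a hypothetical prompt template (the paper does not publish its internal prompts, so the wording below is a guess):

```python
# Illustrative prompt grounding: feed the class under test and its existing
# test class into the prompt so the model extends real code rather than
# generating from scratch. The template wording is hypothetical, not Meta's.
EXTEND_TESTS_PROMPT = """\
Here is a class under test:

{class_under_test}

Here is its existing test class:

{existing_test_class}

Write additional test methods for corner cases the existing tests miss
(for example: null inputs, empty lists, early returns). Output only new
test methods that fit into the existing test class.
"""


def build_grounded_prompt(class_under_test: str, existing_test_class: str) -> str:
    """Fill the template with real source code to ground the generation."""
    return EXTEND_TESTS_PROMPT.format(
        class_under_test=class_under_test,
        existing_test_class=existing_test_class,
    )
```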

LLMs do work best integrated into workflows. This is a reason why GitHub Copilot is so popular and another reason why Google’s Workspace integrations are a great idea. Asking a chatbot works great for certain use cases, like debugging and boilerplate code generation, but chatbots can often fail at more complex use cases.

TestGen-LLM applies a series of semantic filters to candidate solutions generated by Meta’s internal LLMs, making sure that only the most valuable tests are preserved. Here’s how it works:

Filter 1: Buildability: Initially, TestGen-LLM checks if the generated code can be built within the app's existing infrastructure. Any code that fails to build is immediately discarded.

Filter 2: Execution (does the test pass?): Next, the system runs the tests that passed the buildability filter. Any test that doesn't pass is discarded. This step is crucial because, without a way to automatically determine the validity of a failing test (whether it's due to a bug or an incorrect assertion), TestGen-LLM opts to keep only those tests that can be used for regression testing (aka making sure they can protect current code against future regressions). 

Filter 3: Flakiness: To address the issue of flakiness (tests that pass or fail inconsistently under the same conditions), TestGen-LLM employs repeated execution. A test must pass consistently across multiple (five) executions to be considered non-flaky.

Filter 4: Coverage Improvement: Finally, to ensure that new tests actually add value, TestGen-LLM evaluates them for their contribution to test coverage. Tests that do not enhance coverage by exploring new code paths or conditions are discarded. Only tests that provide new insights or protect against regressions are kept.
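
Putting the four filters together, the selection logic amounts to a simple chain of checks. Here is a minimal sketch, with placeholder callables standing in for whatever build, test, and coverage tooling a codebase actually uses:

```python
# Sketch of the filter chain described above: build -> pass -> flakiness ->
# coverage. The callables are placeholders for real build/test/coverage tools.
from typing import Callable, List


def passes_filters(
    test_code: str,
    builds: Callable[[str], bool],          # Filter 1: does it build?
    run_test: Callable[[str], bool],        # Filter 2: does the test pass?
    coverage_gain: Callable[[str], int],    # Filter 4: newly covered lines
    repeat_runs: int = 5,                   # Filter 3: repeated executions
) -> bool:
    if not builds(test_code):
        return False                        # discard anything that fails to build
    if not all(run_test(test_code) for _ in range(repeat_runs)):
        return False                        # discard failing or flaky tests
    return coverage_gain(test_code) > 0     # keep only tests that add coverage


def select_tests(candidates: List[str], **checks) -> List[str]:
    """Keep only the candidate tests that survive every filter."""
    return [t for t in candidates if passes_filters(t, **checks)]
```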

These processing filters are pretty important, as they guarantee improvements to a test suite. They also show that LLMs are very far from being “plug-and-play.”

The tests that successfully pass through all these filters are guaranteed to enhance the existing test suite, offering reliable regression testing capabilities without duplicating effort or wasting resources. Pre- and post-processing steps in TestGen-LLM facilitate the extraction and reconstruction of test classes, streamlining the integration of new tests into the software development workflow.
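
The paper doesn’t spell out those pre- and post-processing steps in detail, but their rough shape is: clean up the raw model output and splice the surviving test methods back into the existing test class. A purely illustrative, string-based sketch (real tooling would parse the code properly):

```python
# Illustrative post-processing: remove markdown code fences from raw LLM
# output and insert the generated test methods before the test class's
# closing brace. String manipulation only; a real tool would use a parser.
FENCE = chr(96) * 3  # a triple-backtick markdown fence marker


def strip_code_fences(llm_output: str) -> str:
    """Drop any code-fence lines the model wrapped around its output."""
    lines = [ln for ln in llm_output.splitlines() if not ln.strip().startswith(FENCE)]
    return "\n".join(lines).strip()


def splice_into_test_class(existing_class: str, new_test_methods: str) -> str:
    """Reconstruct the test class with the new methods added at the end."""
    head, brace, tail = existing_class.rpartition("}")
    return head + "\n" + new_test_methods + "\n" + brace + tail
```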

This paper is a good formalization of a use case that many devs probably already use LLMs like ChatGPT, Gemini, and Mistral/LLaMA for. Keeping it in writing is a good way of tracking the progress of future improvements on LLMs in the software reliability space. Unit tests are probably the lowest, most basic level of code generation where LLMs have the most immediate value, but as time goes on, we’ll definitely see LLMs be able to catch and test for bugs in increasingly complex software systems.

The question is - will that make software easier to develop in the long run or will it lead to a proliferation of software complexity in the future?
