Evolution is still an effective machine learning technique
Harper Evolves

Original link: https://elijahpotter.dev/articles/harper_evolves

## Harper can now evolve its grammar rules

A new system called "The Ripper" lets the grammar checker Harper automatically *evolve* its rules, dramatically improving development speed and accuracy. Previously, creating and refining grammar rules was a tedious, iterative process for developers. The Ripper works by generating large numbers of random expressions (small programs that recognize language patterns), testing them against a curated dataset, and then "mutating" the best performers to create a new generation. The process resembles artificial selection and converges rapidly on highly accurate rules, often *more* accurate than hand-written ones. A laptop can currently process 90,000 candidate expressions per second and produce results within minutes. Future work focuses on automating dataset creation (possibly with LLMs) and streamlining workflows for handling rules in bulk, with the ultimate goal of a system that can translate style guides directly into Harper rules. This represents a 500-1,000% improvement in rule-creation efficiency without compromising performance.

Original text

I want you to read that title as literally as possible. Harper is now capable of evolution.

This past week, I've been working on a system that should allow us to handle more complex grammatical cases and contexts, faster. I believe it will improve our ability to add new grammatical rules to Harper by somewhere between 500% and 1,000%.

To top it off, this system does it without slowing Harper itself down or increasing the memory footprint.

Let's get into it.

The Problem

There are several distinct methodologies at play when Harper goes about grammar checking. Which strategy is used depends on the grammatical rule in question. Today, we're interested in expression rules.

For the curious, I have recently written a reflection on expression rules, as well as a guide for anyone interested in producing them. This post, however, will not recount information I've already written on this blog.

By count, expression rules make up the majority of grammatical rules Harper is currently capable of detecting. This is because they are fast, easy to write, and most importantly, easy to review.

There are, however, occasional hiccups that I encounter when tackling a problem. The English language is tricky, and it often contradicts itself. I will often write a rule that covers a certain case, only to find that it doesn't cover all cases. I can iterate, but doing so quickly becomes tedious and time-consuming.

The Solution

Last week, I threw in the towel. I was tired of iterating ceaselessly towards a goal, only to have a new one to tackle after that. So I decided I would let the computer iterate for me.

Harper's expressions are essentially small programs which are able to identify the locations of given patterns in natural language. They are constructed at runtime, but they run exceedingly fast because they tend to be amenable to modern branch prediction. We can use this fact to our advantage.
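As a loose illustration of that idea (the names and structure here are hypothetical sketches, not Harper's actual Rust API), an expression can be modeled as a sequence of token predicates applied across a sentence:

```python
# Toy model of an "expression": a list of token predicates that reports
# where in a sentence the whole pattern matches.
# (Illustrative only -- Harper's real expressions are Rust types.)

def matches_at(pattern, tokens, start):
    """True if every predicate in `pattern` accepts the corresponding
    token, beginning at index `start`."""
    if start + len(pattern) > len(tokens):
        return False
    return all(pred(tok) for pred, tok in zip(pattern, tokens[start:]))

def find_matches(pattern, sentence):
    """Return the start index of every location where the pattern matches."""
    tokens = sentence.lower().split()
    return [i for i in range(len(tokens)) if matches_at(pattern, tokens, i)]

# A pattern flagging "could/would/should of" (a common error for "... have"):
pattern = [lambda t: t in {"could", "would", "should"}, lambda t: t == "of"]

print(find_matches(pattern, "You should of seen it"))  # [1]
```

Because each predicate is a cheap, predictable check, a long run of them tends to be friendly to the branch predictor, which is roughly why expressions of this shape can run so fast.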

When generating an expression that detects a particular grammatical rule, the new system (which I've called The Ripper) follows three steps.

  1. Generate N random Harper expressions
  2. Score the performance of these expressions by testing them against a curated dataset. The dataset contains labeled rows of sentences that do and do not contain the grammatical rule of interest.
  3. Take the best K expressions and mutate them to be left with L new "child" expressions. Go back to step 2.
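The three steps above can be sketched in miniature. In this toy version (a stand-in for The Ripper's actual implementation, which operates on Harper's real expression types), an "expression" is just a set of trigger words plus a required following word, scored by its accuracy on a small labeled dataset:

```python
import random

# Toy labeled dataset: (sentence, does it contain the target error?)
LABELED = [
    ("you should of seen it", True),
    ("i could of gone alone", True),
    ("they would of helped us", True),
    ("you should have seen it", False),
    ("we have lots of time", False),
    ("it is made of wood", False),
]

VOCAB = sorted({w for s, _ in LABELED for w in s.split()})

def random_expr():
    # An "expression" is (set of allowed first words, required next word).
    return (frozenset(random.sample(VOCAB, 2)), random.choice(VOCAB))

def mutate(expr):
    first, second = expr
    if random.random() < 0.5:
        return (first ^ {random.choice(VOCAB)}, second)  # toggle a trigger word
    return (first, random.choice(VOCAB))                 # swap the follower

def flags(expr, sentence):
    first, second = expr
    words = sentence.split()
    return any(a in first and b == second for a, b in zip(words, words[1:]))

def score(expr):
    # Step 2: fraction of labeled rows the expression classifies correctly.
    return sum(flags(expr, s) == label for s, label in LABELED) / len(LABELED)

def evolve(n=200, k=5, children=100, generations=30):
    # Step 1: generate N random expressions.
    population = [random_expr() for _ in range(n)]
    for _ in range(generations):
        # Step 3: keep the best K parents, breed L mutated children, repeat.
        parents = sorted(population, key=score, reverse=True)[:k]
        population = parents + [mutate(random.choice(parents)) for _ in range(children)]
    return max(population, key=score)
```

On this dataset a perfect expression exists, namely `({"could", "should", "would"}, "of")`, which `score` rates at 1.0; because the best parents survive each generation, fitness can only go up as the loop runs.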

That's it! We're essentially treating expressions as living creatures and subjecting them to artificial selection. It works remarkably well.

Since these datasets are handcrafted (or generated by an LLM), they don't need to be large. Plus, the expressions themselves are quite fast to generate and test, so we can do so at an exceptional rate.

My laptop is able to churn through about 90 thousand candidates per second, allowing us to converge on an acceptable result in just a few minutes. Given more time, it's able to produce an expression rule that is more accurate than what I could write myself.

What's Next?

I intend to spend some time optimizing the process, particularly for the human element. I'd like to be able to create batches of these datasets and let The Ripper take care of them all at once, overnight or on a beefy server in the cloud.

I'd also like to set up automated workflows for piping data from an LLM directly into the Ripper. Ideally, I want this system to get to a point where I can feed information from a style guide into an LLM and get a guaranteed functioning Harper expression rule out of it.
