Honda: 2 Years of ML vs. 1 Month of Prompting - Here's What We Learned

Original link: https://www.levs.fyi/blog/2-years-of-ml-vs-1-month-of-prompting/

## From SQL to LLMs: Transforming Warranty Claim Classification

Major automakers face enormous costs from vehicle recalls, which drives the need for efficient warranty-claim analysis. For decades, one company relied on complex SQL queries to categorize these claims, but the steady evolution of vehicle issues, and of the language used to describe them, led to inaccurate classifications and analyst bottlenecks. In 2023, a project set out to automate the process with supervised machine learning. Although the initial XGBoost models showed promise, data scarcity, rooted in a lengthy labeling process and shifting priorities, prevented full deployment. More recently, the team revisited large language models (LLMs) and found markedly better performance and cost-effectiveness. Through six rounds of prompt engineering with Nova Lite, they matched or even surpassed the existing XGBoost models in four of five claim categories. The shift is fundamental: classification is no longer limited by data availability but by instruction quality. LLMs offer a flexible solution for dynamic domains where data is scarce or requirements change quickly, effectively replacing a slow, data-dependent process with a faster, iterative prompt-based approach.


November 7, 2025

Recalls at major automakers cost hundreds of millions of dollars a year. It’s a huge issue. To mitigate it, our company created an analytics department solely focused on categorizing warranty claims into actionable problems.

For decades, this team has relied on SQL queries to classify warranty data. But vehicles—and the language used to describe them—have evolved. SQL struggles with semantics, negations, and contextual nuance. Here’s a fictional example of a claim we might see in the wild:

“Customer reports oil on driveway, thought engine leak. Detailed inspection found no engine leaks. Traced oil to spill during last oil change. Oil on subframe dripping to ground. Cleaned subframe, verified no leaks from engine or drain plug. Customer advised.”

An oversimplified SQL query that might try to capture this scenario:

SELECT
    claim_id,
    claim_text,
    -- crude keyword matching: flag any claim mentioning an oil/fluid leak,
    -- unless it also contains a simple negation phrase
    CASE
        WHEN (
            (LOWER(claim_text) LIKE '%leak%'
             OR LOWER(claim_text) LIKE '%leaking%'
             OR LOWER(claim_text) LIKE '%seep%')
            AND
            (LOWER(claim_text) LIKE '%oil%'
             OR LOWER(claim_text) LIKE '%fluid%')
            AND LOWER(claim_text) NOT LIKE '%no leak%'
            AND LOWER(claim_text) NOT LIKE '%not leaking%'
        )
        THEN 1
        ELSE 0
    END AS is_leak
FROM warranty_claims;

What we can gather from this example is that the leak came from a service oil spill—not the vehicle. Yet this query would still flag it as a leak. In production, these types of queries balloon into hundreds—if not thousands—of similar clauses. Over the years, the team created thousands of classification buckets. Many of these legacy buckets still siphon off claims today—creating unnecessary work for analysts and slowing down the detection of new issues.


## The classification project

In 2023, the company launched a major initiative to automate warranty classification using supervised models. Here’s how that went:

  1. Data Collection: The first challenge was establishing a ground truth. Each team member had different mental models of how claims should be categorized. After months of discussion, the team finally aligned on a set of core “symptoms” to categorize warranty claims by. Then came the hard part: manually labeling thousands of complex claims per symptom—work that only domain experts could handle. After many months we had labeled just half the symptoms.

  2. Preprocessing: Raw warranty text is messy—full of acronyms, error codes, and multilingual input.

    “cust reports mil on with p0420. tech found a/c compressor clutch noise at idle. checked map sensor, reading normal. replaced cat converter per tsb. dtc cleared, road test ok.”

    Translation: Customer reports a check engine light. Technician found an unrelated AC compressor issue. The catalytic converter was replaced per a technical service bulletin. Problem resolved.

    We built a 9-stage preprocessing pipeline: text sanitization, concatenation, tokenization, acronym expansion, stop-word removal, spell checking, service bulletin extraction, diagnostic code parsing, and translation. That took another 6 months.

    Fun fact: Translating French and Spanish claims into German first improved technical accuracy—an unexpected perk of Germany’s automotive dominance.

  3. Modeling: We tried multiple vectorization and classification approaches. Our data was heavily skewed toward negative cases. TF-IDF with 1-gram features paired with XGBoost consistently emerged as the winner; see the PR curve attached below [1]. A simplified sketch of the preprocessing and modeling steps follows this list.
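
The post doesn't include the pipeline code, so the following is only a minimal sketch of the setup described above. It assumes a labeled pandas DataFrame with hypothetical columns `claim_text` and `is_leak`, and uses a toy acronym map as a stand-in for the full 9-stage preprocessing pipeline.

```python
# Minimal sketch of the approach described above (not the production pipeline).
import re

import pandas as pd
import xgboost as xgb
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Tiny stand-in for the acronym-expansion stage; the real map is much larger.
ACRONYMS = {"mil": "malfunction indicator lamp", "dtc": "diagnostic trouble code",
            "tsb": "technical service bulletin", "cust": "customer"}

def expand_acronyms(text: str) -> str:
    """Lowercase the claim and expand known acronyms (one of the 9 stages)."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return " ".join(ACRONYMS.get(t, t) for t in tokens)

df = pd.read_parquet("warranty_claims.parquet")  # hypothetical labeled dataset
X = df["claim_text"].map(expand_acronyms)
y = df["is_leak"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

model = Pipeline([
    # 1-gram TF-IDF features, as in the winning configuration above
    ("tfidf", TfidfVectorizer(ngram_range=(1, 1), min_df=2)),
    # scale_pos_weight counteracts the heavy skew toward negative cases
    ("clf", xgb.XGBClassifier(
        n_estimators=300,
        scale_pos_weight=(y_train == 0).sum() / max((y_train == 1).sum(), 1),
        eval_metric="aucpr",
    )),
])

model.fit(X_train, y_train)
scores = model.predict_proba(X_test)[:, 1]
print("PR AUC:", average_precision_score(y_test, scores))
```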

Getting to production was another challenge. Migrating everything to the cloud, building a UI for our analytics team, onboarding vendors, and coordinating with IT—the project stretched across multiple years. Our plan was to deploy the first 10 models, gather real world feedback, and resume labeling for the remaining symptoms. But once the initial batch of classifiers went live, project priorities shifted: the scope expanded to deploying all classifiers, while the team previously helping with annotation had moved on to new initiatives.

We suddenly had a data scarcity problem. How do you deploy models without training data? Even with renewed labeling efforts, it would’ve taken months to label new datasets. We needed a faster, more flexible solution.


## What about large language models?

We actually tried few-shot prompting with GPT-3.5 at the start of this project—but the results were disappointing: low accuracy, high latency, and prohibitive costs. Fast forward two years, and the landscape had radically changed. Modern models were faster, cheaper, and showed strong few-shot performance across various domains. That raised a question: could we get within 5% of our purpose-built classifiers?

To find out, we benchmarked 6 frontier models against our baseline using 5 labeled datasets ranging from broad symptoms like leak and noise to narrow ones like cut-chip. With our data skewed toward negative cases, we chose PR AUC as the primary metric, backed by Matthews Correlation Coefficient (MCC) and F1. The early results? XGBoost still led by ~15% on average, especially on the hardest tasks, though LLMs showed promise in broader categories. (See chart below.)

Radar chart showing model performance across categories
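
For concreteness, scoring a single model run on one labeled symptom dataset with the metrics named above could look like the snippet below. The names `y_true`, `y_prob`, and the 0.5 threshold are illustrative, not the team's actual evaluation harness.

```python
import numpy as np
from sklearn.metrics import average_precision_score, f1_score, matthews_corrcoef

def score_run(y_true: np.ndarray, y_prob: np.ndarray, threshold: float = 0.5) -> dict:
    """Score one model's predictions on one labeled symptom dataset."""
    y_pred = (y_prob >= threshold).astype(int)
    return {
        "pr_auc": average_precision_score(y_true, y_prob),  # primary metric for skewed data
        "mcc": matthews_corrcoef(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
    }
```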


## Closing the gap

When we factored in cost, Nova Lite was the clear value pick—third best PR AUC score, yet the second cheapest model [2]. So we pushed forward with it and began iterating on our prompts.

Our prompt tuning combined evaluation with reasoning. For each symptom, we ran Nova Lite on a stratified sample of labeled data, capturing two outputs: the prediction and its reasoning. We compared results to the ground truth, analyzed where prompts failed, and used those reasoning traces to identify gaps. Failure cases and the current prompt were then passed to a larger LLM to generate refinements. Each version was re-evaluated multiple times to confirm accuracy and rule out noise. See the step-by-step progression [3].
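
The post doesn't show the tuning harness, so this is only a rough sketch of the loop it describes. `classify_claim` and `propose_refinement` are hypothetical stand-ins for whatever API actually invokes Nova Lite and the larger "refiner" model.

```python
from sklearn.metrics import average_precision_score

def classify_claim(prompt: str, claim: str) -> tuple[float, str]:
    """Call the small model (e.g. Nova Lite); return (positive score, reasoning)."""
    raise NotImplementedError  # hypothetical model call

def propose_refinement(prompt: str, failures: list[dict]) -> str:
    """Ask a larger LLM to rewrite the prompt given the failure cases."""
    raise NotImplementedError  # hypothetical model call

def refine_prompt(prompt, claims, labels, rounds=6, runs_per_round=3):
    for round_idx in range(rounds):
        # Evaluate the current prompt several times to rule out noise.
        run_scores = []
        for _ in range(runs_per_round):
            results = [classify_claim(prompt, c) for c in claims]
            scores = [score for score, _ in results]
            run_scores.append(average_precision_score(labels, scores))

        # Collect failure cases (from the most recent run) with reasoning traces.
        failures = [
            {"claim": c, "label": y, "score": s, "reasoning": r}
            for c, y, (s, r) in zip(claims, labels, results)
            if (s >= 0.5) != bool(y)
        ]
        print(f"round {round_idx}: PR AUC ~ {sum(run_scores) / len(run_scores):.3f}, "
              f"{len(failures)} failures")

        # Feed the failures and the current prompt to a larger LLM for refinement.
        prompt = propose_refinement(prompt, failures)
    return prompt
```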

After 6 rounds of refinement, Nova Lite closed the performance gap and matched—or slightly beat—our supervised XGBoost model in 4 of 5 categories (cut-chip, deformed-misaligned, leak, and noise). The biggest leap was cut-chip, which improved 35 points and edged ahead of our baseline. Broader categories like noise and leak started strong and saw only marginal gains. Superficial-appearance remains the exception—still trailing by 12 points, which suggests it may need a different modeling approach entirely.

Bar chart comparing XGBoost vs Nova Lite performance


## So What?

Over multiple years, we built a supervised pipeline that worked. In 6 rounds of prompting, we matched it. That’s the headline, but it’s not the point. The real shift is that classification is no longer gated by data availability, annotation cycles, or pipeline engineering. The constraint moved—from collecting examples to writing instructions. That’s not a marginal improvement; it’s a different way of building classifiers.

Supervised models still make sense when you have stable targets and millions of labeled samples. But in domains where the taxonomy drifts, the data is scarce, or the requirements shift faster than you can annotate, LLMs turn an impossible backlog into a prompt iteration loop.

We didn’t just replace a model. We replaced a process.


[1] PR curve exploring various vectorization methods.

PR curve showing vectorization methods

[2] Price vs. performance table

| Model | Cost per 1M tokens | PR AUC |
|---|---|---|
| Claude Sonnet 4.5 | $3.00 | 0.722 |
| Claude Haiku 4.5 | $1.00 | 0.717 |
| Nova Lite | $0.06 | 0.716 |
| Llama 3.3 70B | $0.72 | 0.712 |
| Llama 4 Maverick 17B | $0.24 | 0.709 |
| Nova Micro | $0.04 | 0.600 |
| Llama 4 Scout 17B | $0.17 | 0.575 |

** All LLM prices shown are on-demand. Batch pricing is ~50% lower.

[3] Prompt by prompt progression

Chart showing prompt improvement progression
