If DSPy is so great, why isn't anyone using it?

原始链接: https://skylarbpayne.com/posts/dspy-engineering-patterns/

## DSPy: Why Isn't It Widely Used?

Despite DSPy's promise to simplify AI engineering, its downloads (4.7M) trail far behind LangChain's (222M). This isn't because DSPy is flawed, but because it is *hard to master*: it demands a different way of thinking about AI system design, one that prioritizes upfront abstraction over quick implementation. The author argues that many teams unknowingly recreate DSPy's core principles (typed inputs/outputs, modular code, prompt separation, and robust evaluation) through a painful iterative process. They illustrate this with a structured extraction task, showing how initial simplicity quickly escalates into complex, brittle code as prompt tweaking, structured outputs, error handling, RAG, and model switching get added. DSPy offers pre-built solutions to these common challenges, improving maintainability and making new models faster to test. But when the urgent need is simply to *make it work*, the initial learning curve feels steep. Ultimately, the author recommends embracing DSPy's underlying principles, even without the framework itself, to avoid reinventing the wheel and building less efficient AI systems.

## DSPy Adoption and Challenges: Hacker News Summary

A Hacker News discussion centered on why DSPy, a framework for optimizing large language model (LLM) prompts, hasn't seen wider adoption despite positive user feedback. The original author, sbpayne, notes that companies often end up building similar functionality themselves. Key reasons for low adoption include **the upfront investment required to create automated evaluation metrics**: DSPy shines when you can *measure* prompt improvements, which is not always straightforward. Users also voiced concerns about **prompt portability** (getting optimized prompts *out of* the framework) and a sense of being **locked in** to the DSPy ecosystem. Some commenters agreed that DSPy's code structure is beneficial even when full optimization isn't needed; others felt it pushes the "prompts as parameters" idea too far. There was debate over whether DSPy and related projects like RLM solve genuinely hard problems or are mainly a marketing exercise. Ultimately, the discussion highlighted the challenge of integrating new frameworks into existing workflows and the difficulty of quantifying LLM performance.

4.7M DSPy monthly downloads vs. 222M LangChain monthly downloads

For a framework that promises to solve the biggest challenges in AI engineering, this gap is suspicious. Still, companies using DSPy consistently report the same benefits.

They can test a new model quickly, even if their current prompt doesn't transfer well. Their systems are more maintainable. They focus on the context rather than the plumbing.

So why aren’t more people using it?

DSPy’s problem isn’t that it’s wrong. It’s that it’s hard. The abstractions are unfamiliar and force you to think a little bit differently. And what you want right now is not to think differently; you just want the pain to go away.

But I keep watching the same thing happen: people end up implementing a worse version of DSPy. I like to joke that there’s a Khattab’s Law now (after Greenspun’s Tenth Rule about Common Lisp):

Any sufficiently complicated AI system contains an ad hoc, informally-specified, bug-ridden implementation of half of DSPy.

You’re going to build these patterns anyway. You’ll just do it worse, after a lot of time, and through a lot of pain.


Let’s walk through how virtually every team ends up implementing their own “DSPy at home”. We’ll use a simple structured extraction task as a running example. Don’t let the simplicity of the example fool you, though; these patterns only become more important as the system becomes more complex.

Stage 1: Ship it

Say you need to extract company names from some text. You might start with the OpenAI API:

```python
from openai import OpenAI

client = OpenAI()

def extract_company(text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",  # example model name
        messages=[{"role": "user", "content": f"Extract the company name from: {text}"}],
    )
    return response.choices[0].message.content
```

It basically works. So you ship it and life is good.


Stage 2: “Can we tweak the prompt without deploying?”

But inevitably, Product will want to iterate faster. Redeploying for every prompt change is too annoying. So you decide to store prompts in a database:

```python
from openai import OpenAI

from myapp.config import get_prompt

client = OpenAI()

def extract_company(text: str) -> str:
    prompt_template = get_prompt("extract_company")
    prompt = prompt_template.format(text=text)
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```

Now you have a prompts table and a little admin UI to edit it. And of course you had to add version history after someone broke prod last Tuesday.


Stage 3: “It keeps returning garbage formats”

You notice the model sometimes returns "Company: Acme Corp" instead of just "Acme Corp". So you add structured outputs:

```python
from openai import OpenAI
from pydantic import BaseModel

class CompanyExtraction(BaseModel):
    company_name: str
    confidence: float

def extract_company(text: str) -> CompanyExtraction:
    prompt_template = get_prompt("extract_company_v2")
    response = client.chat.completions.parse(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt_template.format(text=text)}],
        response_format=CompanyExtraction,
    )
    return response.choices[0].message.parsed
```

You now have typed inputs and outputs and higher confidence the system is doing what it should.


Stage 4: “We need to handle failures”

After running for a while, you’ll notice transient failures like 529 errors or rare cases where parsing fails. So you add retries:

```python
from openai import OpenAI
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(stop=stop_after_attempt(3), wait=wait_exponential(min=1, max=10))
def extract_company(text: str) -> CompanyExtraction:
    prompt_template = get_prompt("extract_company_v2")
    response = client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt_template.format(text=text)}],
        response_format=CompanyExtraction,
    )
    return response.choices[0].message.parsed
```

Now each call has a bit more resilience. Though in practice you might fall back to a different provider, because retrying against an overloaded service returning 529s is a recipe for… another 529 error.
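In practice, that fallback can be as simple as wrapping two providers behind one callable. Here is a minimal sketch, not from the original post: `primary` and `fallback` stand in for hypothetical wrappers around, say, an OpenAI and an Anthropic client.

```python
import time
from typing import Callable

def with_fallback(primary: Callable, fallback: Callable,
                  retries: int = 3, base_delay: float = 0.5) -> Callable:
    """Try `primary` a few times with exponential backoff, then use `fallback`.

    Both arguments are hypothetical provider-call wrappers.
    """
    def call(*args, **kwargs):
        for attempt in range(retries):
            try:
                return primary(*args, **kwargs)
            except Exception:
                # Back off before retrying the overloaded provider.
                time.sleep(base_delay * (2 ** attempt))
        # All retries exhausted: switch providers instead of retrying forever.
        return fallback(*args, **kwargs)
    return call
```

Wrapping the two providers’ `extract_company` variants this way keeps the retry-then-switch policy out of the extraction logic itself.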


Stage 5: “Now we need RAG”

Eventually, you might start parsing more esoteric company names, and the model might not be good enough to recognize an entity as a company name. So you want to add RAG against known company information to help improve the extraction:

```python
from openai import OpenAI

def extract_company_with_context(text: str) -> CompanyExtraction:
    # Step 1: Write a RAG query
    query_prompt_template = get_prompt("extract_company_query_writer")
    query_prompt = query_prompt_template.format(text=text)
    query_response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": query_prompt}],
    )
    query = query_response.choices[0].message.content

    # Retrieve supporting documents
    query_embedding = embed(query)
    docs = vector_db.search(query_embedding, top_k=5)
    context = "\n".join([d.content for d in docs])

    # Step 2: Extract with context
    prompt_template = get_prompt("extract_company_with_rag")
    prompt = prompt_template.format(text=text, context=context)
    response = client.chat.completions.parse(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        response_format=CompanyExtraction,
    )
    return response.choices[0].message.parsed
```

Now we have two prompts: one to write the query and one to parse the company name. And we have introduced other parameters, like `k`, the number of retrieved documents. It’s worth noting that not all of these parameters are independent: since the retrieved documents feed into the final prompt, any change here affects overall performance.


Stage 6: “How do we know if this is getting better?”

You keep changing both prompts, the embedding model, k, and any parameter you can get your hands on to fix bugs as they are reported. But you’re never quite sure if your change completely fixed the issue. And you’re never quite sure if your changes broke something else. So you finally realize you need those “evals” everyone is talking about:

```python
def evaluate(dataset: list[dict]) -> dict:
    results = []
    for example in dataset:
        prediction = extract_company_with_context(example["text"])
        results.append({
            "correct": prediction.company_name == example["expected"],
            "confidence": prediction.confidence,
        })
    return {
        "accuracy": sum(r["correct"] for r in results) / len(results),
        "avg_confidence": sum(r["confidence"] for r in results) / len(results),
    }
```

Data extraction tasks are amongst the easiest to evaluate because there’s a known “right” answer. But even here, we can imagine some of the complexity. First, we need to make sure that the dataset passed in is always representative of our real data. And generally: your data will shift over time as you get new users and those users start using your platform more completely. Keeping this dataset up to date is a key maintenance challenge of evals: making sure the eval measures something you actually (and still) care about.
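One lightweight way to check that an eval set is still representative is to compare its label distribution against a recent sample of production traffic. This is a hypothetical sketch, not something from the original post:

```python
from collections import Counter

def distribution_drift(eval_labels: list[str], live_labels: list[str]) -> float:
    """Total variation distance between two label distributions.

    Returns 0.0 when the distributions match and 1.0 when they are disjoint.
    """
    def dist(labels: list[str]) -> dict[str, float]:
        counts = Counter(labels)
        total = len(labels)
        return {label: n / total for label, n in counts.items()}

    p, q = dist(eval_labels), dist(live_labels)
    labels = set(p) | set(q)
    return 0.5 * sum(abs(p.get(label, 0.0) - q.get(label, 0.0)) for label in labels)
```

If the drift score creeps up over time, that’s a signal to resample the eval set before trusting new accuracy numbers.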


Stage 7: “Let’s try Claude instead… oh no”

Inevitably, some company will release an exciting new model that someone will want to try. Let’s say Anthropic releases a new model. Unfortunately, your code is full of `client.chat.completions.create` calls, which won’t exactly work for Anthropic. Your prompts might not even work well with the new model.

So you decide you need to refactor everything:

```python
class LLMModule:
    def __init__(self, signature: type[BaseModel], prompt_key: str):
        self.signature = signature
        self.prompt_key = prompt_key

    def forward(self, **kwargs) -> BaseModel:
        prompt = get_prompt(self.prompt_key).format(**kwargs)
        return self._call_llm(prompt)

    def _call_llm(self, prompt: str) -> BaseModel:
        # Model-agnostic, with retries, parsing, validation
        ...

extract_company = LLMModule(
    signature=CompanyExtraction,
    prompt_key="extract_company_v3",
)
result = extract_company.forward(text="...")
```

You now have typed signatures, composable modules, swappable backends, centralized retry logic, and prompt management separated from application code.

Congrats! You just built a worse version of DSPy.


DSPy packages up the patterns every serious AI system ends up needing:

Signatures

Typed inputs and outputs. What goes in, what comes out, with a schema.

Modules

Composable units you can chain, swap, and test independently.

Optimizers

Logic that improves prompts, separated from the logic that runs them.
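To make the first two ideas concrete, here is a minimal, framework-free sketch of a signature and a module using only the standard library. The names are hypothetical, not DSPy’s actual API:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class CompanyExtraction:
    """The 'signature': a typed schema for what comes out of the call."""
    company_name: str
    confidence: float

class ExtractCompany:
    """The 'module': a composable unit with an injected, swappable backend."""
    def __init__(self, llm: Callable[[str], CompanyExtraction]):
        self.llm = llm

    def forward(self, text: str) -> CompanyExtraction:
        return self.llm(f"Extract the company name from: {text}")

# Because the backend is injected, the module is testable without any API call.
fake_llm = lambda prompt: CompanyExtraction("Acme Corp", 0.9)
result = ExtractCompany(fake_llm).forward("Acme Corp announced earnings today.")
```

Swapping `fake_llm` for a real provider wrapper changes nothing else in the calling code, which is exactly the separation of concerns DSPy enforces.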

These are just software engineering fundamentals. Separation of concerns. Composability. Declarative interfaces. But for some reason, many good engineers either forget about these or struggle to apply them to AI systems.


🔄

Weird feedback loops

You can't step through a prompt. The output is probabilistic. When it finally works, you don't want to touch it.

🚀

Pressure to ship

Getting an LLM to work feels like an accomplishment. Clean architecture feels like a luxury for later.

Unclear boundaries

Where do you draw the boundaries? Your prompts are both code and data. Nothing is familiar.

So engineers do what works in the moment. Inline prompts. Copy-paste with tweaks. One-off solutions that become permanent.

But 6 months later, they are drowning in the complexity of their half-baked abstractions.

DSPy forces you to think about these abstractions upfront. That's why the learning curve feels steep. The alternative is discovering the patterns through pain.


Option 1: Use DSPy

Accept the learning curve. Read the docs. Build a few toy projects until the abstractions click. Then use it for real work.

Option 2: Steal the ideas

Don't use DSPy, but build with its patterns from day one. See below.

If you're stealing the ideas, build with these patterns:

Typed I/O for every LLM call. Use Pydantic. Define what goes in and out.

Separate prompts from code. Forces you to think about prompts as distinct things.

Composable units. Every LLM call should be testable, mockable, chainable.

Eval infrastructure early. Day one. How will you know if a change helped?

Abstract model calls. Make swapping GPT-4 for Claude a one-line change.
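The last point, abstracting model calls, can be as small as a provider registry. A hedged sketch with made-up provider names, not a prescribed implementation:

```python
from typing import Callable

# Map provider names to call functions. Real entries would wrap OpenAI,
# Anthropic, etc.; the 'mock' provider here is purely illustrative.
PROVIDERS: dict[str, Callable[[str], str]] = {}

def register(name: str) -> Callable:
    def decorator(fn: Callable[[str], str]) -> Callable[[str], str]:
        PROVIDERS[name] = fn
        return fn
    return decorator

@register("mock")
def mock_provider(prompt: str) -> str:
    return "Acme Corp"

def call_llm(prompt: str, provider: str = "mock") -> str:
    # Swapping models is now a one-line (or config) change.
    return PROVIDERS[provider](prompt)
```

Application code only ever calls `call_llm`, so trying a new model means registering one new function.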


DSPy has adoption problems because it asks you to think differently before you’ve actually felt the pain of thinking the same way everyone else does.

The patterns DSPy embodies aren’t optional. If your AI system gets complex enough, you will reinvent them. The only question is whether you do it deliberately or accidentally.

You don’t have to use DSPy. But you should build like someone who understands why it exists.


If this resonated, let’s continue the conversation on X or LinkedIn!
