Optimizing Tool Selection for LLM Workflows with Differentiable Programming

Original link: https://viksit.substack.com/p/optimizing-tool-selection-for-llm

Current agentic architectures rely on chaining LLM calls for tool selection, which is easy to prototype but scales poorly due to latency, cost, and context inflation: in effect, GPT-4 is paid to handle control flow with no reuse or efficiency gains at scale. The proposed alternative is a trainable, differentiable controller for tool selection that runs outside the LLM. This network learns tool selection from data via supervised fine-tuning or reinforcement learning and brings architectural benefits such as local execution, determinism, composability, and control: it avoids external API calls and stochastic sampling while integrating natively with PyTorch/DSPy pipelines. It also eliminates context inflation, reducing token overhead, truncation risk, and attention dilution. The approach separates declarative control logic from generative inference, letting LLMs focus on generation while lightweight neural modules handle routing, orchestration, and control. The resulting modular architecture reduces cost and improves performance by avoiding transformer inference for classical programming constructs, yielding a more scalable and inspectable system and marking a shift from prompt chains to programs.

The Hacker News discussion centers on a blog post about using differentiable programming to optimize tool selection in large language model (LLM) workflows. The author, viksit, experiments with a learnable local router to reduce token overhead and cost, and integrates it into a DSPy pipeline. Main discussion points include: the potential advantages of smaller, specialized models for tool selection; the limitations of current LLM-centric tool use, which typically relies on few-shot prompting and cannot update model priors via backpropagation; the challenge of choosing the right tool arguments; and the possibility of learning both tool routing and prompts for end-to-end pipeline optimization. Some commenters suggest alternative approaches, such as fine-tuning smaller language models or using hard-coded logic for deterministic tasks rather than relying solely on an LLM. The author plans to address the feedback by comparing different tool-selection approaches, including LLM selection versus RNN and encoder models. The discussion highlights the trade-off between using a general-purpose LLM for every task and combining it with more specialized, trainable components for better control and efficiency.

Original article

Modern agentic architectures rely heavily on chaining LLM calls. A typical pattern looks like:

  1. Use an LLM to decide which tool to invoke

  2. Call the tool (e.g. search, calculator, API)

  3. Use another LLM call to interpret the result and generate a final response

This structure is easy to reason about, simple to prototype, and generalizes well.

But it scales poorly.

Each LLM call incurs latency, cost, and token overhead. More subtly, it compounds context: every step includes not only the original query, but intermediate outputs and scratchpad logic from earlier prompts. This creates a growing burden on both inference and model performance.

The consequence is that most agent stacks are paying GPT-4 to do what amounts to classical control flow — tool selection — with no reuse, no abstraction, and no efficiency gains at scale.

Instead of using an LLM to route between tools, we can model the decision as a trainable function. A differentiable controller learns tool selection from data — typically via reinforcement or supervised fine-tuning — and runs entirely outside the LLM.

The benefits are architectural:

  • Local execution — avoids external API calls

  • Determinism — removes stochastic sampling from routing

  • Composability — integrates natively with PyTorch / DSPy pipelines

  • Control — tool choice is explainable and debuggable

A minimal example looks like this (PyTorch):
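
(A sketch; the vocabulary size, hidden width, and tool names below are illustrative assumptions rather than the post's exact values.)

```python
import torch
import torch.nn as nn

class ToolController(nn.Module):
    """Small routing network: token IDs in, distribution over tools out."""

    def __init__(self, vocab_size=10_000, embed_dim=64, hidden_dim=128, num_tools=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)   # layer 1: token embeddings
        self.fc1 = nn.Linear(embed_dim, hidden_dim)        # layer 2
        self.fc2 = nn.Linear(hidden_dim, hidden_dim)       # layer 3
        self.out = nn.Linear(hidden_dim, num_tools)        # layer 4: one logit per tool
        self.act = nn.ReLU()

    def forward(self, token_ids):                          # token_ids: (batch, seq_len)
        x = self.embed(token_ids).mean(dim=1)              # mean-pool over tokens
        x = self.act(self.fc1(x))
        x = self.act(self.fc2(x))
        return torch.softmax(self.out(x), dim=-1)          # distribution over tools

TOOLS = ["search", "calculator"]                           # illustrative tool set
router = ToolController(num_tools=len(TOOLS))
```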

This is a simple 4-layer network: input is tokenized text; output is a softmax distribution over tools. Because it’s differentiable, you can backpropagate from downstream task reward and improve the router over time.
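
The post doesn't spell out the reward-driven variant; one common option is a REINFORCE-style update, sketched below as a continuation of the controller above. `run_tool_and_score` is a hypothetical stand-in for executing the chosen tool and scoring the downstream answer.

```python
def run_tool_and_score(tool_index: int) -> float:
    """Hypothetical stand-in: run the chosen tool and score the final answer."""
    return 1.0 if tool_index == 1 else 0.0        # pretend the calculator was the right call

token_ids = torch.randint(0, 10_000, (1, 12))     # a dummy tokenized query
probs = router(token_ids)
dist = torch.distributions.Categorical(probs=probs)
action = dist.sample()                            # sample a tool from the router's distribution
reward = run_tool_and_score(action.item())
loss = -(dist.log_prob(action) * reward).mean()   # REINFORCE-style objective
loss.backward()                                   # gradients flow back into the router weights
```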

We can either get data from existing logs, or use GPT to create a synthetic dataset. (This is a one-time cost per tool controller, versus paying for LLM routing calls on every request in production.)
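
A sketch of what the labeled data might look like, together with a toy tokenizer so the examples can be fed to the controller above; the hash-based tokenizer and the example queries are illustrative assumptions.

```python
import torch

# Toy labeled examples: in practice these would come from production logs
# or a one-off GPT labeling pass over representative queries.
EXAMPLES = [
    ("what is 12.5% of 243?",            "calculator"),
    ("latest news on the EU AI Act",     "search"),
    ("convert 98 fahrenheit to celsius", "calculator"),
    ("who won the 2022 world cup?",      "search"),
]

def tokenize(text, vocab_size=10_000, max_len=16):
    """Toy hash-based tokenizer (a stand-in for a real tokenizer)."""
    ids = [hash(tok) % vocab_size for tok in text.lower().split()][:max_len]
    ids += [0] * (max_len - len(ids))              # pad to a fixed length
    return torch.tensor(ids)

X = torch.stack([tokenize(q) for q, _ in EXAMPLES])      # (N, max_len)
y = torch.tensor([TOOLS.index(t) for _, t in EXAMPLES])  # (N,)
```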

We train this controller with a standard Adam optimizer.
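
A minimal training loop over the toy dataset above; the learning rate and epoch count are arbitrary choices.

```python
import torch.nn.functional as F

optimizer = torch.optim.Adam(router.parameters(), lr=1e-3)

for epoch in range(200):
    optimizer.zero_grad()
    probs = router(X)                               # (N, num_tools) softmax output
    loss = F.nll_loss(torch.log(probs + 1e-9), y)   # cross-entropy on the probabilities
    loss.backward()
    optimizer.step()

print(f"final training loss: {loss.item():.4f}")
```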

And finally, the demo!
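
A sketch of the demo, reusing the toy tokenizer and tool list from above; the example queries are illustrative.

```python
def route(query: str) -> str:
    with torch.no_grad():
        probs = router(tokenize(query).unsqueeze(0))   # add a batch dimension
    return TOOLS[int(probs.argmax(dim=-1))]

print(route("what is 17 * 34?"))        # expected: calculator
print(route("openai latest funding"))   # expected: search
```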

For completeness, this is how we’d do it via an LLM directly.
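
A sketch of the LLM-routed baseline using the OpenAI Python client; the model name, prompt wording, and one-word answer format are assumptions, not the post's exact code.

```python
from openai import OpenAI

client = OpenAI()   # assumes OPENAI_API_KEY is set in the environment

def route_with_llm(query: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4.1",
        messages=[
            {"role": "system",
             "content": "Pick the best tool for the user's query. "
                        "Answer with exactly one word: search or calculator."},
            {"role": "user", "content": query},
        ],
        temperature=0,
    )
    return resp.choices[0].message.content.strip().lower()

print(route_with_llm("what is 17 * 34?"))
```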

And as a bonus, here’s how you would integrate it into a DSPy Pipeline.
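
A sketch of wrapping the trained router in a DSPy module, assuming a `tools` dict mapping tool names to local callables and reusing the `route` helper from the demo above.

```python
import dspy

# Assumes an LM has been configured, e.g. dspy.configure(lm=dspy.LM("openai/gpt-4.1")).

class AnswerWithTool(dspy.Signature):
    """Answer the question using the provided tool output."""
    question: str = dspy.InputField()
    tool_output: str = dspy.InputField()
    answer: str = dspy.OutputField()

class RoutedAgent(dspy.Module):
    def __init__(self, tools):
        super().__init__()
        self.tools = tools                            # e.g. {"search": fn, "calculator": fn}
        self.answer = dspy.Predict(AnswerWithTool)

    def forward(self, question):
        tool_name = route(question)                   # the trained PyTorch router, not an LLM
        tool_output = self.tools[tool_name](question) # run the selected tool locally
        return self.answer(question=question, tool_output=tool_output)
```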

Prompt-based planners incur a hidden penalty: context inflation.

Each new prompt must reintroduce the full conversation history, prior decisions, and any scratch output. The result is rapid, superlinear growth in redundant tokens, particularly in multi-hop workflows.

This leads to:

  • Token tax — redundant tokens sent repeatedly

  • Truncation risk — long contexts hit model limits earlier

  • Attention dilution — more tokens competing for limited compute

  • Leakage — planner logic unintentionally affects final output

By contrast, a differentiable router operates entirely out-of-band. The only input to the final LLM call is the original query and the selected tool’s result. Context length is constant regardless of tool depth.

This architectural separation restores clarity to the final model call — reducing hallucinations, improving determinism, and reclaiming inference capacity for core reasoning.

The shift to differentiable routing mirrors a broader trend:

Separating declarative control logic from generative inference.

Current agentic systems blur this line. Tool selection is handled in the same modality — and often the same model — as natural language generation. This creates coupling where there should be composition.

Differentiable programming is one way to decouple the two:

  • LLMs focus on generation and synthesis

  • Lightweight neural modules handle routing, orchestration, and control

The result is a more modular, inspectable, and scalable architecture — one that avoids paying transformer inference costs for classical programming constructs.

To drive this home, let's consider a planner that routes queries between a search API and a calculator tool. Each query invokes an LLM call to choose the tool, the tool itself, and a second LLM call to interpret the result and produce the final answer.

At GPT-4.1 prices, with roughly 75 input and 75 output tokens per call, the arithmetic works out as in the sketch below.
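
A back-of-the-envelope sketch, assuming GPT-4.1 list prices of about $2 per million input tokens and $8 per million output tokens, and assuming the LLM-routed version makes three LLM calls per query (one routing call per tool plus a final synthesis call) while the differentiable router leaves only the final synthesis call.

```python
PRICE_IN  = 2.00 / 1_000_000    # assumed $/input token (GPT-4.1 list price)
PRICE_OUT = 8.00 / 1_000_000    # assumed $/output token
TOKENS_IN, TOKENS_OUT = 75, 75  # per-call token counts from the post

cost_per_call = TOKENS_IN * PRICE_IN + TOKENS_OUT * PRICE_OUT

llm_routed  = 3 * cost_per_call   # routing call per tool + final synthesis call (assumed)
diff_routed = 1 * cost_per_call   # only the final synthesis call hits the LLM

print(f"LLM-routed:            ${llm_routed:.5f} per query")
print(f"Differentiable router: ${diff_routed:.5f} per query")
print(f"Reduction:             {llm_routed / diff_routed:.1f}x")
```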

A 3× reduction in cost per run — with larger savings as tool chains grow in complexity.

In early-stage workflows, LLM routing is fast to implement and flexible to extend. But at scale, it’s structurally inefficient — economically and architecturally.

Differentiable controllers offer an excellent alternative. They reduce cost, improve performance, and clarify model behavior. They mark a step toward LLM systems that look less like prompt chains — and more like programs.
