Types and Neural Networks

原始链接: https://www.brunogavranovic.com/posts/2026-04-20-types-and-neural-networks.html

## The Future of Typed LLM Code Generation

Current large language models (LLMs) excel at generating code, but treat it as a sequence of tokens, separating training from typechecking. Post-training methods for ensuring valid code, such as retry loops or constrained decoding, have made some progress (for instance, higher scores on coding benchmarks), but are ultimately inefficient because they never update the model's core understanding. The key is to *train* LLMs to produce typed output natively. This requires differentiating through the type system, which is a challenge because types are discrete. Recent work points to a solution: rather than fixing the choices the type system imposes, *learn* the partitioning process itself. This means using differentiable maps to sample outputs from learned probabilities, guaranteeing well-typed code by construction. Inspired by techniques such as AlphaZero in chess, this lets a model learn the underlying structure of a programming language rather than merely obeying its rules after generation. The benefit grows with the complexity of the language: the richer its structure, the larger the potential improvement. Ultimately, this shifts the focus from *imposing* meaning on tokens to building an output space that inherently *carries* meaning, paving the way for more powerful and reliable code generation.


Original post

Posted on April 20, 2026

[This is cross-posted to the GLAIVE blog]

Neural networks are used to generate an increasing share of code in languages that enable highly generic and provably correct programming: Idris, Lean, and Agda, for example.

However, most frontier models generating the code – Large Language Models – separate the process of training from the process of typechecking. They are trained to produce output of a fixed type: List Token. To get valid code, that output is then parsed after training, in multiple ad-hoc ways, into the types of a particular language.

What are these ad-hoc ways? Do they work, and should we expect them to? And more importantly, can we rebuild LLMs from the ground up so that they’re trained to produce typed output?

Enforcing types after training

Today, LLMs are trained to predict the next token in a given training corpus, resulting in functions of type

LLM : List Token -> D (List Token)

where D (List Token) here denotes a distribution over a list of tokens. LLMs consume an input prompt and produce this distribution one token at a time – sampling from the next-token distribution and feeding the result back in. For the purposes of this blog post, we’ll consider Token to simply be Maybe Char, i.e. either a character the model is able to generate, or a STOP token which signals the termination of the generative process.
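As a concrete (if toy) picture of this generative loop, here is a sketch in Python, where Token = Maybe Char is rendered as "a character, or None for STOP". The `next_token_dist` function is a dummy stand-in for the trained network, not anything from the post:

```python
import random

STOP = None  # Token = Maybe Char: a character the model can emit, or STOP

def next_token_dist(prefix):
    # Dummy stand-in for the trained LLM: after three characters it puts
    # all probability mass on STOP.
    if len(prefix) >= 3:
        return {STOP: 1.0}
    return {"a": 0.5, "b": 0.4, STOP: 0.1}

def generate(prompt, rng):
    out = list(prompt)
    while True:
        dist = next_token_dist(out)
        tok = rng.choices(list(dist), weights=list(dist.values()))[0]
        if tok is STOP:      # STOP signals termination of the process
            return "".join(out)
        out.append(tok)      # otherwise, feed the sampled token back in
```

Nothing here knows anything about types: the output is an arbitrary string, and everything in the next section happens around this loop.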

The approaches in this section all take this trained network as fixed. They modify only the inference procedure, and fall along two axes:

  1. Granularity: how often is the typechecker consulted? When the model emits a STOP token, or after every token?
  2. Bandwidth: what does the typechecker communicate back? A structured error message, or merely a binary accept/reject signal?

These approaches sit at opposite extremes along these axes, and neither fundamentally solves typed generation, but we describe them here because they help us understand where the issue is.

1. Try; Compile; If Error Repeat

This is what most programmers do. They type raw text into the editor; the compiler either processes it into structured data, or returns an error the programmer has to internalise before resubmitting.

The same loop can be performed with LLMs. Given the task of producing a value of type List Int, the model generates a candidate – say, “[ 1 , 2 , 3 ] STOP” – and the controller hands it to a typechecker. If it passes, we’re done. If not, we feed the error back and ask the model to try again. Along our two axes, this approach has low granularity (the typechecker is consulted only after the STOP token is produced) and high bandwidth (errors are structured messages the model can reason about).
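The controller just described can be sketched as follows. Both pieces below are toy stand-ins (a real implementation would call an actual LLM and an actual compiler): the checker accepts exactly the bracketed integer lists of the List Int example, and the model is any callable from prompt to candidate:

```python
def typecheck(code):
    """Toy checker for 'a value of type List Int': integers in brackets.
    Returns None on success, or a structured error message (the high-
    bandwidth signal) on failure."""
    body = code.removesuffix(" STOP").strip()
    if not (body.startswith("[") and body.endswith("]")):
        return "error: expected a list literal"
    inner = body[1:-1].strip()
    items = [s.strip() for s in inner.split(",")] if inner else []
    if all(it.lstrip("-").isdigit() for it in items):
        return None
    return "error: list elements must be integers"

def generate_with_retries(model, prompt, max_tries=5):
    feedback = ""
    for _ in range(max_tries):
        candidate = model(prompt + feedback)  # full generation, THEN check
        err = typecheck(candidate)
        if err is None:
            return candidate
        feedback = "\n" + err                 # error fed back into the prompt
    raise RuntimeError("no well-typed candidate found")
```

Note the low granularity: the checker runs only on complete candidates, so a doomed prefix is discovered only after the whole sequence has been generated.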

But there is a problem. Suppose the task is instead to produce a value of Either Char Double, and the model begins with “[ 1 , 2” – already unrecoverable, since no value of this type starts with [. Nothing notices until the full generation completes! At that point the entire sequence is rejected and the model starts over. This is incredibly wasteful.

It can be remedied by tightening the loop – generating fewer tokens before each check – and in the limit this is what experienced dependently-typed programmers do: make a small move, see the compiler’s response, make the next move. But tightening only addresses granularity. The LLM, unlike the programmer, does not carry what it learns across sessions: nothing in this approach updates the weights, meaning that the next session starts from the same state as the last.

2. Constrained decoding

A different approach is to consult the typechecker before each token is sampled, and mask out any tokens that would lead to an ill-typed result.

For instance, if the model is generating a value of type List Int and has already produced the tokens “[ 1 , 2 , 3”, we set the probability of alphabetic characters to zero and sample from what remains. Although type inference is in general undecidable, this approach is well known, and has been applied to JSON schemas and to restricted fragments of various type systems. Along our axes, this sits at the opposite corner: maximum granularity (every token), and minimum bandwidth (one bit per token). Here, output is guaranteed to typecheck! However, there is a big problem: this only prevents the model from saying certain things; it does not change what the model wants to say.
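A minimal sketch of this masking step, assuming a toy `can_extend` oracle in place of a real typechecker (in reality this verdict comes from the language's checker, not a one-line predicate):

```python
def can_extend(prefix, tok):
    """Toy one-bit oracle: while generating a List Int, alphabetic
    characters can never lead to a well-typed result."""
    return not tok.isalpha()

def masked_dist(dist, prefix):
    # Zero out forbidden tokens, then renormalise what remains.
    masked = {t: p for t, p in dist.items() if can_extend(prefix, t)}
    if not masked:
        raise RuntimeError("no well-typed continuation exists")
    total = sum(masked.values())
    return {t: p / total for t, p in masked.items()}

# The model wants to say "a" (70% of its mass), but the mask forces its
# probability onto tokens the model considers unlikely:
dist = {"a": 0.7, "4": 0.2, "]": 0.1}
constrained = masked_dist(dist, "[ 1 , 2 , 3")
```

After masking, `"4"` and `"]"` carry all the probability, in proportion 2:1, even though together they held only 30% of the model's original mass: exactly the failure mode described next.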

To see this, consider the case where the model has learned to assign high probability to alphabetic characters in our example. Masking them out forces it to sample tokens it considers unlikely. At each step, the mask checks only whether a token can lead to a well-typed output, without accounting for how much of the model’s probability mass remains on each branch. This can trap generation into branches where only low-probability completions exist, producing output that typechecks but is increasingly nonsensical.

As with the previous approach, no gradient flows through the mask, so the model cannot learn to adapt and assign higher probability to densely populated branches: the weights are never updated.

Learning types during training

The above approaches, as strange as it may seem, work. Frontier models using retry loops are making genuine progress: FrontierMath scores climbed from under 2% upon release to almost 50% in less than two years, and so did the scores for ARC-AGI-2 and SWE-bench. In chess, GPT-4 reaches roughly 1371 ELO (intermediate player) as a byproduct of general training.

These are significant results. But in domains where we know how to train models that utilise structure during training, we see even more dramatic performance improvements. Chess is one of those. The system AlphaZero utilises the game structure during training and reaches superhuman ELO (>3400) with roughly 30,000x fewer parameters than GPT-4 (<60 million vs 1.8 trillion). And AlphaZero never makes an illegal move, whereas GPT-4 makes one in 30% of games.

What about dependently-typed programming languages? These are an incredibly rich domain, and one can wonder what happens if we tightly integrate their compilers into neural network training. Unfortunately, this hasn’t happened yet: we don’t really know how to do that. Most current models generating typed code are playing the game without being told its rules. What is surprising is how good they are.

If chess is any indication of what happens to performance when you train using structure, encoding the rules of a language is worth figuring out. The only question is: how do you differentiate through type systems? How do you learn to produce a typed output, given that types are discrete, non-differentiable objects?

Differentiating through structure

To understand differentiation, let’s start with a simpler question. How do you write a function of the following type in Haskell, modelling a simple choice of output structure: a coproduct?

f :: (x, p) -> Either a b
f = ?

If you try to write it down, no matter how creative you get, it will end up having the shape

f (x, p) = if c (x, p) then Left (f_l (x, p))
                       else Right (f_r (x, p))

for some predicate c :: (x, p) -> Bool and two maps f_l :: (x, p) -> a, f_r :: (x, p) -> b. There is no other option. To produce a Left or a Right, you must commit to one or the other by partitioning the input via c. When it returns True you go left, otherwise you go right.

This decomposition is not an accident of Haskell – it is one of the properties of the category \(\mathbf{Set}\), which is an extensive category. And this property is central to research on differentiating maps into coproducts, most notably CHAD and Higher Order AD of Higher Order Functions. They use this theorem to show us how to build a network whose output is a structured type at train time – seemingly without retries or constrained decoding.

But as we’ve seen above – these approaches require us, the programmer, to partition the input space, instead of having the neural network automatically learn it. This means that, given a network \(f : X \times P \to A + B\), if your model starts out guessing the wrong branch \(A\), it can never fix it through gradient updates, because it was never able to learn it to begin with.

The same story applies to autodiff through other structures, such as pairs, inductive types, and function objects. All of these relinquish the model’s ability to learn the structure. This is an important perspective, but one that’s overlooked in the aforementioned literature. Of course, this is justified since their focus is autodiff, and not learning. But I have not seen this distinction acknowledged, and it deserves to be said explicitly.

Often, we don’t want just a method for propagating gradients through a known program, instead we want the ability to learn the program.

Differentiating with respect to structure

There is a different way we can go about learning a program \(f : X \times P \to A + B\). Instead of fixing a partitioning, what if we learn it? This is coincidentally the answer to our question of how to backpropagate gradients while making discrete choices: we never differentiate through discrete choices to begin with. Instead, we collect evidence for each discrete choice, normalise it, and then sample to decide what comes next.

That is, we use three maps:

  1. A differentiable map \(f_c : X \times P \to D(\text{Bool})\)
  2. A differentiable map \(f_A : X \times P \to A\)
  3. A differentiable map \(f_B : X \times P \to B\)

Then, to produce an output of type \(A + B\) given some \((x, p) : X \times P\), we apply \(f_c\) to \((x, p)\), and then sample from the resulting distribution. The resulting boolean tells us whether to use \(f_A\) or \(f_B\). Another way to say this: \(f\) is now a map of type \(X \times P \to D(A + B)\), producing a distribution over both the choice of output type and its value.

The question is now: how do we differentiate this, and train it using supervised learning? Perhaps surprisingly, this is straightforward. We differentiate only through the branch taken, and use the cross-entropy loss to compare the produced branch distribution with the actual one. This means that, if we are prepared to provide these three neural networks, we obtain an effectful neural network of type \(f : X \times P \to A + B\).

This approach produces well-typed output by construction. And unlike the previous approach, it allows the network to learn the right output type. It also directly subsumes the previous approach: setting the sampling temperature to t=0 recovers argmax, i.e. differentiation through structure.
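To make the training story concrete, here is a scalar sketch under heavy simplifying assumptions (all ours, for illustration): each of the three maps \(f_c, f_A, f_B\) is a single weight, supervision provides both the target branch and its value, and the gradients are written out by hand:

```python
import math

# Parameters of f_c (branch logit), f_A and f_B (the two payload maps).
w_c, w_a, w_b = 0.1, 0.0, 0.0

def forward(x):
    p_left = 1.0 / (1.0 + math.exp(-w_c * x))  # f_c : X -> D(Bool)
    return p_left, w_a * x, w_b * x            # f_A : X -> A, f_B : X -> B

def loss(x, branch, value):
    p_left, a, b = forward(x)
    # Cross-entropy on the branch choice, plus squared error on the payload
    # of the *target* branch; the untaken branch contributes nothing.
    ce = -math.log(p_left if branch == "L" else 1.0 - p_left)
    payload = a if branch == "L" else b
    return ce + 0.5 * (payload - value) ** 2

def step(x, branch, value, lr=0.1):
    global w_c, w_a, w_b
    p_left, a, b = forward(x)
    # d(ce)/d(logit) = p_left - 1{branch = L}, with logit = w_c * x.
    w_c -= lr * (p_left - (1.0 if branch == "L" else 0.0)) * x
    if branch == "L":                  # gradient flows only through the
        w_a -= lr * (a - value) * x    # branch named by the supervision
    else:
        w_b -= lr * (b - value) * x

# One supervised example: for input 2.0 the answer is Left 3.0.
x, branch, value = 2.0, "L", 3.0
before = loss(x, branch, value)
for _ in range(200):
    step(x, branch, value)
after = loss(x, branch, value)
```

After training, the model assigns high probability to the Left branch *and* has learned its payload: both the choice of output type and its value were learnable, which is exactly what fixing the partition by hand forbids.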

Even more, this coproduct can be thought of as an instance of a dependent pair indexed by a finite set. This generalises in a number of different ways, and can be iterated to produce an array of typed structures, all in a completely principled way! As we’ll see (in future posts) – the thing that enables this is the theory of containers, and the only way to proceed with this is to build the requisite typed infrastructure!

It’s incredibly exciting.

Of course, this idea has to some extent been studied in the literature. Crudely put, teacher forcing is an untyped variant of this. Abstract Syntax Networks, which appeared in 2017, also describe this, but without (dependent) types. Just like AlphaZero, ASNs predate the LLM era, but have not been scaled further. Perhaps that was due to the lack of prominence of dependently typed languages, where this kind of approach pays off. Who knows?

And of course, AlphaZero and AlphaProof are worth mentioning. They use constrained decoding, but at train time, updating only the chosen branch via reinforcement learning. In AlphaZero, this reached superhuman performance on Chess, Go, and Shogi. But when it comes to AlphaProof, the situation is different. Even though the reinforcement learning approach is similar, the output space of all the models is still just List Token. This means that the types shape which tactics receive reinforcement, but the gradients still flow through the coarse typing of tokens, completely decoupled from the types of the underlying programming language.

Final words

LLMs operate on a flat token vocabulary. Every structural technique we apply around them – such as retries or constrained decoding – is an attempt to impose meaning on outputs that were generated without any notion of meaning. What differentiating with respect to structure proposes is that the output space itself can carry that meaning. A token is no longer an opaque character or subword; it is tied to the meaning of the domain in which we're generating output. This means that the more structured our programming language is, the bigger the benefit.

It might appear that this is an argument against scale, and the Bitter Lesson. That is not the case. I see this as a move that lets scale do its work on the right object. As with chess, where encoding the game rules into training produces a leap that no amount of inference-time search can today match, the move here is to encode the programming language itself into the training, and apply scale on a structure that actually reflects what we’re trying to produce.

Lastly, the distinction between differentiating through structure and with respect to structure is essentially the distinction between treating type-level choices as fixed scaffolding versus treating them as something the network can learn. Both produce well-typed output, but only the latter enables the particulars of that output to be learned. The mathematical language for describing these typed action spaces uniformly – for sums, products, inductives, and dependent types alike – ends up being the language of containers and dependent lenses.

That is where this line of work gets interesting.

Thanks to Jules Hedges and Andre Videla for providing feedback on a draft of this post.
