LLMs understand nullability

Original link: https://dmodel.ai/nullability-gentle/

Large language models (LLMs) are increasingly used for code generation, but how well do they “understand” properties of code, such as nullability (whether a variable may be null)? Researchers investigated this by probing an LLM’s internal activations as it writes Python code. They built a “nullability probe” that analyzes the model’s internal state to determine whether it treats a variable as potentially null. Experiments show that LLMs do form an internal concept of nullability during pretraining and learn to apply typing rules, and that larger models with more parameters learn faster and perform better overall. The study also found that complex rules and interprocedural analysis (reasoning across multiple functions) remain challenging for LLMs. Visualizations of the probe’s output show that the model can track nullability dynamically within a function, understanding that a variable becomes non-null after a conditional check. The work sheds light on how LLMs internally represent basic programming concepts, paving the way for a deeper understanding of their code-generation abilities.

A Hacker News thread discusses an article claiming that large language models (LLMs) exhibit an understanding of nullability. The original poster found the visualization of this understanding fascinating and suggested combining nullability probing with other Python type-checking tools to improve accuracy, focusing on interfaces like nullability rather than exact types, given Python’s duck typing. One commenter argued that this is like saying a night light “understands darkness”: syntactically correct, but semantically wrong. Another commenter acknowledged a caveat in the original post, noting that its intent was not to argue whether LLMs “understand” in a sentient sense, but to explore the extent to which they understand nullability, insofar as they can understand anything. A final commenter simply said it was “very cool.”

Original article
Inside the CodeBot: A Gentle Introduction to How LLMs Understand Nullability

A line drawing of a robot with a brain on their antenna

The last five years have shown us that large language models, like ChatGPT, Claude, and DeepSeek, can write code in many domains, to huge excitement: many claim to be using these models to write entire web servers and apps from scratch. These tools have opened up programming to a whole new class of people who consider themselves non-technical.

A gif of github copilot completing a phone number validating function as the user types

But there are still many unanswered questions that someone trying to understand or even use these tools might have. For example, how often, and in what situations, can LLMs write correct code entirely on their own? And, maybe more importantly, but harder to answer: Do LLMs “understand” the code they are writing?

Understanding is a tricky concept to measure. Some would argue that sentience precedes understanding, and so that LLMs can’t have understanding, because they aren’t biological organisms with sentience. But they certainly have something akin to “thought processes”: a series of internal representations that determine their final outputs. Recently, it’s become possible to study these processes more deeply, measuring internal “beliefs” of the model as it thinks. This gives us a powerful tool for determining what kinds of problems LLMs falter on, when they’ll succeed, and when they are “thinking through” problems more fully versus just guessing at a solution.

So far, these techniques for measuring internal model state have mostly been applied to chatbots writing text for human consumption, using what we call “natural language” (to be contrasted with “programming languages”). This makes sense, since some of the most critical LLM tasks involve chatting with a user, and some of the most interesting concepts to measure, such as honesty or power-seeking, apply most readily to these conversations. But it’s hard to say quantitative or precise things about natural language concepts, so our ability to rigorously study internal representations is limited.

A diagram from Zou et al showing probes that read hallucination, honesty, morality, and power-seeking from the outputs of a chatbot.

Code, on the other hand, is another matter. Humans have been studying properties of code for a long time, and there are many abstract properties of a given program that can now be determined using static analysis. If we pick the right properties, we don’t need to worry about our ability to label data— static analysis can do that for us, and so we can easily scale up and train on thousands of examples generated from scratch.

In that spirit, we wanted to start with a simple property that comes up in nearly every programming language: nullability. A variable is said to be of nullable type when it can possibly take on a null value. Null values are represented differently across languages— as null pointers in C or C++, with explicit Option types in Rust, and with special nil or None values in dynamic languages like Javascript, Lisp, or Python. In every case, understanding where values can be null is necessary for writing even basic code, and misunderstanding where they are null can often be a source of bugs.
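In Python, for example, the distinction looks like this (a small illustration of ours, not a sample from the dataset):

```python
from typing import Optional

def find_user(user_id: int) -> Optional[str]:
    """Return a username, or None if the id is unknown (a nullable result)."""
    users = {1: "alice", 2: "bob"}
    return users.get(user_id)  # dict.get returns None for a missing key

name = find_user(3)            # name is nullable here: it may be None
# name.upper()                 # would crash with AttributeError when name is None

if name is not None:
    print(name.upper())        # safe: name is non-nullable inside this branch
```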

Do our models understand when a variable is of nullable type? They must, in order to be able to write code that deals with null values, but we haven’t known what form this understanding takes, or what situations are likely to confuse the model, until now.

A robot with glowing eyes


Before we get into the nitty-gritty details, let’s take a step back. To set up this work, we’ll first start with a simple example, similar to our dataset, which illustrates the task of inferring nullability with concrete code. Then, we can run experiments to answer the question: in what situations are models good at reasoning about nullability? Next, we’ll introduce techniques that have been used to probe the internals of a model for different concepts. Finally, we’ll put it all together into a “nullability probe”, a tool for answering the question: Given a location in the program, does the model think that the variable there could be null?

A Simple Example

Let’s say you’re writing a Python program with your LLM assistant. You’ve reached a point at which you need to do something with a variable called num. Maybe you’re building a list of numbers called positive_nums. How do you proceed?

The answer often depends on the context in which you’re working. If num and positive_nums are the only things in scope, then you might guess that you should write the lines:
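(The completion in the original post isn’t reproduced in this text, so here is our own sketch of the two plausible answers.)

```python
# (Continuing the scenario above, where num and positive_nums are already in scope.)

# If num is known to be a plain int, the obvious completion is:
if num > 0:
    positive_nums.append(num)

# But if num came from something that can return None (say, an Optional[int]),
# a correct completion has to rule out the null case first:
if num is not None and num > 0:
    positive_nums.append(num)
```

Which of the two is right depends on whether num can be None at this point, and that is exactly the kind of judgment we want to probe the model for.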

We ran our experiments on thousands of programs like this one; the full details are in the technical post, so we’re just going to show the highlights here.

Impact of Variable Names and Arbitrary Constants

For programs involving lists and for loops, variable names and constant values heavily influence how able a model is to complete these programs correctly.
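To make the contrast concrete, here is the flavor of variation we mean (these two fragments are illustrative, not drawn from the actual dataset):

```python
# Descriptive names and a natural constant.
numbers = [3, -1, 4, -1, 5]
positive_nums = []
for num in numbers:
    if num > 0:
        positive_nums.append(num)

# The same structure with arbitrary names and an arbitrary constant; the post
# reports that surface changes like these heavily affect completion accuracy.
v17 = [3, -1, 4, -1, 5]
q3 = []
for zz in v17:
    if zz > 41:
        q3.append(zz)
```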

We use the repeng library to extract states from the models we’re testing. That library captures the contents of a part of the model called the “residual stream” after every layer. But if you don’t want to sweat the details, you can just think of it as a numerical snapshot of the model, organized in terms of snapshots of each layer.
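The post uses repeng for this step; as a rough sketch of the same idea, here is how per-layer residual-stream snapshots can be captured with the Hugging Face transformers API directly (the model choice and code string are ours):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# output_hidden_states exposes the residual stream after the embedding layer
# and after every transformer block.
model_name = "EleutherAI/pythia-410m"   # illustrative choice
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

code = "def f(d, k):\n    x = d.get(k)\n    return x"
inputs = tokenizer(code, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# hidden_states is a tuple of (num_layers + 1) tensors,
# each of shape (batch, seq_len, hidden_dim).
snapshots = torch.stack(outputs.hidden_states)
print(snapshots.shape)  # (layers + 1, 1, seq_len, hidden_dim)
```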

A diagram of the residual stream of a transformer being measured after each layer

Analyzing the Data and Building the Probe

Now that we have these model snapshots, labeled with either “nullable” or “non-nullable”, we can start to build a probe. The goal of the probe is to be able to tell us, at any given point, whether the model thinks the token it just generated is more likely to be a nullable variable or a non-nullable variable.

There’s a lot of flexibility in what form this probe could take. In theory, you could use anything to look at the model’s activations and make a prediction, even a neural network, or another transformer model. You could even say your “probe” is a static analysis which computes nullability from the program’s syntax, as represented in the model!

We want to make sure we’re not doing that, and are only extracting the most “plain” representation of nullability that we can from the model. So we’re going to make the assumption that nullability is represented “linearly” somewhere in the model. Algebraically: the model represents the amount of “nullability” at a given token as a linear function of (some subset of) the activations, where each activation is given a weight and summed, like so:

\text{Nullability}(x) = C + w_0 x_0 + w_1 x_1 + w_2 x_2 + \dots
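In code, scoring a single activation vector with such a linear probe is just a weighted sum plus a constant (a minimal sketch; in practice the weights come from fitting the probe):

```python
import numpy as np

def nullability_score(activations: np.ndarray,
                      weights: np.ndarray,
                      bias: float) -> float:
    """Nullability(x) = C + w . x: a weighted sum of activations plus a constant."""
    return bias + float(weights @ activations)
```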

Next, geometrically: if the model activations form a “space”, then we want to look for a “direction” in this space which represents nullability.

A diagram showing nullability represented as a direction in space

There are different ways we can compute a “direction” of nullability. The simplest is just to measure the difference between the average state when the model is thinking about nullable variables, and the average state when it’s thinking about non-nullable variables. This gives us a “direction” pointing from non-nullable to nullable in our space, which we can use to project any new state onto, to determine how “nullable” it is.

This technique is called “mass means shift”, because we’re taking the difference between the means (average values) of each “mass” of points. You can think of it as drawing a line from the center of the “non-nullable” cluster to the center of the “nullable” cluster.
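As a sketch, with stand-in arrays in place of real activations, the whole technique is only a few lines:

```python
import numpy as np

# Each row is one residual-stream activation vector at a token labeled by
# static analysis. (Random stand-in data, just to show the shapes.)
nullable_acts = np.random.randn(500, 2048)
non_nullable_acts = np.random.randn(500, 2048)

# Mass-means direction: from the non-nullable centroid to the nullable centroid.
direction = nullable_acts.mean(axis=0) - non_nullable_acts.mean(axis=0)

def nullability_projection(activation: np.ndarray) -> float:
    """Project a new activation onto the direction: larger means 'more nullable'."""
    return float(activation @ direction)

# A threshold (e.g. the projection of the midpoint between the two centroids)
# turns the projection into a classifier.
midpoint = (nullable_acts.mean(axis=0) + non_nullable_acts.mean(axis=0)) / 2
threshold = float(midpoint @ direction)
```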

A diagram showing two blobs of points, with a line connecting their centers

It might be surprising that this works, given that we know there are better ways to fit linear functions, like logistic regression. And in fact, we can easily see scenarios where this returns a direction that doesn’t split the training data as well as possible.

A diagram showing the difference between a mass means and linear regression classification

However, the method that splits the training data best doesn’t always generalize best to the test data. And it turns out that in high dimensions, at least within a single layer, mass means generalizes better than logistic regression.

This isn’t always the case across layers, though. In practice, we found that some of the layers in the model are better at representing nullability than others, and that there are some dependencies between layers that change the best direction on each layer. This makes sense, because the number of layers is relatively small with respect to the dimension of the residual stream, and so we have fewer dimensions to overfit. So, instead of using mass-means probing across all layers simultaneously, we do it for each individual layer. Then, we weight the contribution of individual layers to the final prediction using linear regression. We found this gave us better results for larger models, though for smaller models the simpler mass means approach worked better.
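Here is a sketch of that layer-wise scheme, with stand-in data and shapes chosen purely for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# acts: one activation vector per layer per example; labels: 1 = nullable.
num_examples, num_layers, hidden_dim = 1000, 24, 2048
acts = np.random.randn(num_examples, num_layers, hidden_dim)   # stand-in data
labels = np.random.randint(0, 2, size=num_examples)            # stand-in labels

# 1. One mass-means direction per layer.
directions = np.stack([
    acts[labels == 1, layer].mean(axis=0) - acts[labels == 0, layer].mean(axis=0)
    for layer in range(num_layers)
])  # shape (num_layers, hidden_dim)

# 2. Project every example onto its per-layer direction: one score per layer.
layer_scores = np.einsum("nld,ld->nl", acts, directions)  # (num_examples, num_layers)

# 3. Weight the layers' contributions to the final prediction with a regression
#    over the per-layer scores.
combiner = LinearRegression().fit(layer_scores, labels)
predicted_nullable = combiner.predict(layer_scores) > 0.5
```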

Visualizing Our Results

Now that we’ve built our probe, we can use it to visualize how the model “thinks” about nullability as it processes a program. Remember that reading diagram from earlier? Let’s look at it again and explain what it shows:

A diagram showing a simple program, and the probe’s nullability predictions for each variable load.

In this diagram, we’re showing a simple Python program with type annotations. Whenever a variable is read in the code (what we call a “variable load”), we’ve highlighted it in either green or red. Green means our probe detected that the model thinks this variable is not nullable, while red means the model thinks it is nullable.

The most interesting case is the variable result. When it first appears in the if statement, it’s highlighted in red because it comes from find_value, which returns an Optional[int]. But when it appears again in the print statement inside the if block, it’s highlighted in green! This shows that the model understands that inside the if result block, result can’t be None anymore.
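The exact program from the figure isn’t reproduced in this text, but a sketch with the same shape makes the behavior concrete:

```python
from typing import Optional

def find_value(values: list[int], target: int) -> Optional[int]:
    """Return the first value greater than target, or None if there isn't one."""
    for value in values:
        if value > target:
            return value
    return None

def main() -> None:
    result = find_value([1, 2, 3], 10)
    # Here `result` is nullable: find_value may have returned None.
    if result:
        # Inside this branch the check has ruled None out,
        # so `result` is non-nullable here.
        print(result + 1)
```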

How Does Understanding Develop During Training?

One of the most interesting things we found is how the model’s understanding of nullability develops over time during training. Using the checkpoints in the Pythia model suite, we can track how our probe’s performance improves as the model is pretrained for longer.
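The Pythia models publish their intermediate pretraining checkpoints as git revisions on the Hugging Face Hub, so sweeping over training time looks roughly like this (the model size and step list are illustrative choices, not the exact sweep from the post):

```python
from transformers import AutoModelForCausalLM

# Pythia checkpoints are exposed as revisions named "step<N>".
model_name = "EleutherAI/pythia-410m"
steps = [1000, 10000, 50000, 143000]  # 143000 is the final checkpoint

checkpoints = {}
for step in steps:
    checkpoints[step] = AutoModelForCausalLM.from_pretrained(
        model_name, revision=f"step{step}"
    )
# A separate probe is then trained and evaluated on activations from each
# checkpoint, giving the test-loss-over-training curves shown below.
```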

The performance of each Pythia model size during pretraining

This graph shows the probe’s test loss over training steps for models of different sizes. Lower means better, so we can see that all models generally get better at understanding nullability as they train longer, and larger models learn faster and reach better performance overall.

Interestingly, for models up to 1 billion parameters, the loss actually starts to increase again after reaching a minimum. This might be because as training continues, the model develops more complex, non-linear representations that our simple linear probe can’t capture as well. Or it might be that the model starts to overfit on the training data and loses its more general concept of nullability.

A robot thinking deeply about code


What’s Next?

This is just a first step in understanding the internal thought processes of LLMs as they think about code. There are still richer types, program invariants, and all sorts of high-level concepts that are necessary for writing working code, but extracting them from LLMs might not be so easy.

But we’ve already shown several important things about looking into the “mind” of a model as it writes code. We can say definitively that LLMs have an internal concept of nullability, even if they aren’t always able to do the necessary program analysis to decide if variables are nullable.

As these models continue to improve, and as we scale to larger models, it will be interesting to see how their understanding of programming concepts evolves. And we’ll be here to study them as they do.

We thank Leo Gao, Chelsea Voss, and Zhanna Kaufman for their comments and suggestions during the drafting process of the technical writeup of this work.

Zou, Andy, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, et al. 2025. “Representation Engineering: A Top-Down Approach to AI Transparency.” arXiv. https://doi.org/10.48550/arXiv.2310.01405.