二次微通道类型推断

二次微通道类型推断
Quadratic Micropass Type Inference

原始链接: https://articles.luminalang.com/a/micropass-inference/

## 优先类型推断以提供更清晰的错误信息许多具有强类型推断的语言可能会产生令人困惑的错误信息，因为编译器对类型的假设并不总是与开发者的意图一致。本提案介绍了一种新的类型推断算法，旨在优先考虑基于开发者*可能*思考代码方式的类型统一，而不是简单地遵循源代码的顺序。核心思想是摆脱单次自上而下的推断过程，而是利用一系列有序的推断过程——优先考虑“已知应用”（具有预期类型的函数参数）和“已知赋值”，然后再进行不太确定的推断。这种方法旨在通过反映开发者的心理模型来解决类型歧义。该算法迭代地完善类型，在每个步骤之后重新运行早期推断过程，以利用新信息。错误生成被推迟到最终类型检查阶段，确保消息反映了优先的推断。基准测试显示出有希望的性能，尤其是在具有许多小函数的语言中，这使其成为改善开发者体验的可行方法。

一种名为“二次微步类型推断”的新方法旨在改进复杂类型系统中错误报告的质量。虽然在调试方面前景看好，但其二次性能引发了对实际编译器使用的担忧。其核心思想是采用一种多步算法，可以比传统的单步合一化更精确地定位类型错误。然而，目前缺乏与现有方法的直接比较基准测试。一种建议的优化方法是，对于成功编译的情况使用更快的传统合一化，仅在检测到错误后才采用二次方法。这种“污染”策略可以将性能影响限制在错误场景中，而在这些场景中，延迟不太重要。作者目前正在将该系统集成到实际编译器中，以评估实际性能，并已更正了代码中发现的一个笔误。一个关键问题仍然是，单个更大的推断步骤是否可以在不产生二次复杂度的情况下实现类似的结果。

原文

Showcase example animation

For the proof of concept implementation, see this repository

Languages that allow plentiful of type inference can often produce confusing error messages. Type inference makes assumptions about what the user intends types to be, and if there’s a type error, there’s a high risk one of those assumptions were wrong. When the compiler then creates an error message it’ll use those types that may have been inferred from false assumptions.

I propose a new kind of type inference algorithm that always prioritises the type unifications the end-user is most likely to care about, independent from the types order in source code.

The goal being to not have the end-user need to reverse engineer what causes the compiler to infer an incorrect type, and instead work with the users pre-existing assumptions.

Languages with simpler inference such as Go can infer inside-out. Meaning we can figure out the type of an expression/statement by looking at its kind plus the types of its subexpressions.

func main() {
    x := someFunction() // the type of the right-hand side is always known
}
 
func someFunction() string {...}

But this approach quickly falls apart.

fn main() {
    let f = |x| {
        // we don't know the type of `y` as the type of `x` is not yet known
        let y = (x, x);
        return y;
    };
    // we don't know the type of `f` either.

 
    f("Hello!"); // it's not until here we know the type of `x`, `y` and f`.
}

A more powerful form of inference is type variable unification. With such a system we can leave types partially uninferred, and have later expressions/statements fill in the gaps.

Lets take the same example but use type variable unification this time. We start inferring types inside-out like in the first example. But; for the types that are not yet known we put a type variable. (annotated as ’N)

fn main() {
    let f: fn(ˈ1) -> ˈ2 = |x| {
        let y: (ˈ1, ˈ1) = (x, x); 
        return y; // '2 = ('1, '1)
    };

 
    ...
}

As the name “type variable” suggests, they are variables and thus can be re-assigned.

    ...
 
    f("Hello!"); // the literal unifies '1 with string, assigning '1 = string
}

After that type variable assignment the final types are what we’d expect.

fn main() {
    let f: fn(string) -> (string, string) = |x| {
        let y: (string, string) = (x, x);
        return y;
    };

 
    f("Hello!");
}

However this approach creates a new problem. Since we still unify types inside-out, we can end up making a type variable assignment which later on turns out to be much more clearly described as being intended to be assigned to something else.

If we look at this example.

fn example(x) -> Point[int] {
    log(["origin: ", x]);
    return Point(x, x);
                 ^  ^   error: expected int, got string
}

By following the type variable unification steps from before x would be inferred as a string since it’s being used in a list with a string eventually creating the error message shown. But when a developer looks at this function they would be much more likely to expect x to be inferred as int since it’s more clearly being used as an int in the return statement.

If we could instead unify types in the order the end-user deems most important, we could create error messages that more strongly correlate to the developers thought process.

Core Concepts

Instead of one large unification pass traversing the code top to bottom and inside-out, inference is split into the following ordered passes.

known_applications
known_assignments
known_return_types
known_same_as_unifications
known_record_fields
resolve_records
less_known_functions
default_numbers
default_unknown_to_unit_or_lift

See Passes for a description of each pass

With this ordering the higher up the list you go the closer to the users thought process you are. This way types are inferred by what the user is likely to consider the most important rather than by their position in source code.

Each pass performs its minimal task leaving most types untouched. After a pass is completed, it re-runs all previous passes before it. Hoping the changes it made will facilitate earlier passes to perform their role. With this design it’s logical that the lower down the list you go the more aggressive passes are allowed to be with their inference. As its implied less new type information will be made available.

Since each type variable may be unified many times by the same pass, error generation is not reasonably able to work during inference. Therefore passes ignore errors and only touch the types it knows it can and should work with. Error generation is instead delegated to type checking.

After type inference has finished we perform a finalization step which creates a map from type variables to known static types. Since these don’t contain any type variables we can perform type checking with plain equality operations.

Examples

Let’s see what happens if we now try the snippet that generated the misleading error message again.

fn example(x) -> Point[int] {
    log(["origin: ", x]);
                     ^ error: expected string, got int
    return Point(x, x);
    //           ^  ^ `known_applications` pass runs first,
    //                 assigning the type variable of `x` to `int`
}

Much better!

Let’s walk through a more feature-complete example step-by-step to see how this algorithm reaches its result.

fn main(a, b) -> _ {
    let point = Point(a, 200);
    let list = [point.x, 300];
    let record = { x = 100, y = 200 };
    return point.y == record.y;
}

struct Point[T] { x: T, y: T }

In this imaginary language’s number literals can infer into any int size and are not assumed to be int. That’s why the first change isn’t the 200 literal being applied to Point.

(known application) Point(a, 200): Point[_] -> Point[{numeric}]
(known assignment) point: _ -> Point[_]
(known assignment) list: _ -> [_]
(known assignment) x = 100: _ -> {numeric}
(known assignment) y = 200: _ -> {numeric}
(known assignment) main return type: _ -> bool
(known application) a: _ -> {numeric}
(known application) point.y: _ -> {numeric}
(known application) record.y: _ -> {numeric}
(known same as) point.x: _ -> {numeric}
(known assignment) list: [_] -> [{numeric}]
(resolve record) { x = 100, y = 200 }: _ to Point[{numeric}]
(known assignment) record: _ -> Point[{numeric}]
(known record field) unifies record.y with y = 200
(default numbers) all {numeric} default to int
(default unit or lift) main declares type parameter T, b: _ -> T

Here’s a visualisation of those inference steps

Feature Complete Example

Passes

Infer the type of a parameter given to a function if the parameters type isn’t known but the expected type is.

Infer the type for the target of an assignment if the assigned value has a known type but the target does not.

This is also used when returning from a function, with the expected return type of the function being used as target.

If the same function is called in multiple places, and the return type of the called function is known, then for each call if the returned type is not known infer it to be the called functions known return type.

known_same_as_unifications

Unify all types that are used in the same list or branching expression with each other. Such as [x, y] and if true { x } else { y }.

If the type of a record is known, then for each instance where a field from that record is accessed, unify the type of the accessed field with the known field type from the record.

If the type of a record is not known, use its field names to figure out its type.

If a type which has not been inferred is called as a function, infer the type to be a function with the signature matching its first call.

If a type is not known but must be numeric, default it to be the default integer size.

default_unknown_to_unit_or_lift

If a type is not known, infer it to either the unit type if its created from an expression or an implicitly declared type parameter if its from the functions type signature.

Benchmarks

for _ in 0..iterations {
    defaulting();
    list();
    if_expr();
    field_inference();
    article_feature_complete_example_go();
    article_showcase_example_go();
    static_function_application();
    generic_function_application();
}

Intel i7 13700k @ 5.3GHz

7.79ms

2.78s

The benchmark is biased towards an Ocaml-like language where projects consist of many smaller functions, which reduces the severity of the algorithms quadratic nature. But with cached translation units the algorithm performs well enough that it should be viable regardless.

This proof of concept implementation sacrifices some performance for the sake of code clarity.