Are you expected to run five Python type-checkers now?

Original link: https://pyrefly.org/blog/too-many-type-checkers/

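Library maintainers are often troubled by the growing number of Python type checkers (mypy, Pyright, and others). Making a library's internal source code compatible with every checker tends to leave it bloated and littered with `type-ignore` comments. The author argues that maintainers have their priorities backwards: rather than forcing internal logic to satisfy every checker, they should focus on the library's public API. Because users rely on different type checkers, the most effective approach is to run as many checkers as possible on the **test suite** rather than on the source code. Testing the public API ensures that users get accurate auto-completion, documentation, and error protection whichever tool they prefer. Since type checkers tend to agree on how a public API behaves even when they disagree on internal implementation details, this approach reduces the maintenance burden while significantly improving the developer experience. The takeaway is clear: prioritise cross-checker compatibility in your tests so that your library works smoothly for the whole Python ecosystem.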

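A recent Hacker News discussion of the post "Are you expected to run five Python type-checkers now?" highlighted frustration with Python's increasingly complex type-checking landscape. The discussion centred on the practical friction caused by Python's flexibility. Participants explored the nuances of dunder methods such as `__eq__`, noting that although convention expects them to return a boolean, libraries like Polars or NumPy often return arrays or complex objects, which forces cumbersome type-checking workarounds. The discussion then widened into a broad critique of the Python ecosystem. One user argued that the language's developer experience has become "outdated", characterised by fragmented package management, cross-platform deployment problems, and a proliferation of competing type checkers. They argued that the rise of AI agents makes migrating to strongly typed compiled languages easier, because agents can bridge the gap between languages by porting existing Python functionality. Others defended Python, pointing out that many developers prefer traditional development workflows to relying on LLM-generated code. Ultimately, the discussion reflects the tension between Python's dynamic, "anything goes" philosophy and the modern industry's pursuit of rigorous, automated type safety.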

Original text

Mypy, Pyrefly, Pyright, ty, Zuban, and possibly more that will come in the future... how are library maintainers expected to cope?

TL;DR: Prioritise running as many type-checkers as possible on your test suite. Run at least one on your source code.

The type checking that matters most (and why you've probably got it backwards)

If you only read one section of this blog post, please make it this one. Because this is where a lot of packages get it wrong. It's common to see packages run type checkers on their source code and to leave their tests untyped. That approach has it backwards.

Suppose you maintain a Python package. As a hypothetical user of your code, I don't particularly care about your internal development practices. Whether you use ruff format or black, how you sort your imports, whether you use pytest or unittest, none of this affects me. What I do care about is your public API and my experience interacting with it.

When you run a type-checker on your internal source code, you're mostly testing your internal logic. You can do that with whichever type checker you prefer; that's your choice. Which type-checker your users use, on the other hand, isn't.

By running as many type-checkers as possible over your test suite, you ensure that your package's public API works well for as many of your users as possible.
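As a rough illustration of that advice, the sketch below runs each checker over a test directory and collects the exit codes. The command lines are assumptions based on each tool's usual invocation (for example `pyrefly check` and `ty check`), and `build_commands`, `run_checkers`, and `main` are hypothetical helper names, so check each tool's documentation before relying on them.

```python
from __future__ import annotations

import shutil
import subprocess

# Assumed command prefixes for each checker; adjust them to match the
# invocation documented by the tool and used in your project.
CHECKER_COMMANDS: dict[str, list[str]] = {
    "mypy": ["mypy"],
    "pyrefly": ["pyrefly", "check"],
    "pyright": ["pyright"],
    "ty": ["ty", "check"],
    "zuban": ["zuban", "check"],
}


def build_commands(target: str = "tests") -> dict[str, list[str]]:
    """Return the full command line for each checker, aimed at `target`."""
    return {name: [*prefix, target] for name, prefix in CHECKER_COMMANDS.items()}


def run_checkers(target: str = "tests") -> dict[str, int | None]:
    """Run every installed checker; map its name to an exit code.

    A value of None means the executable was not found on PATH.
    """
    results: dict[str, int | None] = {}
    for name, command in build_commands(target).items():
        if shutil.which(command[0]) is None:
            results[name] = None
            continue
        results[name] = subprocess.run(command, check=False).returncode
    return results


def main(target: str = "tests") -> int:
    """Print a summary and return 1 if any installed checker failed."""
    outcome = run_checkers(target)
    for name, code in outcome.items():
        status = "skipped (not installed)" if code is None else f"exit code {code}"
        print(f"{name}: {status}")
    return int(any(code not in (None, 0) for code in outcome.values()))
```

In a continuous integration job, something like `sys.exit(main("tests"))` would fail the build whenever an installed checker reports errors in the tests.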

The Polars story

Polars is a modern dataframe library which, since its launch in 2020, has been taking the data science world by storm. As a heavy user of the library, I was very interested in making its developer experience even better. If Polars' types are accurate, then as a user I get better auto-complete, documentation, and protection from certain classes of bugs. What would it take to add Pyrefly to Polars' continuous integration jobs?

I started investigating this, and quickly ran into some roadblocks. Pyrefly is generally stricter than mypy, so it required rewriting parts of the codebase or adding more explicit type annotations when instantiating variables. Furthermore, I encountered some bugs in Pyrefly, and encouragingly enough, fixes for the vast majority of them were shipped with the highly anticipated v1 release. I think it was worth it, especially as it uncovered a medium-priority bug, but I did have to ask myself whether going through this for another three type-checkers would be worth it too.

To illustrate this point, let's look at the function DataType.__eq__. In Python, any method __eq__ is expected to return bool, and if it doesn't, then we need to explicitly tell type-checkers to ignore the type error. This function in Polars can also return different types depending on the inputs, thus requiring overloads. To get this function to satisfy all of mypy, Pyrefly, and ty, we need to write:

    @overload
    def __eq__(
        self, other: pl.DataTypeExpr
    ) -> pl.Expr: ...

    @overload
    def __eq__(self, other: PolarsDataType) -> bool: ...

    def __eq__(self, other: pl.DataTypeExpr | PolarsDataType) -> pl.Expr | bool:

Wow, that's 4 different type-ignore comments for just 7 lines of code! You can see how a codebase quickly becomes polluted with such comments, or with workarounds to deal with different type-checkers' quirks. I don't think any library maintainer wants a codebase that looks like that. Surely there's a better way?
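To make the pattern concrete without depending on Polars, here is a minimal runnable sketch of an overloaded `__eq__` that returns either an expression object or a `bool`. The classes `Expr`, `DTypeExpr`, and `DType` are hypothetical stand-ins, not Polars' real types, and the single `# type: ignore[override]` comment is only the one mypy uses for this kind of override; other checkers may each want their own suppression syntax, which is exactly how the comments pile up.

```python
from __future__ import annotations

from typing import overload


class Expr:
    """Hypothetical stand-in for a lazily evaluated expression."""

    def __init__(self, description: str) -> None:
        self.description = description


class DTypeExpr:
    """Hypothetical stand-in for a data type that is only known lazily."""

    def __init__(self, name: str) -> None:
        self.name = name


class DType:
    """Hypothetical stand-in for a concrete data type."""

    def __init__(self, name: str) -> None:
        self.name = name

    # object.__eq__ is declared as returning bool, so returning Expr from
    # one overload is an incompatible override that has to be suppressed.
    @overload  # type: ignore[override]
    def __eq__(self, other: DTypeExpr) -> Expr: ...
    @overload
    def __eq__(self, other: DType) -> bool: ...

    def __eq__(self, other: DTypeExpr | DType) -> Expr | bool:
        if isinstance(other, DTypeExpr):
            # Comparing against a lazy type yields a lazy comparison.
            return Expr(f"{self.name} == {other.name}")
        return self.name == other.name

    # Defining __eq__ sets __hash__ to None, so restore hashing explicitly.
    def __hash__(self) -> int:
        return hash(self.name)
```

From the caller's side the behaviour is simple: comparing two `DType` objects gives a `bool`, and comparing a `DType` with a `DTypeExpr` gives an `Expr`.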

Instead of putting all your internals through multiple type-checkers, why not start by testing that all major type-checkers can be used with your library's public API? That's much more useful, so it's easier to justify spending time on it. But it's also easier, because you're just ensuring that, if your library gets used as-intended, then there are no type errors. In the case of DataType.__eq__, there's a test for it that looks like this:

DTYPE_TEMPORAL_UNITS: Final[frozenset[TimeUnit]] = frozenset(["ns", "us", "ms"])

def test_dtype_time_units() -> None:
    for time_unit in DTYPE_TEMPORAL_UNITS:
        assert pl.Datetime == pl.Datetime(time_unit)
        assert pl.Duration == pl.Duration(time_unit)

        assert pl.Datetime(time_unit) == pl.Datetime
        assert pl.Duration(time_unit) == pl.Duration

What's pleasing to see is that mypy, Pyrefly, Pyright, ty, and Zuban all type-check this without reporting any errors! So even though the type-checkers disagree a bit on how the implementation should be written, they all agree about the effects on the public API. And that's what your users care about!
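The same idea applies to any library with overloaded public functions. The sketch below uses a hypothetical function `parse` (not part of Polars) and a typed test whose annotated assignments double as static checks: a type-checker run over the test file should flag any call whose inferred return type does not fit the annotation.

```python
from __future__ import annotations

from typing import overload


@overload
def parse(value: str) -> int: ...
@overload
def parse(value: list[str]) -> list[int]: ...
def parse(value: str | list[str]) -> int | list[int]:
    """Hypothetical public API: parse one string or a list of strings."""
    if isinstance(value, str):
        return int(value)
    return [int(item) for item in value]


def test_parse_types() -> None:
    # A checker verifies each annotated assignment against the overloads,
    # so this test exercises the public types as well as the behaviour.
    single: int = parse("42")
    many: list[int] = parse(["1", "2"])
    assert single == 42
    assert many == [1, 2]
```

On Python 3.11 and later, `typing.assert_type` is a stricter alternative to annotated assignments, since it asks the checker to confirm the exact inferred type rather than mere assignability.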

Getting Pyrefly to run on the whole Polars test suite was relatively painless; you can check out the PR to verify this. To ease Polars' own internal development, we've also been exploring using Pyrefly on its source code, though that is a larger effort and is being tackled incrementally.

What about my source code? Why are there so many type checkers anyway?

The typing spec outlines a standard set of rules that type checkers are expected to adhere to. There are aspects of it that are a bit hazy, however, such as in cases where users under-specify typing information. In those cases, different type checkers make different design decisions:

Some choose to be as strict as possible, emitting false-positives if necessary, but doing as much as possible to guard you from potential bugs.
Others are more lenient and allow you to add type information to your codebase more gradually.
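A small example of under-specified typing information is an empty container with no annotation. In the hypothetical functions below, the first leaves the element type of `lengths` for each checker to infer, which is the kind of situation where design decisions can differ; the second states it explicitly, which is the sort of annotation a stricter checker may ask for. Both behave identically at runtime.

```python
from __future__ import annotations


def word_lengths_unannotated(words: list[str]) -> list[int]:
    # The element type of `lengths` is not stated, so each checker has to
    # decide for itself what to infer for the empty list.
    lengths = []
    for word in words:
        lengths.append(len(word))
    return lengths


def word_lengths_annotated(words: list[str]) -> list[int]:
    # An explicit annotation removes the ambiguity for every checker.
    lengths: list[int] = []
    for word in words:
        lengths.append(len(word))
    return lengths
```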

When it comes to type-checking your source code, it's good to ask yourself where on the strict vs lenient spectrum you want to be. Pyrefly is not only strict (though this can be configured), but also fast and conformant, making it an excellent choice. If you try it out on your projects and run into any issues, please report them so that both you and all its other users can benefit from fixes!

The bottom line

There are 5 Python type-checkers which get attention these days: mypy, Pyrefly, Pyright, ty, and Zuban. Library maintainers may rightfully feel that running all 5 of them over their source code is too much maintenance effort and requires polluting their code with too many type-ignore comments. We have made the case that such effort would be better spent running multiple type-checkers over their tests instead, as that will test how well the library can be type-checked when users interact with it.