My first fifteen compilers (2019)

原始链接: https://blog.sigplan.org/2019/07/09/my-first-fifteen-compilers/

## The nanopass approach to compiler development

Traditional compiler development can be daunting because it involves complex analyses, transformations, and optimizations, often spread across many passes. The "nanopass" approach offers a more approachable alternative: structure the compiler as *a large number of small, single-purpose passes*. The approach was popularized by Kent Dybvig's compilers course (P523), in which students incrementally build a Scheme-to-x86 compiler, adding passes each week and ending up with a 43-pass compiler. A key benefit is having a working compiler after every step, especially with "back-to-front" development (starting from code generation). Originally developed for research, the nanopass framework proved surprisingly efficient and is now used by production compilers such as Chez Scheme (the foundation of Racket). It is analogous to parser combinators, building complexity from simple components. The approach makes compiler construction feel as feasible as writing an interpreter and encourages modular, maintainable code. For Lindsey Kuper, the experience was pivotal in inspiring her to contribute to the Rust compiler, despite the differing implementation details, simply because it convinced her that building a compiler was *achievable*.

This Hacker News discussion centers on a link to a blog post on sigplan.org titled "My first fifteen compilers". The original article and a 2019 resubmission had previously drawn significant attention on the platform. The comments mostly stray into unrelated topics. One user complained about the site's forced smooth scrolling and was advised to report it to the site directly rather than on Hacker News. Another user asked about the rationale for building a compiler "back to front", questioning how a pass can be implemented without knowing the input/output requirements of future passes; this was answered by explaining that the intermediate languages evolve independently. Finally, one commenter suggested that the initial pass acts as an assembly-to-assembly compiler, serving as the target for later optimizations. The thread also includes an application announcement for Y Combinator's Winter 2026 batch.

Original article

Compilers are sophisticated software artifacts that transform a source-language program into a target-language program, usually by taking a series of passes over the input program.  Each compiler pass may perform a transformation, such as closure conversion; it may perform an optimization, such as dead code elimination; or it may perform an analysis, the results of which can be used in later transformation and optimization passes.
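To make the idea of a single-purpose pass concrete, here is a minimal sketch (using a hypothetical tuple-based IR, not anything from the post) of one optimization pass: dead code elimination for a tiny "let"-based expression language.

```python
# Programs in this toy IR are tuples:
#   ("let", name, bound_expr, body), ("var", name), or ("const", value).

def free_vars(expr):
    """Collect the variable names an expression actually uses."""
    kind = expr[0]
    if kind == "var":
        return {expr[1]}
    if kind == "const":
        return set()
    # ("let", name, bound, body): 'name' is bound only inside 'body'.
    _, name, bound, body = expr
    return free_vars(bound) | (free_vars(body) - {name})

def eliminate_dead_code(expr):
    """Drop let-bindings whose variable is never referenced in the body."""
    if expr[0] != "let":
        return expr
    _, name, bound, body = expr
    body = eliminate_dead_code(body)
    if name not in free_vars(body):
        return body  # the binding is dead; discard it
    return ("let", name, eliminate_dead_code(bound), body)

prog = ("let", "x", ("const", 1),
        ("let", "y", ("const", 2),
         ("var", "y")))
print(eliminate_dead_code(prog))  # the unused binding of x disappears
```

A real pass would also have to worry about side effects in the discarded bound expression; this sketch assumes pure expressions.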

We sometimes think of the number of passes in a compiler as a measure of the compiler’s complexity.  The classic paper “From System F to Typed Assembly Language”, for example, explains that a compiler “may make as many as 20 passes over a single program, performing sophisticated analyses and transformations such as CPS conversion, closure conversion, unboxing, subsumption elimination, or region inference.”  In this context, 20 is intended to sound like a large number, and indeed, it does sound a bit daunting.  But what if we could make compiler development more approachable by fully embracing the idea that a compiler should be structured as a large number of small passes, each performing a single specific task?

The nanopass approach

The first compiler I ever worked on was the Scheme-to-x86-64 compiler I wrote for Kent Dybvig's compilers course, known as P523, in my first year as a grad student at Indiana University, back in spring 2009. Actually, I didn’t write just one compiler that semester; I wrote fifteen compilers, one for each week of the course. The first week, my compiler had an input language that was more or less just parenthesized assembly language, and its target language was x86-64 assembly. Each week, we added more passes to the front of the previous week’s compiler, resulting in a new compiler with the same target language as the compiler of the previous week, but a slightly higher-level input language.

By the end of the course, I had a compiler that compiled a substantial subset of Scheme to x86-64, structured as 43 small passes. Each pass translated from its input language to a slightly lower-level language, or had the same input and output language but performed some analysis or optimization on it.  (I named my compiler SALL-E, which stood for “Space Allocation Lambda Lifter, Earth-class”, riffing on a recent-at-the-time movie.)

Building a 43-pass compiler in fifteen weeks was feasible because P523 used the “nanopass” approach to compiler development, in which one structures a compiler as a series of many small passes, each with a well-defined input and output language.  This approach is supported by an open-source software framework that provides a domain-specific language for defining the intermediate languages of a compiler and the compiler passes that will translate between them, and provides facilities for doing this in a low-overhead way.
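The overall shape is easy to sketch, even without the framework's domain-specific language. The following Python sketch (illustrative only; the actual nanopass framework is a Scheme DSL with `define-language` and `define-pass` forms) shows a compiler that is nothing but a pipeline of tiny passes, each translating its input into a slightly lower-level language.

```python
# Toy tuple-based IR: ("inc", e), ("add", e1, e2), ("const", n).

def desugar_increment(expr):
    """Pass 1: rewrite ("inc", e) into ("add", e, ("const", 1))."""
    if expr[0] == "inc":
        return ("add", desugar_increment(expr[1]), ("const", 1))
    if expr[0] == "add":
        return ("add", desugar_increment(expr[1]), desugar_increment(expr[2]))
    return expr

def fold_constants(expr):
    """Pass 2: evaluate additions whose operands are both constants."""
    if expr[0] == "add":
        left, right = fold_constants(expr[1]), fold_constants(expr[2])
        if left[0] == "const" and right[0] == "const":
            return ("const", left[1] + right[1])
        return ("add", left, right)
    return expr

PASSES = [desugar_increment, fold_constants]

def compile_expr(expr):
    for p in PASSES:  # the whole compiler is just a pipeline of passes
        expr = p(expr)
    return expr

print(compile_expr(("inc", ("const", 41))))  # → ('const', 42)
```

What the real framework adds on top of this shape is a declarative description of each intermediate language, so that a pass that emits a term outside its declared output language is caught automatically.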

The nanopass approach was originally described in the ICFP ’04 paper “A Nanopass Infrastructure for Compiler Education” by Dipa Sarkar, Oscar Waddell, and Dybvig (an expanded version of which later appeared in JFP).  Interestingly, the nanopass work was originally not intended to be specifically for education, but the ICFP reviewers required that the authors position it that way in the paper out of concern that a nanopass-style compiler would not be efficient enough for actual production use. Nine years later, Andy Keep and Dybvig documented this bit of history (and refuted the efficiency concern) in their ICFP ’13 paper “A Nanopass Framework for Commercial Compiler Development”, which describes their rewrite of the Chez Scheme compiler using the nanopass approach.  Chez Scheme itself was open-sourced in 2016 and, excitingly, is now the foundation of Racket.

I like to think of the nanopass approach as taking the idea of parser combinator libraries and extending that idea to the development of an entire compiler. With a parser combinator library, you write a parser by starting with a bunch of primitive parsers (say, that parse numbers or characters) and combining them, eventually building up the ability to parse a sophisticated language. The language one can parse gets fancier and fancier, but at every step of the way, the thing one has is a parser. Likewise, when developing a compiler, it’s useful to be able to think of the thing that you have at each stage of the process as already being a compiler; as you go along, it becomes a compiler for a language that’s increasingly different from the target language.
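The parser combinator analogy can be made concrete with a few lines of code. In this sketch (hypothetical helpers, not a real combinator library), a parser is any function from a string to either `None` or a `(result, remaining_input)` pair; the combinator takes a parser and returns a parser, so at every step the thing you have is still a parser.

```python
def digit(s):
    """Primitive parser: match one decimal digit."""
    return (s[0], s[1:]) if s and s[0].isdigit() else None

def many1(p):
    """Combinator: apply parser p one or more times, collecting results."""
    def parse(s):
        first = p(s)
        if first is None:
            return None
        out, rest = [first[0]], first[1]
        while (r := p(rest)) is not None:
            out.append(r[0])
            rest = r[1]
        return ("".join(out), rest)
    return parse

number = many1(digit)    # a fancier parser, built from a primitive one
print(number("123abc"))  # → ('123', 'abc')
```

Just as `many1(digit)` is itself a parser, each week's extended compiler in P523 was itself a compiler.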

Backend-first compiler development

Although the nanopass approach doesn’t specifically mandate implementing a compiler in a back-to-front manner — starting with code generation and working upward from there — the back-to-front approach was a hallmark of P523 in the year I took it.  For me, a first-year grad student who had never worked on compilers before, this way of organizing the work was incredibly motivating: at the end of week one of the course (and at the end of week two, and so on for each week), I had written a compiler! Admittedly, what I had at the end of week one was a compiler for an input language that wasn’t very different from the output language. But it converted code in its input language to honest-to-goodness x86 assembly code on which I could then run an off-the-shelf assembler and produce a working executable.
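A toy sketch of that week-one idea: when the input language is little more than parenthesized assembly, "compiling" is mostly a matter of printing real x86-64 syntax that an off-the-shelf assembler will accept. (The instruction forms handled here are illustrative only, not the actual P523 language.)

```python
def emit(instrs):
    """Translate tuples like ("mov", "rax", 40) into AT&T-style x86-64.

    The source operand may be an integer literal or a register name.
    """
    lines = []
    for op, dst, src in instrs:
        src_txt = f"${src}" if isinstance(src, int) else f"%{src}"
        lines.append(f"    {op}q {src_txt}, %{dst}")
    return "\n".join(lines)

prog = [("mov", "rax", 40), ("add", "rax", 2)]
print(emit(prog))
# →     movq $40, %rax
#       addq $2, %rax
```

Trivial as it is, the output is genuine assembly: wrap it in a labeled entry point and you can assemble, link, and run it, which is exactly the week-one payoff described above.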

Some compiler-development experiences are long slogs where you write code for months without ever having a thing that produces an actual executable that you can run. But with the back-to-front nanopass approach, we got that hit of gratification every week! Furthermore, thinking of each component of the compiler as itself being a compiler was useful because it encouraged us to structure our code in a readable, modular, and maintainable way, in much the same way that parser combinator libraries support the development of readable, modular, maintainable parsers.

It’s unusual to see compiler courses or books structured in this back-to-front way. The innovative “From NAND to Tetris” course seems to come close – projects 7 and 8 cover the back end of a compiler, while projects 10 and 11 cover the front end – but, even then, projects 10 and 11 go in front-to-back order, rather than back-to-front.  A 2006 Scheme Workshop paper by Aziz Ghuloum, though, advocates another approach to incremental compiler development that is a cousin to the back-to-front nanopass approach, which is to implement a complete compiler for a subset of the source language and then gradually expand that subset.  (Nada Amin has a repository containing such a step-by-step development, and Indiana’s current compiler course still uses such an approach.)

Both Ghuloum’s incremental approach and the back-to-front nanopass approach share the property that each step produces real assembly code that can be executed directly on the hardware after assembly, and each step results in a working compiler (for increasingly bigger subsets of the source language for the former; for increasingly higher-level languages for the latter).  Ghuloum convincingly argues that this way of doing compiler development can make writing a compiler as approachable as writing an interpreter, concluding, “Compiler construction is not as complex as it is commonly perceived to be. […] Once the basic compiler is mastered, the novice implementor is better equipped for tackling more ambitious tasks.”

From Scheme to Rust

For me, seeing how a compiler’s implementation could be broken down into a series of relatively small, well-defined, and approachable steps was vital to my career’s development. I began contributing to the implementation of Rust as an intern at Mozilla Research starting in 2011.  I learned a great deal from working on Rust for two summers, and even more importantly, I got to know a lot of people whose presence in my life has helped me build a research career.

Mozilla didn’t hire me to work on Rust because of any specific compiler implementation skill that I learned in P523; in fact, there was very little overlap between what I did in the course and what I did working on Rust. For the P523 compiler, for instance, I implemented register allocation, whereas Rust compiles to LLVM, which takes care of register allocation for you. Conversely, since Scheme is an S-expression-based language, the parser for the P523 compiler was incredibly simple, whereas parsing Rust is pretty involved; and because Scheme is not statically typed, we didn’t implement type checking in P523, whereas much of my time working on Rust was spent on the parts of the compiler responsible for type checking and type inference.

Nevertheless, it was only because of having taken P523 that I even considered applying to work on Rust at Mozilla, because P523 made me believe that a compiler was something that I could work on and wanted to work on.  I count myself lucky that my first exposure to compiler implementation showed me that writing a real compiler doesn’t necessarily have to be a monolithic and unapproachable task that only a heroic few people could ever hope to accomplish.

Bio: Lindsey Kuper is an assistant professor at UC Santa Cruz, where she works on language-based approaches to building parallel and distributed software systems that are correct and efficient.  She co-founded !!Con, the conference of ten-minute talks on the joy, excitement, and surprise of computing, and its sister conference !!Con West.  A preliminary version of this post appeared on her blog in 2017.

Disclaimer: These posts are written by individual contributors to share their thoughts on the SIGPLAN blog for the benefit of the community. Any views or opinions represented in this blog are personal, belong solely to the blog author and do not represent those of ACM SIGPLAN or its parent organization, ACM.
