Compiling a Forth

Original link: https://healeycodes.com/compiling-a-forth

## A Forth-like Language: Compiler and VM Summary

This project details a bytecode compiler and virtual machine (VM) built for a Forth-inspired language. Forth is a stack-oriented programming language. The author built the project to understand how Forth works internally, and provides visualizations to illustrate the process.

The system supports Forth's core features: a data stack and a return stack, word definitions (functions), DO/LOOP control flow, and variables. Words use the return stack for function calls, and DO/LOOP manages loop control via the return stack, which enables nested loops. Variables are declared, loaded, and stored with dedicated commands.

The compiler converts source code to bytecode via tokenization (breaking the code into meaningful units), then generates bytecode directly, skipping the traditional abstract syntax tree. The VM executes this bytecode, managing the stacks and a variable table. Key bytecode operations include pushing values, storing to and loading from memory, jumps, calling and returning from functions, and stack manipulation.

Visualizations built with React aid understanding by stepping through tokenization, bytecode generation, and VM execution. While this is not a faithful Forth implementation (it compiles ahead of time, uses bytecode rather than threaded code, and lacks a dynamic dictionary), the project successfully captures Forth's core stack-based principles. The code is available online for interactive exploration.

## The Heart of Forth: Interactivity and Extensibility

A Hacker News discussion explores the essence of the Forth language. The original author's Forth-like compiler sparked a debate about what truly defines the "spirit of Forth".

Key to that spirit is **interactivity**: editing, compiling, and execution are seamlessly integrated into a single workflow. This enables rapid development and evolution, even in resource-constrained environments, and removes worries about compiler versions. Features like immediate words provide `comptime`-like capabilities.

Commenters agree that Forth's strength lies not in the language itself (compared with C or assembly) but in its **interactive environment**. That interactivity shines during debugging, where variables can be inspected directly. A further core principle is **bootstrapping**: building the system from a minimal assembly core and then extending it entirely in Forth.

While the size of a minimal compiler was debated (the feasibility of a few hundred bytes was questioned), the discussion highlights Forth's value as a learning tool for interpreter and language design.
## Original Article

I was curious how Forth worked so I built a bytecode compiler and a VM for a Forth-like language, as well as some visualizations to show how it all works.

You don't need to know anything about Forth to follow along, aside from the fact it's a stack-oriented language.

Here's a small program that prints the number three: `3 .`

The number (3) is pushed to the data stack, and then the dot (.) pops it from the data stack and prints it.

We'll need more Forth features than this to build interesting programs.

Forth has two built-in stacks. The data stack (sometimes just called "the stack") and the return stack. When a word is called in Forth (words are like functions) the address of the next instruction is pushed to the return stack. When the word finishes executing, the return stack is popped into the instruction pointer.

```
\ (1) word declaration
: PRINT10
  \ (3) the word body is executed
  10 .
  \ (4) ";" compiles an exit – at runtime it pops the return stack
  \ into the instruction pointer
;

\ (2) instruction pointer lands on a word,
\ the next address is pushed to the return stack,
\ and the instruction pointer is set to the word address
PRINT10

\ (5) next address is executed
```

As well as words, my compiler also supports DO/LOOPs. These use the return stack too. When DO executes, it pops the limit and the iterator from the data stack and stores them in the return stack. This allows the inner loop to freely operate on the data stack. When LOOP executes, it pops the limit and iterator from the return stack, adds one to the iterator and compares it to the limit (and exits or loops again).

There are also variables, which can be declared with VARIABLE X, loaded with X @, and stored with 1 X !.

Putting these features together, here's how you can build 10 by adding 1 repeatedly.

```
VARIABLE A

: RUN
  0 A !          \ initialize A
  10 0 DO        \ push limit and iterator for DO
                 \ DO places these on the return stack
    A @ 1 + A !  \ A = A + 1
  LOOP           \ increments i and exits when i == limit
  A @ .          \ prints 10
;

RUN
```

This set of features is enough for us to calculate numbers from the Fibonacci series, which is the example program I'll be using throughout the rest of this post.
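The post doesn't list the Fibonacci program at this point, but using only the features described above, one way to write it looks like the sketch below. This is my own reconstruction, not necessarily the author's exact code; the word name FIB10 comes from a comment later in the post, while the variables A, B, and T are illustrative.

```
VARIABLE A
VARIABLE B
VARIABLE T

: FIB10
  0 A ! 1 B !        \ A = 0, B = 1
  10 0 DO
    A @ B @ + T !    \ T = A + B
    B @ A !          \ A = B
    T @ B !          \ B = T
  LOOP
  A @ .              \ prints 55, the 10th Fibonacci number
;

FIB10
```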

Tokenizing

Tokenization translates raw text into meaningful symbols.

To turn source code into tokens, we scan through the code, skipping over whitespace and appending tokens to a list. Syntax that's a single character is turned straight into a token but multi-character syntax needs to be grouped together. For example, entire comments are discarded, and while they are being discarded, we need to track that we're "within" a comment.

Identifiers, such as the keywords DO and LOOP or custom variable names like MYLONGVAR, become single tokens.

First, a visualization of what's happening:

And here's a trimmed version of my tokenizer:

```typescript
function tokenize(source: string): Token[] {
  const tokens: Token[] = [];
  let index = 0;
  while (index < source.length) {
    // Consume and discard everything on a line after '\'
    if (source[index] === "\\") {
      while (index < source.length && source[index] !== "\n") {
        index++;
      }
      index++;
      continue;
    }

    // Skip over whitespace
    if (isWhitespace(source[index])) {
      index++;
      continue;
    }

    if (source[index] === "@") {
      tokens.push({ type: "load" });
      index++;
      continue;
    }

    // Handle identifiers
    if (isLetter(source[index])) {
      let value = "";
      while (isLetter(source[index])) {
        value += source[index];
        index++;
      }

      // Special-case the keywords
      if (value === "DO") {
        tokens.push({ type: "do" });
        continue;
      }
      if (value === "LOOP") {
        tokens.push({ type: "loop" });
        continue;
      }

      tokens.push({ type: "identifier", value });
      continue;
    }

    // .. trimmed other tokens, see source
  }

  return tokens;
}
```
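To see the trimmed tokenizer run on its own, it can be fleshed out with assumed helpers. The `Token` type, `isWhitespace`, and `isLetter` below are my guesses at what the full source defines, and the trailing `index++` fallback stands in for the trimmed branches:

```typescript
type Token =
  | { type: "do" }
  | { type: "loop" }
  | { type: "load" }
  | { type: "identifier"; value: string };

// Assumed helper implementations (not shown in the post):
const isWhitespace = (c: string) => /\s/.test(c);
const isLetter = (c: string | undefined) =>
  c !== undefined && /[A-Za-z]/.test(c);

function tokenize(source: string): Token[] {
  const tokens: Token[] = [];
  let index = 0;
  while (index < source.length) {
    // Consume and discard everything on a line after '\'
    if (source[index] === "\\") {
      while (index < source.length && source[index] !== "\n") index++;
      index++;
      continue;
    }
    // Skip over whitespace
    if (isWhitespace(source[index])) {
      index++;
      continue;
    }
    if (source[index] === "@") {
      tokens.push({ type: "load" });
      index++;
      continue;
    }
    // Identifiers, with DO/LOOP special-cased
    if (isLetter(source[index])) {
      let value = "";
      while (isLetter(source[index])) {
        value += source[index];
        index++;
      }
      if (value === "DO") { tokens.push({ type: "do" }); continue; }
      if (value === "LOOP") { tokens.push({ type: "loop" }); continue; }
      tokens.push({ type: "identifier", value });
      continue;
    }
    index++; // skip characters this trimmed version doesn't handle
  }
  return tokens;
}

// Comments are discarded; keywords and names become tokens:
const tokens = tokenize("\\ a comment\nDO MYVAR @ LOOP");
```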

With our list of tokens, we're ready to start generating bytecode for the VM.

Generating Bytecode

Usually, in a compiler, the step after tokenization is parsing, where an abstract syntax tree is built. However, the feature set of my Forth is so small that I decided to generate bytecode directly from the list of tokens.

After bytecode generation, my VM needs two things:

  • A list of operations for the VM's instruction pointer to navigate
  • The number of variables that the program refers to

The latter tells the VM how many variables to allocate (a zero-initialized array). Variables in source (e.g., A, B) become integer indices into this array.

This means that my bytecode generation step needs to keep track of variables that have been seen before so that I can output the correct memory address (i.e. an index into the variable table).

I'll show the full list of bytecode operations and then a few of the steps for handling specific tokens.

```typescript
type Op = {
  op: "lit", // Push value or address to DS
  value: number;
} | {
  op: "load", // Pop address from DS, push value at address
} | {
  op: "store", // Pop address from DS, pop value from DS, store value at address
} | {
  op: "dup2", // Duplicate top two values on DS [a, b] -> [a, b, a, b]
} | {
  op: "add", // Pop top two values from DS, push sum to DS
} | {
  op: "eq", // Pop top two values from DS, push 1 if equal, 0 if not
} | {
  op: "jz", // Pop value from DS, if zero, jump to address
  address: number;
} | {
  op: "jmp", // Jump to address
  address: number;
} | {
  op: "call", // Push IP to RS, jump to address
  address: number;
} | {
  op: "ret", // Pop IP from RS, jump to IP
} | {
  op: "rs_push", // Pop from DS, push to RS
} | {
  op: "rs_pop", // Pop from RS, push to DS
} | {
  op: "drop", // Discard top value from DS
} | {
  op: "print", // Pop value from DS, print it
}
```
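As a concrete (and hypothetical) illustration of how source maps onto these ops, a program like `VARIABLE A  5 A !  A @ .` could compile to the listing below, assuming A is assigned index 0 in the variable table. This is a sketch based on the op descriptions above, not the compiler's verbatim output:

```typescript
// Just the ops this example needs, matching the descriptions above
type Op =
  | { op: "lit"; value: number }
  | { op: "store" }
  | { op: "load" }
  | { op: "print" };

// VARIABLE A emits no bytecode; it only claims an index
const variableTable: Record<string, number> = { A: 0 };

const bytecode: Op[] = [
  { op: "lit", value: 5 },                  // push the value 5
  { op: "lit", value: variableTable["A"] }, // push A's address (0)
  { op: "store" },                          // A = 5
  { op: "lit", value: variableTable["A"] }, // push A's address again
  { op: "load" },                           // push the value stored at A
  { op: "print" },                          // pop and print: 5
];
```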

The bytecode generation step scans through the list of tokens and, as it processes them, it appends to a list of bytecode and increments the variable count to set up the correct references.

Identifier tokens are either variable references or words (function calls).

```typescript
let index = 0;
while (index < tokens.length) {
  const token = tokens[index];
  if (token.type === "identifier") {
    if (token.value === "VARIABLE") {
      const nextToken = tokens[index + 1];
      // Store a binding of variable name to memory address
      variableTable[nextToken.value] = Object.keys(variableTable).length;
      index += 2;
      continue;
    }

    // If the identifier has been declared as a word like `: FIB10`
    // then we have previously stored the bytecode offset which we
    // will set the instruction pointer to at runtime
    if (wordTable[token.value] !== undefined) {
      bytecode.push({ op: "call", address: wordTable[token.value] });
      index++;
      continue;
    }

    // If it's not a variable declaration, or a word, then we
    // look up the memory address
    bytecode.push({ op: "lit", value: variableTable[token.value] });
    index++;
    continue;
  }

  // ..
}
```

Setting up the DO/LOOP bytecode generation was the trickiest part of this project: it's a minefield of possible off-by-one errors. The result isn't easy to read, but I've included it anyway because even a glance at it should help you understand how the loop variables (limit and iterator) and instruction pointer jumps combine to execute loops in Forth.

```typescript
// ..

if (token.type === "do") {
  index++;
  // Expect: DS has [limit, start] (start is top)
  // Move both to RS: start then limit (RS top becomes limit)
  bytecode.push({ op: "rs_push" }); // start -> RS
  bytecode.push({ op: "rs_push" }); // limit -> RS
  // Mark first instruction of loop body
  loopStart.push(bytecode.length);
  continue;
}

if (token.type === "loop") {
  // Pop limit and i from RS (RS top is limit)
  bytecode.push({ op: "rs_pop" }); // limit -> DS
  bytecode.push({ op: "rs_pop" }); // i -> DS
  // Increment i
  bytecode.push({ op: "lit", value: 1 });
  bytecode.push({ op: "add" }); // i on DS
  // Duplicate i and limit for compare and possible restore
  bytecode.push({ op: "dup2" });
  bytecode.push({ op: "eq" }); // eq flag on DS
  const loopStartAddress = loopStart.pop(); // first instr of loop body
  // Branch: continue when not equal (eq == 0), exit when equal
  const continueAddress = bytecode.length + 4; // skip jz + equal-path (2 drops + jmp)
  bytecode.push({ op: "jz", address: continueAddress });
  // Equal path (fallthrough): cleanup and exit
  bytecode.push({ op: "drop" }); // drop i
  bytecode.push({ op: "drop" }); // drop limit
  const afterBlockAddress = bytecode.length + 1 /* jmp */ + 3 /* continue block */;
  bytecode.push({ op: "jmp", address: afterBlockAddress });
  // Continue path:
  // address == continueAddress
  bytecode.push({ op: "rs_push" }); // i -> RS
  bytecode.push({ op: "rs_push" }); // limit -> RS (RS top is limit again)
  bytecode.push({ op: "jmp", address: loopStartAddress });
  index++;
  continue;
}

// ..
```
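To make the layout concrete, here's the bytecode this scheme would emit for a bare `2 0 DO LOOP`, along with a throwaway stack machine to run it. This is my own sketch for checking the jump arithmetic, not the post's VM:

```typescript
// Bytecode for `2 0 DO LOOP` under the generation scheme above.
// Addresses are indices into the ops array.
type Op = { op: string; value?: number; address?: number };

const ops: Op[] = [
  { op: "lit", value: 2 },    //  0: limit
  { op: "lit", value: 0 },    //  1: start
  { op: "rs_push" },          //  2: start -> RS
  { op: "rs_push" },          //  3: limit -> RS (loop body starts at 4)
  { op: "rs_pop" },           //  4: limit -> DS (empty body, so LOOP code starts here)
  { op: "rs_pop" },           //  5: i -> DS
  { op: "lit", value: 1 },    //  6: push 1
  { op: "add" },              //  7: i + 1
  { op: "dup2" },             //  8: [limit, i+1, limit, i+1]
  { op: "eq" },               //  9: equality flag
  { op: "jz", address: 14 },  // 10: not equal -> continue path
  { op: "drop" },             // 11: equal path: drop i+1
  { op: "drop" },             // 12: drop limit
  { op: "jmp", address: 17 }, // 13: exit past the end
  { op: "rs_push" },          // 14: i+1 -> RS
  { op: "rs_push" },          // 15: limit -> RS
  { op: "jmp", address: 4 },  // 16: back to the loop body
];

// A throwaway stack machine, just enough to check the control flow
const ds: number[] = [];
const rs: number[] = [];
let backEdges = 0;
let ip = 0;
while (ip < ops.length) {
  const cur = ops[ip];
  switch (cur.op) {
    case "lit": ds.push(cur.value!); ip++; break;
    case "rs_push": rs.push(ds.pop()!); ip++; break;
    case "rs_pop": ds.push(rs.pop()!); ip++; break;
    case "add": ds.push(ds.pop()! + ds.pop()!); ip++; break;
    case "dup2": { const [a, b] = ds.slice(-2); ds.push(a, b); ip++; break; }
    case "eq": ds.push(ds.pop()! === ds.pop()! ? 1 : 0); ip++; break;
    case "drop": ds.pop(); ip++; break;
    case "jz": ip = ds.pop() === 0 ? cur.address! : ip + 1; break;
    case "jmp":
      if (cur.address === 4) backEdges++; // count trips back to the body
      ip = cur.address!;
      break;
  }
}
// With limit 2 and start 0, the back-edge is taken once (i: 0 -> 1),
// and the loop exits with both stacks empty.
```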

The rest of the token branches are more straightforward. Tokens like dot, store, load, and print all map directly to bytecode operations.

The colon token branch sets the bytecode offset for the word name which allows identifiers to become word calls as we saw above.

Now we've earned a visualization break.

VM

Writing the VM felt a little bit like dessert. Manually stepping through the bytecode as I worked on the generation logic gave me fairly good confidence that I was heading in the right direction; I only came across one or two off-by-one bugs when putting the VM together. Essentially, I had designed it ahead of time.

The VM scans through the bytecode operations using the instruction pointer (which starts at 0). The instruction pointer can jump around as it encounters jmp (jump to offset) or jz (conditional jump).

It manages the data stack, return stack, and the variable table (i.e. memory addresses).

Here's a trimmed version of the VM:

```typescript
function vm(program: Program) {
  const dataStack: number[] = [];
  const returnStack: number[] = [];
  const variableTable: number[] = new Array(program.variableCount).fill(0);
  let ip = 0;
  while (ip < program.bytecode.length) {
    const cur = program.bytecode[ip];
    if (cur.op === "lit") {
      dataStack.push(cur.value); // Literal or memory address
      ip++;
      continue;
    } else if (cur.op === "store") {
      const address = dsPop();
      const value = dsPop();
      variableTable[address] = value;
      ip++;
      continue;
    } else if (cur.op === "jmp") {
      ip = cur.address;
      continue;
    } else if (cur.op === "jz") {
      if (dsPop() === 0) {
        ip = cur.address;
        continue;
      }
      ip++;
      continue;
    } else if (cur.op === "call") {
      ip++;
      returnStack.push(ip);
      ip = cur.address;
      continue;
    } else if (cur.op === "ret") {
      ip = rsPop();
      continue;
    }

    // .. trimmed other ops, see source
  }
}
```

The code for my compiler and VM is embedded in this website. I've been iterating on it by just running the TypeScript file:

```
$ bun ./components/visuals/forth/components.tsx
55 # 10th Fibonacci number
```

The visuals are React components with sleeps. In order to display the progress of the different steps (tokenizing, bytecode generation, VM), I first got each step working and then added a callback that receives the current data and then sleeps.

So the VM function is actually async and accepts this callback:

```typescript
// VM
async function vm(program: Program, callback: (
  highlight: { ip: number },
  dataStack: number[],
  returnStack: number[],
  variableTable: number[]
) => Promise<void>) {

  // .. inside VM loop
  await callback({ ip }, dataStack, returnStack, variableTable);
  // ..
}
```

And the component calls it and passes setState functions:

```typescript
// Component
export function VM() {
  // .. inside useEffect
  await vm(program, async (highlight, newDataStack, newReturnStack, newVariableTable) => {
    setHighlightIP(highlight.ip);
    setDataStack([...newDataStack]);
    setReturnStack([...newReturnStack]);
    setVariableTable([...newVariableTable]);
    await new Promise(resolve => setTimeout(resolve, 500));
  });
  // ..
}
```

For the Forth code snippets in this post, I had to write a Prism plugin to get syntax highlighting working. Now that I've learned how to do this, I'll be using this method for syntax highlighting for the more esoteric (or original) programming languages I write about!

Discrepancies

I described my compiler/VM as Forth-like because it's a little bit different from how Forth works.

My implementation compiles to bytecode ahead of time, whereas Forth is traditionally interactive: words are interpreted and executed as they are entered, and only colon definitions are compiled. Traditional Forth also uses threaded code, where compiled words contain lists of addresses pointing to other words, rather than offsets into a single bytecode stream.
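For contrast, here's a minimal sketch of the threaded-code idea: each word is either a primitive or a list of references to other words, and an inner interpreter walks those lists. All names here are illustrative (and recursion stands in for a real return stack); none of this is from the post's implementation.

```typescript
// A word is either a primitive (native code) or a thread: a list of
// references to other words
type Word = { prim?: (ds: number[]) => void; thread?: Word[] };

// The "inner interpreter" walks a thread, executing each word in turn.
// (Real Forths drive this with the return stack; recursion is a
// simplification for the sketch.)
function execute(word: Word, ds: number[]): void {
  if (word.prim) {
    word.prim(ds);
    return;
  }
  for (const w of word.thread ?? []) execute(w, ds);
}

// Primitives
const LIT = (n: number): Word => ({ prim: (ds) => ds.push(n) });
const ADD: Word = { prim: (ds) => ds.push(ds.pop()! + ds.pop()!) };

// A colon definition like `: THREE 1 2 + ;` becomes a list of references
const THREE: Word = { thread: [LIT(1), LIT(2), ADD] };

const ds: number[] = [];
execute(THREE, ds); // ds is now [3]
```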

Real Forth uses a dynamic dictionary that can be altered at runtime with new variables or word definitions. As I mentioned earlier, my word bodies are compiled with jump-over logic in the main execution stream. Also, my variables compile to lit address operations, but real Forth variables push their own address when executed directly.

These are just a few of the differences but I feel like my Forth-like compiler and VM capture enough of the spirit of Forth!
