Go, PET, Let Hen - Curious adventures in (Commodore) BASIC tokenizing

原文

July 5, 2025

Curious adventures in (Commodore) BASIC tokenizing.

Stylized illustration involving a Commodore PET 2001 computer and a hen.

No, this is not about a one-hit-wonder from the charts of 1977, rather, it’s about something that could have happened on a Commodore PET around this time. More generally, it is about a curious bug/feature in early BASICs and how this gave way to an additional keyword in Commodore BASIC for the PET.

Both are related to BASIC syntax and how white space in BASIC text is handled by the interpreter. — Which sets up the topic of today’s installment, namely, the tokenizer routine in early MS 9-digit BASIC for the 6502 and how this evolved over time. As this promises some fun with #BASIC and #6502 code on the #PET and involves some #archeology, as well, we really ought to have a look at this…

To begin with, let’s fire up a PET 2001 with BASIC 1.0, the first ROM version shipped with the PET 2001, also known as “Old ROM”, to start a little experiment:

*** COMMODORE BASIC ***

 7167 BYTES FREE

READY.

10 IF LS = LE THEN GOTO 100
LIST

 10 IF LS = LETHEN GOTO 100
READY.

RUN

?SYNTAX ERROR IN  10
READY.
█

So, what’s happening here?
And why does a perfectly valid BASIC statement throw a sntax error?

Let’s investigate…

As you may probably know, all white space in BASIC is optional and we do not require any of this as a keyword delimiter or separator of any kind. The two following statements work perfectly the same:

100 FOR  I = 0 TO 10  STEP 2

100FORI=0TO10STEP2

However, there’s actually a small difference: while the first one may be more readable, the second one executes a bit faster and requires less memory, for which it is preferred by the greybeards.

Going back to our little experiment, something that may spring to the eye is a tiny, but crucial difference in the BASIC text as entered and the BASIC text as listed:

10 IF LS = LE THEN GOTO 100
LIST

 10 IF LS = LETHEN GOTO 100
READY.

Did you see it? — No, there is no paranormal entity. — Rather, it’s something missing in the listing, namely the blank between “LE” and “THEN”, indicating that this is about a misshap in white space handling.

In order to investigate this further, we have to ask first

Words and Letters, or, What is “Tokenizing”?

An important consequence of not requiring any separation between keywords and other program text is that every letter in the program text can be the beginning of a keyword. E.g., as we encounter

20 LET A=V

We do not know, yet, whether this is the start of a keyword, as in

20 LET A=VAL(A$)

or “V” is part of an identifier (variable name), as in

20 LET A=V1+5

The only way to discern this is by scanning ahead and finding out if this adds up to a valid keyword (a BASIC command) or not.

BTW, as you probably know, the “LET” in this example is entirely optional.
Initially, any BASIC statement started with a keyword (command), and the keyword for an assignment was “LET”. However, this is a bit, well, clunky, it would be nice to have FORTRAN-like formulas for assignments. Since any assignment starts with a variable name (identifier) and every name must start with a letter, we may still tell the start of the statement from the line number, even without a dedicated keyword in front of this. Thus, we make an exception: if it’s not a keyword, this is a variable name and the statement must be an assignment. But, for compatibiliy, we make “LET” still a valid command, but now it’s just optional and not required in any way.

Now, going over a list of all keywords for each letter, comparing each letter to a valid start of a keyword and then scanning ahead to see, if this does indeed match, is a rather costly operation. It would be nice, if we could somehow avoid this.
And this is where the tokenizing comes to the rescue: every time, we enter a line of BASIC, we’ll copied the line to an input buffer, and there will be routine to go over this, in order to discern any keywords. Every time the routine does find a keyword, it will replace it by an integer value not allowed in regular BASIC text. As BASIC is generally restricted to 7-bit ASCII (that is, outside strings), we can use any values starting at 128 (hex 0x80) and above. This way, we do it once and for all. Later, when the runtime routine of the interpreter encounters such a value (a token), it will immediately know what this is about and it will be even able to use this as an index into a lookup table for quick access.

Moreover, in times when memory was still expensive and thus scarce, this reduced the memory requirements for storing a BASIC program quite significantly. In an average BASIC listing, keywords make up for about half of the text, besides line numbers, so the savings will quite dramatically extend the reach of our memory.

What’s not to love about this? Tokens are really the key to BASIC.

In-Memory BASIC Programs

Knowing this, we may have a deeper look into BASIC text as is represented in memory. As for instance, on the example of our offending line of BASIC:

10 IF LS = LE THEN GOTO 100
LIST

 10 IF LS = LETHEN GOTO 100
READY.

The equivalent tokenized program text is stored in the user area of the memory, which starts on the PET at 0x0401:

addr  code                semantics

0401  17 04               link: $0417    (addr. of next line, 16-bit little endian)
0403  0A 00               line# 10       (16-bit binary)
0405  8B                  token IF
0406  20 4C 53 20         ascii « LS »
040A  B2                  token =
040B  20                  ascii « »
040C  88                  token LET
040D  48 45 4E 20         ascii «HEN »
0411  89                  token GOTO
0412  20 31 30 30         ascii « 100»
0416  00                  -EOL-
0417  00 00               -EOT-          (link = null)

At first glance, we can see that BASIC has converted our input string “LE THEN ” into the token for “LET” (0x88) and the remaining text “HEN ”. Moreover, we can see that any keyword tokens are encoded as a single byte with a set sign-bit (n ≥ 0x80) and anything else is preserved as-is in plain ASCII text.

We may also observe that every line starts with a link to the address of the next line and line numbers are stored as binary numbers. (Each line number consumes 2 bytes, regardless of the value. “1” will be stored as a 16-bit number, as will “10000”.) Further, each line is terminated by a binary zero as an end of line marker (EOL) and a null-link to the next line marks the end of the program text (EOT).

An important takeaway from this is that BASIC programs are forward-linked lists. This has consequences for runtime performance, since, for any jumps, like with GOTO and GOSUB, the interpreter has to pick up at the very beginning of this list to search forward for an entry with a matching line number. As a small optimization, Commodore BASIC first compares this to the current line number, which is at a known location, and, if the target is greater or equal, the search starts from the current line. Which also implies that jumping back to the previous line is among the costlier operations.

So, quite clearly, BASIC skipped over the blank separating “LE” from “THEN”, recognizing the keyword (command) “LET” and encoding this as the token ID 0x88. — But, why should it do so?

BASIC and White Space

Initially, BASIC simply ignored any white space, just like FORTRAN and Algol did. The parser simply skipped over any blanks and these were of no significance, whatsoever. This is how Dartmouth BASIC did it and how many versions of BASIC did it in the 1970s. Notably, is this also what you would find in listings.

But this had consequences: e.g., at this time, it wasn’t that clear whether “GOTO” was a single word or two, as in “GO TO”. In English, these are certainly two words and to the interpreter it didn’t matter. So we may be rather undecided on this. — And, indeed, there were plenty of listings, where you could find constructs like:

120 IF A > 10 THEN GO TO 200

In MS BASIC, this is taken care of by the tokenizer routine. As we may observe in our above peek into the in-memory representation, it’s particularly in keywords, where the white space was ignored, while other instances of white space are preserved and faithfully reproduced in any listings.

That is, AppleSoft BASIC, another early version of MS BASIC for the 6502 and a close relative to Commodore BASIC, skips any white space and inserts a single blank after each keyword.
Thus, on the Apple ][, we get (even more obviously giving away, what might have gone wrong):

AppleSoft BASIC on the Apple ][:

]10 IF LS = LE THEN GOTO  100

]LIST

10 IF LS = LET HEN GOTO 100

]█

This is not dissimilar from what Commodore BASIC does with line numbers:

10A=2*0.5
20     PRINT A

LIST

 10 A=2*0.5
 20 PRINT A
READY.

Like AppleSoft BASIC in any other context, it skips over any white space after the line number and inserts a single blank for listings.

Which is also why we can’t have indented text and have to resort to tricks, if we really want indentations:

100 FOR Y=1 TO 20
110 : FOR X = 1 TO 40
120 :   PRINT X*Y
130 : NEXT X
140 NEXT Y

LIST

 100 FOR Y=1 TO 20
 110 : FOR X = 1 TO 40
 120 :   PRINT X*Y
 130 : NEXT X
 140 NEXT Y
READY.

As we may see, Commodore BASIC doesn’t skip any blanks after a separating colon (“:”).

And Numbers? (BTW)

Glad you should ask. It’s a lesser known fact that numbers are another area where Commodore BASIC deliberately skips over any blanks:

### COMMODORE BASIC ###

 15359 BYTES FREE

READY.
10 A= 1 000 000 .00
20 A= A*2
30 PRINT A
LIST

 10 A= 1 000 000 .00
 20 A= A*2
 30 PRINT A
READY.
RUN
 2000000

READY.

While the listing proves that the constant “1 000 000 .00” is faithfully stored in memory, the interpreter doesn’t hick up over the interspersed blanks, rather, it elegantly skips over them.

We can see this also, as we peek into memory:

addr  code                semantics

0401  16 04               link: $0416
0403  0A 00               line# 10
0405  41                  ascii «A»
0406  B2                  token =
0407  20 31 20 30 30 30   ascii « 1 000 000 .00»
040D  20 30 30 30 20 2E   
0413  30 30               
0415  00                  -EOL-
0416  21 04               link: $0421
0418  14 00               line# 20
041A  41                  ascii «A»
041B  B2                  token =
041C  20 41               ascii « A»
041E  AC                  token *
041F  32                  ascii «2»
0420  00                  -EOL-
0421  29 04               link: $0429
0423  1E 00               line# 30
0425  99                  token PRINT
0426  20 41               ascii « A»
0428  00                  -EOL-
0429  00 00               -EOT- (link = null)

But why should it do so? Obviously, this is slow and cumbersome, since the interpreter has to parse each of those constants anew and skip any of those blanks, whenever it encounters a number, often repeatedly so, like in a FOR–NEXT loop.

Well, there‘s more than a single way to represent a number.E.g., all those are the same:

1000
1000.0000
1 000
1 000.00
1E3
1.0E+3
(and so on)

It wouldn’t be that user friendly to type

100 A= 1000.00

just to have this listed back as, say,

LIST

 100 A=1.0E3

The same goes for identifiers (variable names). The interpreter could maintain a reference table and just insert offsets into the program, but, for a dialog-like system, it’s much better to maintain a meaningful symbolic representation for listing and editing (instead of maybe an offset marker or hash, like #34e2). Or, as only the first two characters are significant, it could easily truncate any longer names.

A compiler, on the other hand, can convert this at compile time, it can also resolve any calculations at compile time and even optimize over such precomputations. E.g, our above example

10 A= 1 000 000 .00
20 A= A*2

could become just something along the lines of

<STORE-REAL@FRAME-OFFSET>  0x48  2e6

Meaning, if BASIC is slow (and it is), it is so, because it’s user-friendly.
It’s slow so that it can preserve all the intricacies of the input, in order to reproduce them in a meaningful listing.

That is but for the instances where it doesn’t, like with our “LET HEN” example.

The BASIC of the Grundy NewBrain, the BBC‘s first choice for its computer literacy programme until it became clear that it wouldn’t be near ready in time, did a clever thing, combining the best of both worlds: Grundy’s BASIC maintained an additional pointer for each line, and, as it encountered a line, it wouldn‘t just parse the line and forget this context immediately, rather, it stored a representation of the parsed line on a heap and set the pointer to this stored representation. Therefore, if a line was encountered again, no additional parsing was required. At the same time, the original input was still available for listing and editing. Should the system run low on memory, the stored object code could be still purged and the related pointer reset. While such a purge afforded a new parsing effort for any future visits, the overhead as compared to a tradional BASIC implementation was neglible and the potential gain in speed significant.

Crunch Time

As for now, we have already established that our syntax error isn’t a bug, but a user error: if the keyword routine can skip over blanks to correctly identify constructs like “GO TO”, this implies that no adjacent strings of letters may add up to something that contains a valid BASIC command, even across word boundaries. (That is, outside of quoted string constants, of course.)
In this case, we got the legitimate keyword “LET”, which isn’t a valid right-hand part of a comparison. So, yes, this a syntax error, indeed.

There isn’t any argueing over this: if skipping blanks is a feature, the error is in front of the screen.

Notably, this is not an issue on a PET with the upgraded Commodore BASIC 2.0 (AKA “New ROM”) or, for the matter, with any other BASIC for Commodore 8-bits, thereafter.

Commodore BASIC 2.0, “New ROM”:

### COMMODORE BASIC ###

 7167 BYTES FREE

READY.

10 IF LS = LE THEN GOTO 100
100 PRINT "OK"
LIST

 10 IF LS = LE THEN GOTO 100
 100 PRINT "OK"
READY.

RUN
OK

READY.

So, what is special about Commodore BASIC 1.0 (or any other early flavor of MS BASIC for the 6502) and what does it do differently?

PET BASIC, “Old ROM”

What‘s needed for keyword parsing comes in to parts: a list of keywords and a routine to do the actual work, suggestively named “CRUNCH”, which splits kewords from other text in the input stream and converts them to tokens.

Let‘s have a look at the data first.
The list of keywords starts at 0xC092 (underlined characters indicate a set sign-bit):

addr  code                     petscii

C092        45 4E C4 46 4F D2    ENDFOR
C098  4E 45 58 D4 44 41 54 C1  NEXTDATA
C0A0  49 4E 50 55 54 A3 49 4E  INPUT#IN
C0A8  50 55 D4 44 49 CD 52 45  PUTDIMRE
C0B0  41 C4 4C 45 D4 47 4F 54  ADLETGOT
C0B8  CF 52 55 CE 49 C6 52 45  ORUNIFRE
C0C0  53 54 4F 52 C5 47 4F 53  STOREGOS
C0C8  55 C2 52 45 54 55 52 CE  UBRETURN
C0D0  52 45 CD 53 54 4F D0 4F  REMSTOPO
C0D8  CE 57 41 49 D4 4C 4F 41  NWAITLOA
C0E0  C4 53 41 56 C5 56 45 52  DSAVEVER
C0E8  49 46 D9 44 45 C6 50 4F  IFYDEFPO
C0F0  4B C5 50 52 49 4E 54 A3  KEPRINT#
C0F8  50 52 49 4E D4 43 4F 4E  PRINTCON
C100  D4 4C 49 53 D4 43 4C D2  TLISTCLR
C108  43 4D C4 53 59 D3 4F 50  CMDSYSOP
C110  45 CE 43 4C 4F 53 C5 47  ENCLOSEG
C118  45 D4 4E 45 D7 54 41 42  ETNEWTAB
C120  A8 54 CF 46 CE 53 50 43  (TOFNSPC
C128  A8 54 48 45 CE 4E 4F D4  (THENNOT
C130  53 54 45 D0 AB AD AA AF  STEP+-*/
C138  DE 41 4E C4 4F D2 BE BD  ^ANDOR>=
C140  BC 53 47 CE 49 4E D4 41  <SGNINTA
C148  42 D3 55 53 D2 46 52 C5  BSUSRFRE
C150  50 4F D3 53 51 D2 52 4E  POSSQRRN
C158  C4 4C 4F C7 45 58 D0 43  DLOGEXPC
C160  4F D3 53 49 CE 54 41 CE  OSSINTAN
C168  41 54 CE 50 45 45 CB 4C  ATNPEEKL
C170  45 CE 53 54 52 A4 56 41  ENSTR$VA
C178  CC 41 53 C3 43 48 52 A4  LASCCHR$
C180  4C 45 46 54 A4 52 49 47  LEFT$RIG
C188  48 54 A4 4D 49 44 A4 00  HT$MID$~

The tokenizer routine (AKA CRUNCH) traverses over this list, using a set sign-bit for a word termination after that given character. In other words: a set sign-bit indicates the last character of any keyword. If a match is found, the index into this list added by 0x80 (again, the sign-bit set) provides the token ID.
If no match is found, as indicated by reaching the terminating zero-byte, the character, which was just inspected, must be a plain ASCII character (maybe part of a name) and is consequently copied to the program text as-is.

Thus, the above list translates to the following keyword-token table:

code                   keyword  token  index

45 4E C4               END      0x80   ( 0)
46 4F D2               FOR      0x81   ( 1)
4E 45 58 D4            NEXT     0x82   ( 2)
44 41 54 C1            DATA     0x83   ( 3)
49 4E 50 55 54 A3      INPUT#   0x84   ( 4)
49 4E 50 55 D4         INPUT    0x85   ( 5)
44 49 CD               DIM      0x86   ( 6)
52 45 41 C4            READ     0x87   ( 7)
4C 45 D4               LET      0x88   ( 8)
47 4F 54 CF            GOTO     0x89   ( 9)
52 55 CE               RUN      0x8A   (10)
49 C6                  IF       0x8B   (11)
52 45 53 54 4F 52 C5   RESTORE  0x8C   (12)
47 4F 53 55 C2         GOSUB    0x8D   (13)
52 45 54 55 52 CE      RETURN   0x8E   (14)
52 45 CD               REM      0x8F   (15)
53 54 4F D0            STOP     0x90   (16)
4F CE                  ON       0x91   (17)
57 41 49 D4            WAIT     0x92   (18)
4C 4F 41 C4            LOAD     0x93   (19)
53 41 56 C5            SAVE     0x94   (20)
56 45 52 49 46 D9      VERIFY   0x95   (21)
44 45 C6               DEF      0x96   (22)
50 4F 4B C5            POKE     0x97   (23)
50 52 49 4E 54 A3      PRINT#   0x98   (24)
50 52 49 4E D4         PRINT    0x99   (25)
43 4F 4E D4            CONT     0x9A   (26)
4C 49 53 D4            LIST     0x9B   (27)
43 4C D2               CLR      0x9C   (28)
43 4D C4               CMD      0x9D   (29)
53 59 D3               SYS      0x9E   (30)
4F 50 45 CE            OPEN     0x9F   (31)
43 4C 4F 53 C5         CLOSE    0xA0   (32)
47 45 D4               GET      0xA1   (33)
4E 45 D7               NEW      0xA2   (34)
54 41 42 A8            TAB(     0xA3   (35)
54 CF                  TO       0xA4   (36)
46 CE                  FN       0xA5   (37)
53 50 43 A8            SPC(     0xA6   (38)
54 48 45 CE            THEN     0xA7   (39)
4E 4F D4               NOT      0xA8   (40)
53 54 45 D0            STEP     0xA9   (41)
AB                     +        0xAA   (42)
AD                     -        0xAB   (43)
AA                     *        0xAC   (44)
AF                     /        0xAD   (45)
DE                     ^        0xAE   (46)
41 4E C4               AND      0xAF   (47)
4F D2                  ON       0xB0   (48)
BE                     >        0xB1   (49)
BD                     =        0xB2   (50)
BC                             0xB3   (51)
53 47 CE               SGN      0xB4   (52)
49 4E D4               INT      0xB5   (53)
41 42 D3               ABS      0xB6   (54)
55 53 D2               USR      0xB7   (55)
46 52 C5               FRE      0xB8   (56)
50 4F D3               POS      0xB9   (57)
53 51 D2               SQR      0xBA   (58)
52 4E C4               RND      0xBB   (59)
4C 4F C7               LOG      0xBC   (60)
45 58 D0               EXP      0xBD   (61)
43 4F D3               COS      0xBE   (62)
53 49 CE               SIN      0xBF   (63)
54 41 CE               TAN      0xC0   (64)
41 54 CE               ATN      0xC1   (65)
50 45 45 CB            PEEK     0xC2   (66)
4C 45 CE               LEN      0xC3   (67)
53 54 52 A4            STR$     0xC4   (68)
56 41 CC               VAL      0xC5   (69)
41 53 C3               ASC      0xC6   (70)
43 48 52 A4            CHR$     0xC7   (71)
4C 45 46 54 A4         LEFT$    0xC8   (72)
52 49 47 48 54 A4      RIGHT$   0xC9   (73)
4D 49 44 A4            MID$     0xCA   (74)
00                     <end-of-list>

We may observe a few pecularities:

There are 75 tokens in total.
There’s some optimization in the order of keywords, with frequently occuring keywords like “FOR”, “NEXT”, “INPUT”, “DATA”, “READ”, “DIM”, “LET”, “GOTO” listed first. (The token 0x80, index #0, is a “natural” candidate for “END”.)
The “#” in “INPUT#” and “PRINT#” is not just a decoration, rather, these are keywords in their own right, represented by their own, unique tokens.
This also means that “INPUT#” must be inspected before “INPUT”, as is “PRINT#” before “PRINT”. Otherwise, we’d always match on the respective latter and never catch the former.
(As we’ll see later, this has some consequences.)
Interestingly, “GET#” is not a token, meaning, here, “#” is a decoration that has to be checked for separately.
“SPC(” and “TAB(” include the opening paranthesis, while other function keywords do not.
This also means that “SPC” and “TAB” are valid variable names!
(Only the first two characters will be significant, but the tokenizer routine will not trigger on those names and will copy the full strings to the program text.)
I guess, this another remarkable effort for broader comaptibilty with pre-existing BASIC listings?
While operators are generally tokens, there are no tokens for “>=”, “<=”, or “<>” (or, alternatively, “=>”, “=<”, “><” — yes, BASIC has a double-arrow-operator and a bow-tie operator, but these are just alternative forms of the common relational operators.)
Instead of having their own tokens, these operators are built from combinations of “<”, “>”, and “=”.

The Code — a Tour

So, as we know what’s actually processed, we may have a look at the CRUNCH-y routine.

This starts $C48D and processes zero-terminated input data from the BASIC buffer, which is for BASIC 1.0 located at $000A–$0059 (10–89, 80 bytes long). The initial index from $0000 into this will be available in $C9, pointing to the first non-blank character after the line number (at least 0x0B or decimal 11).

The BASIC input buffer also doubles as the output buffer, meaning, CRUNCH converts the contents of the BASIC buffer in situ. (Due to tokenization, the output will be almost always shorter than the input and is guaranteed to not exceed it. Moreover, while reading starts at the first character after the line number, which is at least at 0x0B, the ouput will follow behind, starting at 0x0A.)
The X register is generally used as an index or cursor for reading from the BASIC buffer, while the Y register is used either as an index into the keyword list or as an output cursor.

Note: With JavaScript enabled, you can hover over a labeled target to highlight the related label.)

C48D           LDX $C9       ;set read cursor to initial char in BASIC buffer
C48F           LDY #$04      ;initialize output cursor ($0005+4 = one before $000A)
C491           STY $60       ;also set this as mode flag (statement/string/DATA)
C493   iC493   LDA $00,X     ;read next char
C495           BPL iC49E     ;sign-bit not set: it's ASCII, skip ahead…
C497           CMP #$FF      ;is it the token for pi ("π")?
C499           BEQ iC4DC     ;yes, branch to copy it…
C49B           INX           ;ignore (graphics) & advance read cursor
C49C           BNE iC493     ;redo for next char, unless zero… (unconditional)
C49E   iC49E   CMP #$20      ;process character; is it a blank?
C4A0           BEQ iC4DC     ;yes, branch to copy it…
C4A2           STA $5B       ;store it as a delimiter
C4A4           CMP #$22      ;is it a quotation mark (`"`)?
C4A6           BEQ iC500     ;yes, branch to copy the string…
C4A8           BIT $60       ;check mode flag
C4AA           BVS iC4DC     ;if bit #6 set ($49: DATA), branch to copy as-is…
C4AC           CMP #$3F      ;is it a question mark ("?")
C4AE           BNE iC4B4     ;no, skip ahead…
C4B0           LDA #$99      ;load token for PRINT
C4B2           BNE iC4DC     ;unconditional branch to save…
C4B4   iC4B4   CMP #$30      ;compare to ASCII "0"
C4B6           BCC iC4BC     ;smaller, not a digit, skip ahead…
C4B8           CMP #$3C      ;compare it to "<"
C4BA           BCC iC4DC     ;branch to copy "0"…"9", ":", ";" as-is…
C4BC   iC4BC   STY $C0       ;keyword search; store a backup of the output cursor
C4BE           LDY #$00      ;load 0
C4C0           STY $5C       ;reset keyword index
C4C2           DEY           ;prepare Y for pre-increment loop
C4C3           STX $C9       ;store a backup of the read cursor position
C4C5           DEX           ;prepare X for pre-increment loop
C4C6   iC4C6   INY           ;increment index into the keyword list
C4C7   iC4C7   INX           ;advance read cursor
C4C8   iC4C8   LDA $00,X     ;read char
C4CA           CMP #$20      ;is it a blank?
C4CC           BEQ iC4C7     ;yes: skip and branch to read the next char…
C4CE           SEC           ;prepare subtraction
C4CF           SBC $C092,Y   ;subtract current keyword char for comparison
C4D2           BEQ iC4C6     ;exact match: continue with next char…
C4D4           CMP #$80      ;is the difference the sign-bit (end of word)?
C4D6           BNE iC507     ;it's a mismatch: prepare for next keyword…
C4D8           ORA $5C       ;it’s a token: $80 OR keyword index (in $5C)
C4DA   iC4DA   LDY $C0       ;restore ouput cursor from backup
C4DC   iC4DC   INX           ;save a byte; pre-increment read cursor
C4DD           INY           ;advance output cursor
C4DE           STA $0005,Y   ;store the processed byte (char or token)
C4E1           LDA $0005,Y   ;post-processing; load again to set flags
C4E4           BEQ iC51A     ;was it zero (EOL)? branch to finalize…
C4E6           SEC           ;prepare subtraction
C4E7           SBC #$3A      ;subtract ":" (statement separator)
C4E9           BEQ iC4EF     ;found end of BASIC statement, skip to set mode flag…
C4EB           CMP #$49      ;was it $83 ($3A+$49, token for DATA)?
C4ED           BNE iC4F1     ;no, skip ahead…
C4EF   iC4EF   STA $60       ;mode-flag: $00,$04=statement, $49=DATA
C4F1   iC4F1   SEC           ;prepare subtraction
C4F2           SBC #$55      ;was it $8F ($3A+$55, token for REM)?
C4F4           BNE iC493     ;no: branch to read & process next input char…
C4F6           STA $5B       ;set the delimiter to zero (EOL)
C4F8   iC4F8   LDA $00,X     ;get next input char
C4FA           BEQ iC4DC     ;zero (EOL), branch to store it as-is…
C4FC           CMP $5B       ;compare it to delimiter byte ($00 or $22)
C4FE           BEQ iC4DC     ;found delimiter, branch to store it…
C500   iC500   INY           ;simple copy (REM, string), increment output cursor
C501           STA $0005,Y   ;store current byte
C504           INX           ;advance read cursor
C505           BNE iC4F8     ;branch to copy next byte, unless zero… (unconditional)
C507   iC507   LDX $C9       ;restore read cursor from backup
C509           INC $5C       ;increment keyword index
C50B   iC50B   INY           ;search ahead in list for end of keyword
C50C           LDA $C091,Y   ;read keyword byte
C50F           BPL iC50B     ;last char (sign-bit set)? if not, get next byte…
C511           LDA $C092,Y   ;peek into next byte
C514           BNE iC4C8     ;branch to start over and try the next keyword…
C516           LDA $00,X     ;end of list, no match: re-read the input character
C518           BPL iC4DA     ;unconditional branch to copy it as-is…
C51A   iC51A   STA $0007,Y   ;end of line; store zero byte (could be again)
C51D           LDA #$09      ;reset start position for read cursor to $09
C51F           STA $C9       ;store it
C521           RTS           ;done

Well, this may be quite a bit to swollow at once. Let’s have look at this in decent bytes.
— (I am not sorry!) —

The first part is just a bit of initialization, which also introduces a few important concepts:

C48D  A6 C9              LDX $C9         ;first char position from $0000
C48F  A0 04              LDY #$04        ;initial write index from $0005
C491  84 60              STY $60

First, we set up our read cursor, reading it from location $C9. This will be initially set (and reset) to 0x09, but as we enter the routine, this will have been advanced to point to the very first non-blank character after the line number. (So we have already skipped over any leading white space.) Notably, this will be used as an index from $0000.
The output/write cursor is initialized to 0x04. This an offset from $0005, so this will point to $0009, just before the start of the BASIC buffer.

Finally, we also store 0x04 in $60, which is used as mode flag. (0x04 is coincidentially the ASCII code for EOT.) Over the run of the routine, this mode flag may also receive the values 0 or 0x49. But we’ll only care about bit #6 of this value, which will only set in 0x49, indicating that we’re reading a “DATA” section.

C493  B5 00      iC493   LDA $00,X      ;fetch char from buffer
C495  10 07              BPL iC49E
C497  C9 FF              CMP #$FF       ;`π`?
C499  F0 41              BEQ iC4DC
C49B  E8                 INX
C49C  D0 F5              BNE iC493

This is the head of the main read-process loop.
“LDA $00,X” loads the next byte/character to inspect. First, we check the sign-bit. If it’s unset, this means we’re dealing with a regular ASCII character and skip ahead to the next block. Otherwise, we test for 0xFF, the code for pi (π). If it is pi, we branch to write it to the output position as-is.

If we’re still going, the sign-bit is set and it’s not pi. Meaning, it must be an illegal (shifted or graphics) character, which is not valid input text. Hence, we advance the read cursor and branch unconditionally to the beginning of the loop for the next character, ignoring the invalid byte.

C49E  C9 20      iC49E   CMP #$20       ;blank?
C4A0  F0 3A              BEQ iC4DC
C4A2  85 5B              STA $5B
C4A4  C9 22              CMP #$22       ;`"`?
C4A6  F0 58              BEQ iC500
C4A8  24 60              BIT $60
C4AA  70 30              BVS iC4DC
C4AC  C9 3F              CMP #$3F       ;`?`?
C4AE  D0 04              BNE iC4B4
C4B0  A9 99              LDA #$99
C4B2  D0 28              BNE iC4DC
C4B4  C9 30      iC4B4   CMP #$30       ;`0`?
C4B6  90 04              BCC iC4BC
C4B8  C9 3C              CMP #$3C       ;`<`?
C4BA  90 20              BCC iC4DC

Here, we’re back for valid input characters. Time to handle some further special cases:

First, we check for a blank (0x20), and if it is, we branch to $C4DC to copy it as-is.

Then, we store the current character in $5B to be used as a string delimiter later on. (This will only matter, if the current character is a quotation mark.)
Let’s check, if it actually is a quotation mark (0x22): if so, we branch to $C500 to copy the entire string.

Now it‘s $60’s heyday: the BIT instruction transfers (besides other things) bit #6 from our mode flag into the V flag of the processor status register. If it’s set, this indicates that we’re currently reading a DATA section and we branch (again) to that part at $C4DC to copy this as-is.

The next check is for 0x39 (“?”): if it’s not a question mark, we skip ahead. But, if it is, this is the standard shorthand for PRINT (which is not in our keyword list) and we load the corresponding token number (0x99) and branch unconditionally (we just loaded the non-zero value 0x99, therefor the zero flag is unset) to $C4DC to write this byte.

Finally, we check for digits (0…9), “:” (0x3A) and “;” (0x3B), which can be copied at $C4DC, as well.

Now, if we’re still processing, this means that it will be probably a letter or an operator. Time to check for keywords…

C4BC  84 C0      iC4BC   STY $C0
C4BE  A0 00              LDY #$00
C4C0  84 5C              STY $5C        ;keyword index
C4C2  88                 DEY
C4C3  86 C9              STX $C9
C4C5  CA                 DEX

Let’s prepare for the keyword search: First, we have to backup our write cursor, since we will now use the Y register as an index into our keyword list, and also, as we’re at it, we reset Y to zero. Next, we reset the the counter in $5C, which keeps track of the current keyword index, to zero, as well. Then, we decrement the read index into the keyword list (in Y) to -1, because our main search loop will start with an increment.

We also store a backup of our buffer read cursor (in X) at $C9 (the same pointer, we initialized it from), since we’re going to probingly read ahead and will probably have to start over again. And, just as before, we decrement it to prepare for the coming pre-increment loop.

C4C6  C8         iC4C6   INY
C4C7  E8         iC4C7   INX
C4C8  B5 00      iC4C8   LDA $00,X
C4CA  C9 20              CMP #$20       ;blank?
C4CC  F0 F9              BEQ iC4C7
C4CE  38                 SEC
C4CF  F9 92 C0           SBC $C092,Y
C4D2  F0 F2              BEQ iC4C6
C4D4  C9 80              CMP #$80       ;sign-bit?
C4D6  D0 2F              BNE iC507
C4D8  05 5C              ORA $5C        ;keyword index

Welcome to the main attraction, this is why we came here: the inner loop of the keyword search!

We start by advancing both the index into the keyword list and the read cursor and read the next byte. If it’s a blank (0x20), we branch back to get read the next character from the buffer.
That’s it: this is how the tokenizer routine skips specially over any white space in keywords!

If it’s a non-blank character, we’re engaging into some destructive tests:
First, we subtract the corresponding byte from the keyword list. If the result is zero, we have an exact match, meaning, we have still a valid candidate, but not yet reached the final character, and redo for the next pair.

Next, we check for the end of a keyword, as indicated by the difference amounting to exactly 0x80. If the difference amounts to anything else, we have a mismatch and branch to $C507, in order to prepare for another go with the next keyword candidate.
If, on the other hand, the difference is 0x80, which is a set sign-bit, we successfully advanced to the very last character of the given keyword and matched it. By OR-ing this remaining 0x80 with the current keyword index (as stored in $5C) we get our token.

This should be worth a brief contemplation: in a byte-size subtraction, it doesn’t matter, which of the two operands has the sign-bit set and which one not: (0x80+x)-x ≡ x-(0x80+x) ≡ 0x80.
Since there is no carry involved to propagate the sign, this is essentially an XOR operation. (Which is also the raison d'être of the overflow flag: otherwise, we’d never know.)
Indeed, the same could have been achieved by an EOR instruction — and even a bit more economically so. (That it is not so, hints at an origin in a more general codebase adn option to encode this in a different manner.)

Coincidentially (or, rather, very much so by design), a shifted PETSCII character is the same as its unshifted counterpart, with a set sign-bit. This is how BASIC abbreviations work in Commodore BASIC: simply by tricking the tokenizer routine into matching a keyword early. If we were successfully matching a candidate word up to here and the difference is 0x80, the routine assumes that it reached the end of the keyword and matched the entire word: voila, here is your token!

On the other hand, this is also responsible for some awkwardness in this: e.g., if we enter “iN” or “inpU”, this matches “INPUT#” rather than the much more frequent “INPUT”, because the former is tested first (compare the order of the keyword list). And necessarily so, otherwise the tokenizer routine couldn’t match “INPUT#”, at all. And the same applies to “PRINT#”.

Skipping out-of-order ahead to $C507, we get to where the loop is prepared for a flying start-over to check against the next keyword candidate in the list (which is where to branched to in case of a missmatch):

C507  A6 C9      iC507   LDX $C9
C509  E6 5C              INC $5C        ;keyword index
C50B  C8         iC50B   INY
C50C  B9 91 C0           LDA $C091,Y
C50F  10 FA              BPL iC50B      ;last keyword char?
C511  B9 92 C0           LDA $C092,Y
C514  D0 B2              BNE iC4C8      ;end of list?
C516  B5 00              LDA $00,X
C518  10 C0              BPL iC4DA      ;write it (unconditional)

First, we restore our buffer read cursor to where we started trying to match a keyword, then, we increment our counter into the keyword list (as a base for a potential token ID).

Then we advance our index into the keyword list to the next end of word: starting with the usual pre-increment, we read the current byte again. Mind that we’re not reading from $C092 as before, but from $C091. And we repeat this until we find a byte with a set sign-bit.

Having thus reached the end of the word, we peek ahead, this time reading from $C092, to see, if this the zero-byte marking the end of the keyword list, or if there’s more to do. If there is, we branch to the entrance of our keyword search loop with X and Y already set up.

In case we exhausted the list and read a zero-byte, we just write the current input character as-is. For this, we read the current input character from the BASIC buffer fresh again and branch unconditionally (there is no way of an overflow of the index register for a buffer that ends at $0059) to $C4DA, where we will write this to the output.

So far, we’ve seen the greater workings of the loop. If we previously fell through with a valid BASIC token in the accumulator, we first restore the output cursor from its backup:

C4DA  A4 C0      iC4DA   LDY $C0

Then it’s time to actually store a processed byte, which is either a token or a raw character (this is the very address, we branched to earlier for various special cases):

C4DC  E8         iC4DC   INX
C4DD  C8                 INY
C4DE  99 05 00           STA $0005,Y
C4E1  B9 05 00           LDA $0005,Y
C4E4  F0 34              BEQ iC51A      ;EOL?

After a now customary preincrement to advance both the read and write cursors to the respective next buffer position, we simply store the current byte — just to load it again, do do some post-processing tests. First, we check if we just stored a zero-byte (EOL), indicative of having reached the of the line. If so, we branch to $C51A to finalize.

(We may observe that both instructions for write and read-back are of a 3-byte, full address format, where a zero-byte address mode would have done perfectly. There’s no technical reason for this, since there’s no chance that we could exceed the indexing range for a buffer, which ends at $59. There’s obvious opportunity for further optimzation. But, more importantlty, it’s another hint at this having been derived from a more general codebase.)

If still in business and there’s still more input to process, we engage in another round of destructive testing:

C4E6  38                 SEC
C4E7  E9 3A              SBC #$3A       ;`:`
C4E9  F0 04              BEQ iC4EF
C4EB  C9 49              CMP #$49       ;$83 = DATA?
C4ED  D0 02              BNE iC4F1
C4EF  85 60      iC4EF   STA $60
C4F1  38         iC4F1   SEC
C4F2  E9 55              SBC #$55       ;$8F = REM?
C4F4  D0 9D              BNE iC493      ;to entrance of main loop

First, we subtract 0x3A, the code for a colon (“:”). If it is now zero, it’s a match and we skip ahead to store this as the current mode flag in 0x60. Otherwise, we compare this to 0x49.
0x3A + 0x49 = 0x83, which is — consulting our keyword-token table — the token for “DATA”. If it is now 0x49, it must have been a DATA token before the subtraction, and we skip ahead to store this remainder as the new mode flag.

In any other case, we skip ahead and subtract 0x55:
0x3A + 0x55 = 0x8F, which happens to be the token for “REM”, and, if the subtraction yields a zero in the accumulator, we have a match.
If it’s not a REM statement, we branch back to the entrance of our read-process loop at $C493 to fetch and process the next input character.

C4F6  85 5B              STA $5B       ;delimiter
C4F8  B5 00      iC4F8   LDA $00,X
C4FA  F0 E0              BEQ iC4DC     ;EOL?
C4FC  C5 5B              CMP $5B       ;delimiter met?
C4FE  F0 DC              BEQ iC4DC
C500  C8         iC500   INY
C501  99 05 00           STA $0005,Y
C504  E8                 INX
C505  D0 F1              BNE iC4F8      ;write it (unconditional)

If it was “REM”, we clear the delimiter in 0x5B, in order to read to the end of line (EOL) and to consume the remaining line entirely.

What follows is the loop that copies a continous string of characters, as for a REM clause or a quoted string. We load the current character, and, if it’s not the end of the line, we compare it to the delimiter. If it is the end of line or the delimiter, we jump back to $C4F8 to write the character as usual (where also the EOL zero-byte will be finally caught).

Otherwise, we advance the output cursor and write the byte. And, after a pre-increment of the input cursor, we unconditionally branch to the entrance of this copy loop. (Again, there is no way to overflow on X for a buffer of 80 bytes length.)

Mind that this close copy loop applies to quoted strings and REM statements only. Notably, REM statements may contain shifted characters, while, due to a bug in the LIST routine, these will be expanded to their full keyword equivalent in any listing, as if these were tokens.

C51A  99 07 00   iC51A   STA $0007,Y
C51D  A9 09              LDA #$09
C51F  85 C9              STA $C9
C521  60                 RTS

The last few lines are finalizing the transformation and we only get here, if we read a terminating zero-byte. First we write this zero-byte to the buffer (we may have done so previously, but better to be sure) and after resetting the index for the input cursor to 0x09, we finally exit the routine.

— Phew! —

Closing Observations

By this, the BASIC line

10 PRINT A+5

will be transformed in the buffer from

0000: 4C 30 D1 00 00 00 7F 7F  L0......
0008: 0A 00 31 30 20 50 52 49  ..10 PRI
0010: 4E 54 20 41 2B 35 00 44  NT A+5.D

into the equivalent tokenized code:

0000: 4C 30 D1 00 00 00 7F 7F  L0......
0008: 0A 00 99 20 41 AA 35 00  ... A.5.
0010: 4E 00 20 41 2B 35 00 44  N. A+5.D

(Mind the stray zero-byte at $0011, an artefact of the final better-sure-than-sorry write operation, which happened after the byte was already written and the pointers (cursors in X and Y) were pre-incremented again. We may also observe that this does no harm and that there’s also a termination in the correct memory location.)

If in direct mode, this code can be executed immediately from here, otherwise, it will be copied to user memory as (part of) a stored program.
We may also observe the 16-bit binary equivalent (0A 00) of the already processed line number immediately before this, at locations $0008 and $0009.

A few observations:

Overall, this is a rather efficient, self-contained routine. There are only branch instructions and no jumps, saving a few of those precious ROM bytes.
There are also a few inefficiencies, owed to the fact that this is from a more general codebase and wasn’t specially adapted to the BASIC buffer being located in the zero-page (as for the codebase, it could be located anywhere). Or the end-of-keyword matching that may have been performed by an EOR instruction more compactly.
There’s also a major inefficiency, namely with the handling of a terminating zero-byte. Not only does the routine write this in most cases repeatedly, it’s also only checked for after a byte (character or token) has been written and reloaded, which happens only after the attempt to match a keyword. In other words, as we load a zero-byte at $C493, we pass all the checks for special cases and enter the keyword matching routine at $C4BC, just to turn a victory lap over all the 75 possible keywords. Only then we fall through to writing the byte at $C4E4, where the zero-byte will be finally detected, but only after the write operation, and from here we finally jump to the end of the routine, just to write this byte again.
The implementation of the special skipping over any white space in keywords is rather unspectacular: just a simple “CMP #$20” followed by a “BEQ” branch instruction (at $C4CA and $C4CC).
And there’s an actual bug. A subtle one and only concerning an edge-case, but still a bug!

The CRUNCH Bug

This bug is exclusive to Commodre BASIC, as it is related to PETSCII encoding. As we have seen, keyword abbreviations are really a side effect of how shifted characters are encoding in PETSCII, matching a character with a set sign-bit as stored in the keword list. The tokenizer routine is entirely agnostic of this.

This also means that we can enter a keyword terminating character. While this will be rejected in a leading position, it will be processed like any other byte, when we read ahead trying to match any keyword candidates. As far as the keyword search code is concerned, any such premature matching of the terminating difference of 0x80 is just a side effect of the subtraction working either way, regardless of which of the two compared bytes has the sign-bit set.

Now, what happens, if we enter a shifted a shifted character at the final position of a keyword? We’ll have a perfect character match with zero difference! Meaning, the routine will still go on, trying to match what must remaining characters of the keyword candidate, thus spilling over into next keyword candidate, but without incrementing the count of keywords which have been inspected already. Should this eventually match (be it on this next keyword, or on any consecuitive try), the keyword count will be lagging behind by one, resulting in an incorrect token.

We can demonstrate this by a little trick of magic:

My attractive assistant will now enter some nonsensical BASIC code containing two consecutive keywords, and by the magic touch of the RETURN key, one of these keywords will vanish and the line will be transformed into perfectly valid BASIC!

To improve the performance and to amuse our esteemed audience, we will switch for this to the lower-case/upper-case character set:

*** commodore basic ***

7167 bytes free

ready.
100 rem a subroutine:
110 print "hocus-pocus!"
120 return

rem *and now with a touch of magic*

ready.
200 gosuBreturn 100

run 200
hocus-pocus!

ready.
list 200

200 gosub 100
ready. — Our little demonstration of magic.

I hope, you’re suitably impressed!

Peeking into memory, we can prove that the magical transformation is complete and not a fluke:

addr  code                semantics

0430  3A 04               link: $043A
0432  C8 00               line# 200
0434  8D                  token GOSUB
0435  20 31 30 30         ascii « 100»
0439  00                  -EOL-
043A  00 00               -EOT-

Now, how does this work?

Well, “gosuB” is number 13 in the keyword list (token 0x8D) and “returN” comes immedtiatly after this, at index 14 (token 0x8E).
As the keyword matching code runs away over the perfect match of the shifted “B” in both the input and the keyword candidate (indicating that we’re still in the middle of a valid keyword), it eventually matches “return”, but without having incremented the keyword count. Thus, the resulting token is 0x8D, the token for GOSUB, while the entire string has been consumed and the tokenized line amounts to just “200 GOSUB 100”.
The input string “return”, though, is gone, gone with the magic wind of a run-away error!

And, by, this, we‘ve seen what‘s to be seen, in tokenizing with Commodore BASIC 1.0.

Some other BASICs do interesting and clever things with tokenizing, as well.

Sinclair BASIC for the ZX computers is particularly clever: it outsources tokenizing entirely to the user. When you press a key on the infamous membrane keyboard for one of the color-coded BASIC commands, you already enter a token. This is not just a shorthand, on the contrary, there is no way of typing a BASIC command letter by letter — and there’s simply no code to handle this.
But this is not all: as there is no need to parse the line for tokenizing, the BASIC input routine can engage in a syntax check, instead, and reject any erroneous lines before even running them. And, since the syntax has been checked on input already, there is no need to check this again on runtime, when the little machines are already pretty taxed by sharing their processor between BASIC execution and screen rendering. By doing the syntax check early and once for all, this has a pay off for any execution of a line of BASIC further on. Thus, the ZX computers can save some ROM space and some run time, as well.

(There’s no way for MS BASIC to pull off a similar trick, since it had been designed for serial IO via a terminal originally, and has thus to handle a serial stream of ASICII-like input. And there are remenants of this spread all over the code, even in its incarnation for the Commodore 8-bits, which already incorporate a keyboard and screen rendering. After all, Commodore BASIC is just another iteration of MS 9-digit BASIC, like for the KIM-1, which features none of these amenities.)

Further Developments

As already indicated, this was not to last. The upgraded “New ROM” with BASIC 2.0 (the one introducing proper IEEE488 disk support), brought various changes and modifications, also doing away with skipping over white space in keywords. And “LET HEN” was no more.

BASIC 2.0 “New ROM”

All that had to be done to mend the “LET HEN” bug was the ommision of two instructions.

The BASIC input buffer is now located at $0200 to $0250 (moved to outside of the zero-page — we can see now why those write instructions hadn’t been zero-page instructions, before) and the there’s a related extra instruction near the end for resetting the read cursor (for use as a double-byte pointer at $77 / $78).

Otherwise the code is still the same, as can be seen in this comparison:

BASIC 1.0 “Old ROM”

C48D            LDX $C9
C48F            LDY #$04
C491            STY $60
C493    iC493   LDA $00,X
C495            BPL iC49E
C497            CMP #$FF
C499            BEQ iC4DC
C49B            INX
C49C            BNE iC493
C49E    iC49E   CMP #$20
C4A0            BEQ iC4DC
C4A2            STA $5B
C4A4            CMP #$22
C4A6            BEQ iC500
C4A8            BIT $60
C4AA            BVS iC4DC
C4AC            CMP #$3F
C4AE            BNE iC4B4
C4B0            LDA #$99
C4B2            BNE iC4DC
C4B4    iC4B4   CMP #$30
C4B6            BCC iC4BC
C4B8            CMP #$3C
C4BA            BCC iC4DC
C4BC    iC4BC   STY $C0
C4BE            LDY #$00
C4C0            STY $5C
C4C2            DEY
C4C3            STX $C9
C4C5            DEX
C4C6    iC4C6   INY
C4C7    iC4C7   INX
C4C8    iC4C8   LDA $00,X
C4CA            CMP #$20    
C4CC            BEQ iC4C7   
C4CE            SEC
C4CF            SBC $C092,Y
C4D2            BEQ iC4C6
C4D4            CMP #$80
C4D6            BNE iC507
C4D8            ORA $5C
C4DA    iC4DA   LDY $C0
C4DC    iC4DC   INX
C4DD            INY
C4DE            STA $0005,Y
C4E1            LDA $0005,Y
C4E4            BEQ iC51A
C4E6            SEC
C4E7            SBC #$3A
C4E9            BEQ iC4EF
C4EB            CMP #$49
C4ED            BNE iC4F1
C4EF    iC4EF   STA $60
C4F1    iC4F1   SEC
C4F2            SBC #$55
C4F4            BNE iC493
C4F6            STA $5B
C4F8    iC4F8   LDA $00,X
C4FA            BEQ iC4DC
C4FC            CMP $5B
C4FE            BEQ iC4DC
C500    iC500   INY
C501            STA $0005,Y
C504            INX
C505            BNE iC4F8
C507    iC507   LDX $C9
C509            INC $5C
C50B    iC50B   INY
C50C            LDA $C091,Y
C50F            BPL iC50B
C511            LDA $C092,Y
C514            BNE iC4C8
C516            LDA $00,X
C518            BPL iC4DA
C51A    iC51A   STA $0007,Y

C51D            LDA #$09
C51F            STA $C9
C521            RTS

BASIC 2.0 “New ROM”

C495            LDX $77
C497            LDY #$04
C499            STY $09
C49B    iC49B   LDA $0200,X
C49E            BPL iC4A7
C4A0            CMP #$FF
C4A2            BEQ iC4E2
C4A4            INX
C4A5            BNE iC49B
C4A7    iC4A7   CMP #$20
C4A9            BEQ iC4E2
C4AB            STA $04
C4AD            CMP #$22
C4AF            BEQ iC507
C4B1            BIT $09
C4B3            BVS iC4E2
C4B5            CMP #$3F
C4B7            BNE iC4BD
C4B9            LDA #$99
C4BB            BNE iC4E2
C4BD    iC4BD   CMP #$30
C4BF            BCC iC4C5
C4C1            CMP #$3C
C4C3            BCC iC4E2
C4C5    iC4C5   STY $6E
C4C7            LDY #$00
C4C9            STY $05
C4CB            DEY
C4CC            STX $77
C4CE            DEX
C4CF    iC4CF   INY
C4D0            INX
C4D1    iC4D1   LDA $0200,X


C4D4            SEC
C4D5            SBC $C092,Y
C4D8            BEQ iC4CF
C4DA            CMP #$80
C4DC            BNE iC50E
C4DE            ORA $05
C4E0    iC4E0   LDY $6E
C4E2    iC4E2   INX
C4E3            INY
C4E4            STA $01FB,Y
C4E7            LDA $01FB,Y
C4EA            BEQ iC522
C4EC            SEC
C4ED            SBC #$3A
C4EF            BEQ iC4F5
C4F1            CMP #$49
C4F3            BNE iC4F7
C4F5    iC4F5   STA $09
C4F7    iC4F7   SEC
C4F8            SBC #$55
C4FA            BNE iC49B
C4FC            STA $04
C4FE    iC4FE   LDA $0200,X
C501            BEQ iC4E2
C503            CMP $04
C505            BEQ iC4E2
C507    iC507   INY
C508            STA $01FB,Y
C50B            INX
C50C            BNE iC4FE
C50E    iC50E   LDX $77
C510            INC $05
C512    iC512   INY
C513            LDA $C091,Y
C516            BPL iC512
C518            LDA $C092,Y
C51B            BNE iC4D1
C51D            LDA $0200,X
C520            BPL iC4E0
C522    iC522   STA $01FD,Y
C525            DEC $78     
C527            LDA #$FF
C529            STA $77
C52B            RTS

Accordingly, “LE THEN” is tokenized as expected:

### COMMODORE BASIC ###

 7167 BYTES FREE

READY.

10 IF LS = LE THEN GOTO 100

LIST

 10 IF LS = LE THEN GOTO 100
READY.

And, as in memory:

addr  code                semantics

0401  17 04               link: $0417
0403  0A 00               line# 10
0405  8B                  token IF
0406  20 4C 53 20         ascii « LS »
040A  B2                  token =
040B  20 4C 45 20         ascii « LE »
040F  A7                  token THEN
0410  20                  ascii « »
0411  89                  token GOTO
0412  20 31 30 30         ascii « 100»
0416  00                  -EOL-
0417  00 00               -EOT-

But now, there was a problem with two-word “GO TO”, the traditional and most frequent use of white space skipping in keywords. For this, the keyword list was amended for another token, “GO”, token ID 0xCB:

code                   keyword  token  index

...
4C 45 46 54 A4         LEFT$    0xC8   (72)
52 49 47 48 54 A4      RIGHT$   0xC9   (73)
4D 49 44 A4            MID$     0xCA   (74)
47 CF                  GO       0xCB   (75)
00                     <end-of-list>

The new token “GO” bears all the hallmarks of an afterthought: It works only in place of a plain “GOTO” and relies on the already existing token “TO” (as in “FOR … TO”). As there is no token for “SUB”, we can‘t use it for “GOSUB” and a two-word “GO TO” isn’t checked for in a “ON … GOTO …” construct, either.

But, it’s for this new token that we can still write and successfully execute

100 GO TO 200

and even:

100 IF A > B THEN GO TO 200

But, we can’t use “GO TO’ in place of “THEN”, like “GOTO”:

100 IF A > B GO TO 200
RUN

?SYNTAX ERROR IN  100
█

While more of a superficial fix, “GO” makes for a rather baffling new effect of the CRUNCH run-away bug:
(here for better readability in the lower-case/upper-case character set)

### commodore basic ###

15359 bytes free

ready.
10 gosuB 100

list

10 mid$su 100
ready. — A curious run-away bug in conjunction with “`GO`”.

What happens here? Well, when the routine runs away over the shifted “B” in “gosubB” and subsequently fails on matching the next keyword candidate, “RETURN” as the remaining part, it continues the keyword search as usual, but with the keyword counter now lagging behind by one. As it finally finds “GO” at the very end of the list, the resulting token is thus 0xCA instead of the correct 0xCB. — And 0xCA is really the token for “MID$”, found just before “GO” in the keword list!

Subsequently, the remaining “suB” is parsed as plain ASCII text, with the shifted letter “B” rejected.
Thus, we end up with the following tokenized line of BASIC:

addr  code                semantics

0401  0D 04               link: $040D
0403  0A 00               line# 10
0405  CA                  token MID$
0406  53 55 20 31 30 30   ascii «SU 100»
040C  00                  -EOL-
040D  00 00               -EOT-

— Yay, this worked out greatly! —

Having seen the major attraction, we may reassueredly turn our attention to the next major iteration, which is

Commodore BASIC 4.0

This one came with new commands for disk control, some new PETSCII control characters (like for switching the character set or setting tabulators), and featured an enhanced editor with support for things like sub-windowing for proper business use on the 40xx/80xx series. Since this required another 4K of ROM, at 0xB000– 0XBFFF, this wasn’t available for the original PET 2001 (the one with the built-in Datasette), which lacked the required ROM slot. But it was available as an upgrade for the PET 2001-N and other non-CRTC PETs (but here, the code for the new editor features was disabled, while the BASIC ROMs where the same).

Here’s the new version in comparison to Commodore BASIC 2.0 (“New ROM”), it shared the system addresses (zero-page, etc.) with:

BASIC 2.0 “New ROM”

C495            LDX $77
C497            LDY #$04
C499            STY $09
C49B    iC49B   LDA $0200,X
C49E            BPL iC4A7
C4A0            CMP #$FF
C4A2            BEQ iC4E2
C4A4            INX
C4A5            BNE iC49B
C4A7    iC4A7   CMP #$20
C4A9            BEQ iC4E2
C4AB            STA $04
C4AD            CMP #$22
C4AF            BEQ iC507
C4B1            BIT $09
C4B3            BVS iC4E2
C4B5            CMP #$3F
C4B7            BNE iC4BD
C4B9            LDA #$99
C4BB            BNE iC4E2
C4BD    iC4BD   CMP #$30
C4BF            BCC iC4C5
C4C1            CMP #$3C
C4C3            BCC iC4E2
C4C5    iC4C5   STY $6E
C4C7            LDY #$00
C4C9            STY $05

C4CB            DEY       
C4CC            STX $77   
C4CE            DEX       




C4CF    iC4CF   INY       
C4D0            INX       



C4D1    iC4D1   LDA $0200,X
C4D4            SEC
C4D5            SBC $C092,Y 
C4D8            BEQ iC4CF
C4DA            CMP #$80
C4DC            BNE iC50E
C4DE            ORA $05
C4E0    iC4E0   LDY $6E
C4E2    iC4E2   INX
C4E3            INY
C4E4            STA $01FB,Y
C4E7            LDA $01FB,Y
C4EA            BEQ iC522
C4EC            SEC
C4ED            SBC #$3A
C4EF            BEQ iC4F5
C4F1            CMP #$49
C4F3            BNE iC4F7
C4F5    iC4F5   STA $09
C4F7    iC4F7   SEC
C4F8            SBC #$55
C4FA            BNE iC49B
C4FC            STA $04
C4FE    iC4FE   LDA $0200,X
C501            BEQ iC4E2
C503            CMP $04
C505            BEQ iC4E2
C507    iC507   INY
C508            STA $01FB,Y
C50B            INX
C50C            BNE iC4FE
C50E    iC50E   LDX $77
C510            INC $05

C512    iC512   INY         
C513            LDA $C091,Y 





C516            BPL iC512
C518            LDA $C092,Y 

C51B            BNE iC4D1
C51D            LDA $0200,X
C520            BPL iC4E0
C522    iC522   STA $01FD,Y
C525            DEC $78
C527            LDA #$FF
C529            STA $77
C52B            RTS

BASIC 4.0

B4FB            LDX $77
B4FD            LDY #$04
B4FF            STY $09
B501    iB501   LDA $0200,X
B504            BPL iB50D
B506            CMP #$FF
B508            BEQ iB554
B50A            INX
B50B            BNE iB501
B50D    iB50D   CMP #$20
B50F            BEQ iB554
B511            STA $04
B513            CMP #$22
B515            BEQ iB579
B517            BIT $09
B519            BVS iB554
B51B            CMP #$3F
B51D            BNE iB523
B51F            LDA #$99
B521            BNE iB554
B523    iB523   CMP #$30
B525            BCC iB52B
B527            CMP #$3C
B529            BCC iB554
B52B    iB52B   STY $6E
B52D            LDY #$00
B52F            STY $05

B531            STX $77    
B533            LDA #$B0   
B535            STA $20    
B537            LDA #$B2   
B539            STA $1F    
B53B            BNE iB544  

B53D    iB53D   INX        
B53E            INC $1F    
B540            BNE iB544  
B542            INC $20    

B544    iB544   LDA $0200,X
B547            SEC
B548            SBC ($1F),Y 
B54A            BEQ iB53D
B54C            CMP #$80
B54E            BNE iB580
B550            ORA $05
B552    iB552   LDY $6E
B554    iB554   INX
B555            INY
B556            STA $01FB,Y
B559            LDA $01FB,Y
B55C            BEQ iB599
B55E            SEC
B55F            SBC #$3A
B561            BEQ iB567
B563            CMP #$49
B565            BNE iB569
B567    iB567   STA $09
B569    iB569   SEC
B56A            SBC #$55
B56C            BNE iB501
B56E            STA $04
B570    iB570   LDA $0200,X
B573            BEQ iB554
B575            CMP $04
B577            BEQ iB554
B579    iB579   INY
B57A            STA $01FB,Y
B57D            INX
B57E            BNE iB570
B580    iB580   LDX $77
B582            INC $05

B584    iB584   LDA ($1F),Y 
B586            PHP         
B587            INC $1F     
B589            BNE iB58D   
B58B            INC $20     
B58D    iB58D   PLP         

B58E            BPL iB584
B590            LDA ($1F),Y 

B592            BNE iB544
B594            LDA $0200,X
B597            BPL iB552
B599    iB599   STA $01FD,Y
B59C            DEC $78
B59E            LDA #$FF
B5A0            STA $77
B5A2            RTS

The changes are really just about how the keyword list is accessed. While it was previously done as in

SBC $C092,Y         ;$C4D5
LDA $C091,Y         ;$C513

the keyword list is now accessed in post-incremented indirect address mode via a pointer in 0x1F / 0x20:

SBC ($1F),Y         ;$B548
LDA ($1F),Y         ;$B590

Where previously the buffer was transformed in-place, just to be copied to memory in another step, now, thanks to the use of this pointer, we may write the output to anywhere in memory, just as we please.

All other changes are related to this, setting and updating the pointer in 0x1F and 0x20 accordingly.
(Moreover, the read cursor in X is now generally adjusted first, before the new pointer is handled.)

But there’s also a curious peculiarity to this new code, found at $B584:

B584    iB584   LDA ($1F),Y
B586            PHP              ;push A to stack
B587            INC $1F
B589            BNE iB58D
B58B            INC $20
B58D    iB58D   PLP              ;pull A from stack

What are PHA and PLA meant to achieve? Neither BNE nor INC affect the accumulator!
It‘s more like someone familiar with an other machine, say, the DEC PDP-1, where the IDX instruction, similar to INC, leaves the incremented value in the accumulator as a side effect, had a hand in this. On the 6502, though, this is not the case and these two instructions are entirely luxurious, achieving nothing but burning a few processor cycles.

(This piece of code is run whenever we failed to match a keyword candidate and read ahead in the list, in order to advance to the beginning of the next keyword candidate. Luckily, the input of a BASIC line isn’t that time critical.)

Anyways, none of these changes affect the run-away bug, which is still what it ever was:

*** commodore basic 4.0 ***

 31743 bytes free

ready.
10 gosuB 100
20 gosuBreturn 100
list

 10 mid$su 100
 20 gosub 100
ready.
█

Beyond the PET

BASIC V.2 for the VIC-20 and C64 is essentially derived from BASIC 2.0 for the PET. BASIC V.2 does away with the second cassette buffer, adds support for color (i.e., code maintaining the color RAM), implements IEEE-over serial (somewhat unfortunately so, as the clock synchronization didn’t work out as expected), adds support for RS-232C via the user port, and adds a few control characters (much like it’s found in BASIC 4.0 for the PET). But, otherwise it’s still the same.

Commodore 64 BASIC V.2, Kernal Rev. 2

Hence, it’s small wonder that the CRUNCH routine for rev. 2 of the C64 ROM is still the same.
(We may safely assume it’s also the same for the VIC and the rarer rev. 1 Kernal. I’ve to admit, I didn’t even check.)

System addresses are now slightly different, otherwise, it’s still the same, as is the keyword/token list:

PET BASIC 2.0 “New ROM”

C495            LDX $77
C497            LDY #$04
C499            STY $09
C49B    iC49B   LDA $0200,X
C49E            BPL iC4A7
C4A0            CMP #$FF
C4A2            BEQ iC4E2
C4A4            INX
C4A5            BNE iC49B
C4A7    iC4A7   CMP #$20
C4A9            BEQ iC4E2
C4AB            STA $04
C4AD            CMP #$22
C4AF            BEQ iC507
C4B1            BIT $09
C4B3            BVS iC4E2
C4B5            CMP #$3F
C4B7            BNE iC4BD
C4B9            LDA #$99
C4BB            BNE iC4E2
C4BD    iC4BD   CMP #$30
C4BF            BCC iC4C5
C4C1            CMP #$3C
C4C3            BCC iC4E2
C4C5    iC4C5   STY $6E
C4C7            LDY #$00
C4C9            STY $05
C4CB            DEY
C4CC            STX $77
C4CE            DEX
C4CF    iC4CF   INY
C4D0            INX
C4D1    iC4D1   LDA $0200,X
C4D4            SEC
C4D5            SBC $C092,Y
C4D8            BEQ iC4CF
C4DA            CMP #$80
C4DC            BNE iC50E
C4DE            ORA $05
C4E0    iC4E0   LDY $6E
C4E2    iC4E2   INX
C4E3            INY
C4E4            STA $01FB,Y
C4E7            LDA $01FB,Y
C4EA            BEQ iC522
C4EC            SEC
C4ED            SBC #$3A
C4EF            BEQ iC4F5
C4F1            CMP #$49
C4F3            BNE iC4F7
C4F5    iC4F5   STA $09
C4F7    iC4F7   SEC
C4F8            SBC #$55
C4FA            BNE iC49B
C4FC            STA $04
C4FE    iC4FE   LDA $0200,X
C501            BEQ iC4E2
C503            CMP $04
C505            BEQ iC4E2
C507    iC507   INY
C508            STA $01FB,Y
C50B            INX
C50C            BNE iC4FE
C50E    iC50E   LDX $77
C510            INC $05
C512    iC512   INY
C513            LDA $C091,Y
C516            BPL iC512
C518            LDA $C092,Y
C51B            BNE iC4D1
C51D            LDA $0200,X
C520            BPL iC4E0
C522    iC522   STA $01FD,Y
C525            DEC $78
C527            LDA #$FF
C529            STA $77
C52B            RTS

C64 BASIC V.2, Kernal Rev. 2

A57C            LDX $7A
A57E            LDY #$04
A580            STY $0F
A582   iA582    LDA $0200,X
A585            BPL iA58E
A587            CMP #$FF
A589            BEQ iA5C9
A58B            INX
A58C            BNE iA582
A58E   iA58E    CMP #$20
A590            BEQ iA5C9
A592            STA $08
A594            CMP #$22
A596            BEQ iA5EE
A598            BIT $0F
A59A            BVS iA5C9
A59C            CMP #$3F
A59E            BNE iA5A4
A5A0            LDA #$99
A5A2            BNE iA5C9
A5A4   iA5A4    CMP #$30
A5A6            BCC iA5AC
A5A8            CMP #$3C
A5AA            BCC iA5C9
A5AC   iA5AC    STY $71
A5AE            LDY #$00
A5B0            STY $0B
A5B2            DEY
A5B3            STX $7A
A5B5            DEX
A5B6   iA5B6    INY
A5B7            INX
A5B8   iA5B8    LDA $0200,X
A5BB            SEC
A5BC            SBC $A09E,Y
A5BF            BEQ iA5B6
A5C1            CMP #$80
A5C3            BNE iA5F5
A5C5            ORA $0B
A5C7   iA5C7    LDY $71
A5C9   iA5C9    INX
A5CA            INY
A5CB            STA $01FB,Y
A5CE            LDA $01FB,Y
A5D1            BEQ iA609
A5D3            SEC
A5D4            SBC #$3A
A5D6            BEQ iA5DC
A5D8            CMP #$49
A5DA            BNE iA5DE
A5DC   iA5DC    STA $0F
A5DE   iA5DE    SEC
A5DF            SBC #$55
A5E1            BNE iA582
A5E3            STA $08
A5E5   iA5E5    LDA $0200,X
A5E8            BEQ iA5C9
A5EA            CMP $08
A5EC            BEQ iA5C9
A5EE   iA5EE    INY
A5EF            STA $01FB,Y
A5F2            INX
A5F3            BNE iA5E5
A5F5   iA5F5    LDX $7A
A5F7            INC $0B
A5F9   iA5F9    INY
A5FA            LDA $A09D,Y
A5FD            BPL iA5F9
A5FF            LDA $A09E,Y
A602            BNE iA5B8
A604            LDA $0200,X
A607            BPL iA5C7
A609   iA609    STA $01FD,Y
A60C            DEC $7B
A60E            LDA #$FF
A610            STA $7A
A612            RTS

As may be expected from this, the code is also functionally the same.

Commodore 64 BASIC V.2, Kernal Rev. 3

In Kernal Rev. 3, the last incarnation of this code, which is also found in the C128, the tokenizing is still the same, while its address has moved by 4 bytes in ROM:

C64 Kernal Rev. 2

A57C            LDX $7A
A57E            LDY #$04
A580            STY $0F
A582   iA582    LDA $0200,X
A585            BPL iA58E
A587            CMP #$FF
A589            BEQ iA5C9
A58B            INX
A58C            BNE iA582
A58E   iA58E    CMP #$20
A590            BEQ iA5C9
A592            STA $08
A594            CMP #$22
A596            BEQ iA5EE
A598            BIT $0F
A59A            BVS iA5C9
A59C            CMP #$3F
A59E            BNE iA5A4
A5A0            LDA #$99
A5A2            BNE iA5C9
A5A4   iA5A4    CMP #$30
A5A6            BCC iA5AC
A5A8            CMP #$3C
A5AA            BCC iA5C9
A5AC   iA5AC    STY $71
A5AE            LDY #$00
A5B0            STY $0B
A5B2            DEY
A5B3            STX $7A
A5B5            DEX
A5B6   iA5B6    INY
A5B7            INX
A5B8   iA5B8    LDA $0200,X
A5BB            SEC
A5BC            SBC $A09E,Y
A5BF            BEQ iA5B6
A5C1            CMP #$80
A5C3            BNE iA5F5
A5C5            ORA $0B
A5C7   iA5C7    LDY $71
A5C9   iA5C9    INX
A5CA            INY
A5CB            STA $01FB,Y
A5CE            LDA $01FB,Y
A5D1            BEQ iA609
A5D3            SEC
A5D4            SBC #$3A
A5D6            BEQ iA5DC
A5D8            CMP #$49
A5DA            BNE iA5DE
A5DC   iA5DC    STA $0F
A5DE   iA5DE    SEC
A5DF            SBC #$55
A5E1            BNE iA582
A5E3            STA $08
A5E5   iA5E5    LDA $0200,X
A5E8            BEQ iA5C9
A5EA            CMP $08
A5EC            BEQ iA5C9
A5EE   iA5EE    INY
A5EF            STA $01FB,Y
A5F2            INX
A5F3            BNE iA5E5
A5F5   iA5F5    LDX $7A
A5F7            INC $0B
A5F9   iA5F9    INY
A5FA            LDA $A09D,Y
A5FD            BPL iA5F9
A5FF            LDA $A09E,Y
A602            BNE iA5B8
A604            LDA $0200,X
A607            BPL iA5C7
A609   iA609    STA $01FD,Y
A60C            DEC $7B
A60E            LDA #$FF
A610            STA $7A
A612            RTS

C64 Kernal Rev. 3

A578            LDX $7A
A57A            LDY #$04
A57C            STY $0F
A57E   iA57E    LDA $0200,X
A581            BPL iA58A
A583            CMP #$FF
A585            BEQ iA5C5
A587            INX
A588            BNE iA57E
A58A   iA58A    CMP #$20
A58C            BEQ iA5C5
A58E            STA $08
A590            CMP #$22
A592            BEQ iA5EA
A594            BIT $0F
A596            BVS iA5C5
A598            CMP #$3F
A59A            BNE iA5A0
A59C            LDA #$99
A59E            BNE iA5C5
A5A0   iA5A0    CMP #$30
A5A2            BCC iA5A8
A5A4            CMP #$3C
A5A6            BCC iA5C5
A5A8   iA5A8    STY $71
A5AA            LDY #$00
A5AC            STY $0B
A5AE            DEY
A5AF            STX $7A
A5B1            DEX
A5B2   iA5B2    INY
A5B3            INX
A5B4   iA5B4    LDA $0200,X
A5B7            SEC
A5B8            SBC $A09E,Y
A5BB            BEQ iA5B2
A5BD            CMP #$80
A5BF            BNE iA5F1
A5C1            ORA $0B
A5C3   iA5C3    LDY $71
A5C5   iA5C5    INX
A5C6            INY
A5C7            STA $01FB,Y
A5CA            LDA $01FB,Y
A5CD            BEQ iA605
A5CF            SEC
A5D0            SBC #$3A
A5D2            BEQ iA5D8
A5D4            CMP #$49
A5D6            BNE iA5DA
A5D8   iA5D8    STA $0F
A5DA   iA5DA    SEC
A5DB            SBC #$55
A5DD            BNE iA57E
A5DF            STA $08
A5E1   iA5E1    LDA $0200,X
A5E4            BEQ iA5C5
A5E6            CMP $08
A5E8            BEQ iA5C5
A5EA   iA5EA    INY
A5EB            STA $01FB,Y
A5EE            INX
A5EF            BNE iA5E1
A5F1   iA5F1    LDX $7A
A5F3            INC $0B
A5F5   iA5F5    INY
A5F6            LDA $A09D,Y
A5F9            BPL iA5F5
A5FB            LDA $A09E,Y
A5FE            BNE iA5B4
A600            LDA $0200,X
A603            BPL iA5C3
A605   iA605    STA $01FD,Y
A608            DEC $7B
A60A            LDA #$FF
A60C            STA $7A
A60E            RTS

Therefor, all the observation, we made for PET BASIC 1.0 (less the two instructions for white space skipping in keywords) and BASIC 2.0 are still valid, half a decade later.

BTW, you can easily discern the Kernal revision of a C64 by inspecting the contents of ROM location $FF80 (as in “PEEK (65408)”):

0xAA (dec. 170) … Rev. 1 (the address contains just the usual fill byte)
0x00 (dec. 0) … Rev. 2
0x03 (dec. 3) … Rev. 3
0x43 (dec. 67) … SX-64 (similar to Rev. 3, but there are no tape routines and screen colors differ)
0x64 (dec. 100) … CBM 4064 / Educator 64

And that’s it, for today. — With some of the questions actually answered, we drop the curtain (in lesser dismay.)

Norbert Landsteiner,
Vienna, 2025-07-05