Curious adventures in (Commodore) BASIC tokenizing.

No, this is not about a one-hit-wonder from the charts of 1977, rather, it’s about something that could have happened on a Commodore PET around this time. More generally, it is about a curious bug/feature in early BASICs and how this gave way to an additional keyword in Commodore BASIC for the PET.
Both are related to BASIC syntax and how white space in BASIC text is handled by the interpreter. — Which sets up the topic of today’s installment, namely, the tokenizer routine in early MS 9-digit BASIC for the 6502 and how this evolved over time. As this promises some fun with #BASIC and #6502 code on the #PET and involves some #archeology, as well, we really ought to have a look at this…
To begin with, let’s fire up a PET 2001 with BASIC 1.0, the first ROM version shipped with the PET 2001, also known as “Old ROM”, to start a little experiment:
*** COMMODORE BASIC *** 7167 BYTES FREE READY. 10 IF LS = LE THEN GOTO 100 LIST 10 IF LS = LETHEN GOTO 100 READY. RUN ?SYNTAX ERROR IN 10 READY. █
So, what’s happening here?
And why does a perfectly valid BASIC statement throw a sntax error?
Let’s investigate…
As you may probably know, all white space in BASIC is optional and we do not require any of this as a keyword delimiter or separator of any kind. The two following statements work perfectly the same:
100 FOR I = 0 TO 10 STEP 2 100FORI=0TO10STEP2
However, there’s actually a small difference: while the first one may be more readable, the second one executes a bit faster and requires less memory, for which it is preferred by the greybeards.
Going back to our little experiment, something that may spring to the eye is a tiny, but crucial difference in the BASIC text as entered and the BASIC text as listed:
10 IF LS = LE THEN GOTO 100 LIST 10 IF LS = LETHEN GOTO 100 READY.
Did you see it? — No, there is no paranormal entity. — Rather, it’s something missing in the listing, namely the blank between “LE
” and “THEN
”, indicating that this is about a misshap in white space handling.
In order to investigate this further, we have to ask first
Words and Letters, or, What is “Tokenizing”?
An important consequence of not requiring any separation between keywords and other program text is that every letter in the program text can be the beginning of a keyword. E.g., as we encounter
20 LET A=V
We do not know, yet, whether this is the start of a keyword, as in
20 LET A=VAL(A$)
or “V
” is part of an identifier (variable name), as in
20 LET A=V1+5
The only way to discern this is by scanning ahead and finding out if this adds up to a valid keyword (a BASIC command) or not.
Now, going over a list of all keywords for each letter, comparing each letter to a valid start of a keyword and then scanning ahead to see, if this does indeed match, is a rather costly operation. It would be nice, if we could somehow avoid this.
And this is where the tokenizing comes to the rescue: every time, we enter a line of BASIC, we’ll copied the line to an input buffer, and there will be routine to go over this, in order to discern any keywords. Every time the routine does find a keyword, it will replace it by an integer value not allowed in regular BASIC text. As BASIC is generally restricted to 7-bit ASCII (that is, outside strings), we can use any values starting at 128
(hex 0x80
) and above. This way, we do it once and for all. Later, when the runtime routine of the interpreter encounters such a value (a token), it will immediately know what this is about and it will be even able to use this as an index into a lookup table for quick access.
Moreover, in times when memory was still expensive and thus scarce, this reduced the memory requirements for storing a BASIC program quite significantly. In an average BASIC listing, keywords make up for about half of the text, besides line numbers, so the savings will quite dramatically extend the reach of our memory.
What’s not to love about this? Tokens are really the key to BASIC.
In-Memory BASIC Programs
Knowing this, we may have a deeper look into BASIC text as is represented in memory. As for instance, on the example of our offending line of BASIC:
10 IF LS = LE THEN GOTO 100 LIST 10 IF LS = LETHEN GOTO 100 READY.
The equivalent tokenized program text is stored in the user area of the memory, which starts on the PET at 0x0401
:
addr code semantics 0401 17 04 link: $0417 (addr. of next line, 16-bit little endian) 0403 0A 00 line# 10 (16-bit binary) 0405 8B token IF 0406 20 4C 53 20 ascii « LS » 040A B2 token = 040B 20 ascii « » 040C 88 token LET 040D 48 45 4E 20 ascii «HEN » 0411 89 token GOTO 0412 20 31 30 30 ascii « 100» 0416 00 -EOL- 0417 00 00 -EOT- (link = null)
At first glance, we can see that BASIC has converted our input string “LE THEN
” into the token for “LET
” (0x88
) and the remaining text “HEN
”. Moreover, we can see that any keyword tokens are encoded as a single byte with a set sign-bit (n ≥ 0x80
) and anything else is preserved as-is in plain ASCII text.
So, quite clearly, BASIC skipped over the blank separating “LE
” from “THEN
”, recognizing the keyword (command) “LET
” and encoding this as the token ID 0x88
. — But, why should it do so?
BASIC and White Space
Initially, BASIC simply ignored any white space, just like FORTRAN and Algol did. The parser simply skipped over any blanks and these were of no significance, whatsoever. This is how Dartmouth BASIC did it and how many versions of BASIC did it in the 1970s. Notably, is this also what you would find in listings.
But this had consequences: e.g., at this time, it wasn’t that clear whether “GOTO
” was a single word or two, as in “GO TO
”. In English, these are certainly two words and to the interpreter it didn’t matter. So we may be rather undecided on this. — And, indeed, there were plenty of listings, where you could find constructs like:
120 IF A > 10 THEN GO TO 200
In MS BASIC, this is taken care of by the tokenizer routine. As we may observe in our above peek into the in-memory representation, it’s particularly in keywords, where the white space was ignored, while other instances of white space are preserved and faithfully reproduced in any listings.
That is, AppleSoft BASIC, another early version of MS BASIC for the 6502 and a close relative to Commodore BASIC, skips any white space and inserts a single blank after each keyword.
Thus, on the Apple ][, we get (even more obviously giving away, what might have gone wrong):
AppleSoft BASIC on the Apple ][:
]10 IF LS = LE THEN GOTO 100
]LIST
10 IF LS = LET HEN GOTO 100
]█
This is not dissimilar from what Commodore BASIC does with line numbers:
10A=2*0.5 20 PRINT A LIST 10 A=2*0.5 20 PRINT A READY.
Like AppleSoft BASIC in any other context, it skips over any white space after the line number and inserts a single blank for listings.
Which is also why we can’t have indented text and have to resort to tricks, if we really want indentations:
100 FOR Y=1 TO 20 110 : FOR X = 1 TO 40 120 : PRINT X*Y 130 : NEXT X 140 NEXT Y LIST 100 FOR Y=1 TO 20 110 : FOR X = 1 TO 40 120 : PRINT X*Y 130 : NEXT X 140 NEXT Y READY.
As we may see, Commodore BASIC doesn’t skip any blanks after a separating colon (“:
”).
And Numbers? (BTW)
Glad you should ask. It’s a lesser known fact that numbers are another area where Commodore BASIC deliberately skips over any blanks:
### COMMODORE BASIC ### 15359 BYTES FREE READY. 10 A= 1 000 000 .00 20 A= A*2 30 PRINT A LIST 10 A= 1 000 000 .00 20 A= A*2 30 PRINT A READY. RUN 2000000 READY.
While the listing proves that the constant “1 000 000 .00
” is faithfully stored in memory, the interpreter doesn’t hick up over the interspersed blanks, rather, it elegantly skips over them.
We can see this also, as we peek into memory:
addr code semantics 0401 16 04 link: $0416 0403 0A 00 line# 10 0405 41 ascii «A» 0406 B2 token = 0407 20 31 20 30 30 30 ascii « 1 000 000 .00» 040D 20 30 30 30 20 2E 0413 30 30 0415 00 -EOL- 0416 21 04 link: $0421 0418 14 00 line# 20 041A 41 ascii «A» 041B B2 token = 041C 20 41 ascii « A» 041E AC token * 041F 32 ascii «2» 0420 00 -EOL- 0421 29 04 link: $0429 0423 1E 00 line# 30 0425 99 token PRINT 0426 20 41 ascii « A» 0428 00 -EOL- 0429 00 00 -EOT- (link = null)
But why should it do so? Obviously, this is slow and cumbersome, since the interpreter has to parse each of those constants anew and skip any of those blanks, whenever it encounters a number, often repeatedly so, like in a FOR
–NEXT
loop.
Well, there‘s more than a single way to represent a number.E.g., all those are the same:
1000
1000.0000
1 000
1 000.00
1E3
1.0E+3
(and so on)
It wouldn’t be that user friendly to type
100 A= 1000.00
just to have this listed back as, say,
LIST 100 A=1.0E3
The same goes for identifiers (variable names). The interpreter could maintain a reference table and just insert offsets into the program, but, for a dialog-like system, it’s much better to maintain a meaningful symbolic representation for listing and editing (instead of maybe an offset marker or hash, like #34e2
). Or, as only the first two characters are significant, it could easily truncate any longer names.
A compiler, on the other hand, can convert this at compile time, it can also resolve any calculations at compile time and even optimize over such precomputations. E.g, our above example
10 A= 1 000 000 .00 20 A= A*2
could become just something along the lines of
<STORE-REAL@FRAME-OFFSET> 0x48 2e6
Meaning, if BASIC is slow (and it is), it is so, because it’s user-friendly.
It’s slow so that it can preserve all the intricacies of the input, in order to reproduce them in a meaningful listing.
That is but for the instances where it doesn’t, like with our “LET HEN
” example.
Crunch Time
As for now, we have already established that our syntax error
isn’t a bug, but a user error: if the keyword routine can skip over blanks to correctly identify constructs like “GO TO
”, this implies that no adjacent strings of letters may add up to something that contains a valid BASIC command, even across word boundaries. (That is, outside of quoted string constants, of course.)
In this case, we got the legitimate keyword “LET
”, which isn’t a valid right-hand part of a comparison. So, yes, this a syntax error, indeed.
There isn’t any argueing over this: if skipping blanks is a feature, the error is in front of the screen.
Notably, this is not an issue on a PET with the upgraded Commodore BASIC 2.0 (AKA “New ROM”) or, for the matter, with any other BASIC for Commodore 8-bits, thereafter.
Commodore BASIC 2.0, “New ROM”:
### COMMODORE BASIC ### 7167 BYTES FREE READY. 10 IF LS = LE THEN GOTO 100 100 PRINT "OK" LIST 10 IF LS = LE THEN GOTO 100 100 PRINT "OK" READY. RUN OK READY.
So, what is special about Commodore BASIC 1.0 (or any other early flavor of MS BASIC for the 6502) and what does it do differently?
PET BASIC, “Old ROM”
What‘s needed for keyword parsing comes in to parts: a list of keywords and a routine to do the actual work, suggestively named “CRUNCH
”, which splits kewords from other text in the input stream and converts them to tokens.
Let‘s have a look at the data first.
The list of keywords starts at 0xC092
(underlined characters indicate a set sign-bit):
addr code petscii C092 45 4E C4 46 4F D2 ENDFOR C098 4E 45 58 D4 44 41 54 C1 NEXTDATA C0A0 49 4E 50 55 54 A3 49 4E INPUT#IN C0A8 50 55 D4 44 49 CD 52 45 PUTDIMRE C0B0 41 C4 4C 45 D4 47 4F 54 ADLETGOT C0B8 CF 52 55 CE 49 C6 52 45 ORUNIFRE C0C0 53 54 4F 52 C5 47 4F 53 STOREGOS C0C8 55 C2 52 45 54 55 52 CE UBRETURN C0D0 52 45 CD 53 54 4F D0 4F REMSTOPO C0D8 CE 57 41 49 D4 4C 4F 41 NWAITLOA C0E0 C4 53 41 56 C5 56 45 52 DSAVEVER C0E8 49 46 D9 44 45 C6 50 4F IFYDEFPO C0F0 4B C5 50 52 49 4E 54 A3 KEPRINT# C0F8 50 52 49 4E D4 43 4F 4E PRINTCON C100 D4 4C 49 53 D4 43 4C D2 TLISTCLR C108 43 4D C4 53 59 D3 4F 50 CMDSYSOP C110 45 CE 43 4C 4F 53 C5 47 ENCLOSEG C118 45 D4 4E 45 D7 54 41 42 ETNEWTAB C120 A8 54 CF 46 CE 53 50 43 (TOFNSPC C128 A8 54 48 45 CE 4E 4F D4 (THENNOT C130 53 54 45 D0 AB AD AA AF STEP+-*/ C138 DE 41 4E C4 4F D2 BE BD ^ANDOR>= C140 BC 53 47 CE 49 4E D4 41 <SGNINTA C148 42 D3 55 53 D2 46 52 C5 BSUSRFRE C150 50 4F D3 53 51 D2 52 4E POSSQRRN C158 C4 4C 4F C7 45 58 D0 43 DLOGEXPC C160 4F D3 53 49 CE 54 41 CE OSSINTAN C168 41 54 CE 50 45 45 CB 4C ATNPEEKL C170 45 CE 53 54 52 A4 56 41 ENSTR$VA C178 CC 41 53 C3 43 48 52 A4 LASCCHR$ C180 4C 45 46 54 A4 52 49 47 LEFT$RIG C188 48 54 A4 4D 49 44 A4 00 HT$MID$~
The tokenizer routine (AKA CRUNCH
) traverses over this list, using a set sign-bit for a word termination after that given character. In other words: a set sign-bit indicates the last character of any keyword. If a match is found, the index into this list added by 0x80
(again, the sign-bit set) provides the token ID.
If no match is found, as indicated by reaching the terminating zero-byte, the character, which was just inspected, must be a plain ASCII character (maybe part of a name) and is consequently copied to the program text as-is.
Thus, the above list translates to the following keyword-token table:
code keyword token index 45 4E C4 END 0x80 ( 0) 46 4F D2 FOR 0x81 ( 1) 4E 45 58 D4 NEXT 0x82 ( 2) 44 41 54 C1 DATA 0x83 ( 3) 49 4E 50 55 54 A3 INPUT# 0x84 ( 4) 49 4E 50 55 D4 INPUT 0x85 ( 5) 44 49 CD DIM 0x86 ( 6) 52 45 41 C4 READ 0x87 ( 7) 4C 45 D4 LET 0x88 ( 8) 47 4F 54 CF GOTO 0x89 ( 9) 52 55 CE RUN 0x8A (10) 49 C6 IF 0x8B (11) 52 45 53 54 4F 52 C5 RESTORE 0x8C (12) 47 4F 53 55 C2 GOSUB 0x8D (13) 52 45 54 55 52 CE RETURN 0x8E (14) 52 45 CD REM 0x8F (15) 53 54 4F D0 STOP 0x90 (16) 4F CE ON 0x91 (17) 57 41 49 D4 WAIT 0x92 (18) 4C 4F 41 C4 LOAD 0x93 (19) 53 41 56 C5 SAVE 0x94 (20) 56 45 52 49 46 D9 VERIFY 0x95 (21) 44 45 C6 DEF 0x96 (22) 50 4F 4B C5 POKE 0x97 (23) 50 52 49 4E 54 A3 PRINT# 0x98 (24) 50 52 49 4E D4 PRINT 0x99 (25) 43 4F 4E D4 CONT 0x9A (26) 4C 49 53 D4 LIST 0x9B (27) 43 4C D2 CLR 0x9C (28) 43 4D C4 CMD 0x9D (29) 53 59 D3 SYS 0x9E (30) 4F 50 45 CE OPEN 0x9F (31) 43 4C 4F 53 C5 CLOSE 0xA0 (32) 47 45 D4 GET 0xA1 (33) 4E 45 D7 NEW 0xA2 (34) 54 41 42 A8 TAB( 0xA3 (35) 54 CF TO 0xA4 (36) 46 CE FN 0xA5 (37) 53 50 43 A8 SPC( 0xA6 (38) 54 48 45 CE THEN 0xA7 (39) 4E 4F D4 NOT 0xA8 (40) 53 54 45 D0 STEP 0xA9 (41) AB + 0xAA (42) AD - 0xAB (43) AA * 0xAC (44) AF / 0xAD (45) DE ^ 0xAE (46) 41 4E C4 AND 0xAF (47) 4F D2 ON 0xB0 (48) BE > 0xB1 (49) BD = 0xB2 (50) BC 0xB3 (51) 53 47 CE SGN 0xB4 (52) 49 4E D4 INT 0xB5 (53) 41 42 D3 ABS 0xB6 (54) 55 53 D2 USR 0xB7 (55) 46 52 C5 FRE 0xB8 (56) 50 4F D3 POS 0xB9 (57) 53 51 D2 SQR 0xBA (58) 52 4E C4 RND 0xBB (59) 4C 4F C7 LOG 0xBC (60) 45 58 D0 EXP 0xBD (61) 43 4F D3 COS 0xBE (62) 53 49 CE SIN 0xBF (63) 54 41 CE TAN 0xC0 (64) 41 54 CE ATN 0xC1 (65) 50 45 45 CB PEEK 0xC2 (66) 4C 45 CE LEN 0xC3 (67) 53 54 52 A4 STR$ 0xC4 (68) 56 41 CC VAL 0xC5 (69) 41 53 C3 ASC 0xC6 (70) 43 48 52 A4 CHR$ 0xC7 (71) 4C 45 46 54 A4 LEFT$ 0xC8 (72) 52 49 47 48 54 A4 RIGHT$ 0xC9 (73) 4D 49 44 A4 MID$ 0xCA (74) 00 <end-of-list>
We may observe a few pecularities:
- There are 75 tokens in total.
- There’s some optimization in the order of keywords, with frequently occuring keywords like “
FOR
”, “NEXT
”, “INPUT
”, “DATA
”, “READ
”, “DIM
”, “LET
”, “GOTO
” listed first. (The token0x80
, index #0, is a “natural” candidate for “END
”.) - The “
#
” in “INPUT#
” and “PRINT#
” is not just a decoration, rather, these are keywords in their own right, represented by their own, unique tokens. - This also means that “
INPUT#
” must be inspected before “INPUT
”, as is “PRINT#
” before “PRINT
”. Otherwise, we’d always match on the respective latter and never catch the former.
(As we’ll see later, this has some consequences.) - Interestingly, “
GET#
” is not a token, meaning, here, “#
” is a decoration that has to be checked for separately. - “
SPC(
” and “TAB(
” include the opening paranthesis, while other function keywords do not. - This also means that “
SPC
” and “TAB
” are valid variable names!
(Only the first two characters will be significant, but the tokenizer routine will not trigger on those names and will copy the full strings to the program text.)
I guess, this another remarkable effort for broader comaptibilty with pre-existing BASIC listings? - While operators are generally tokens, there are no tokens for “
>=
”, “<=
”, or “<>
” (or, alternatively, “=>
”, “=<
”, “><
” — yes, BASIC has a double-arrow-operator and a bow-tie operator, but these are just alternative forms of the common relational operators.)
Instead of having their own tokens, these operators are built from combinations of “<
”, “>
”, and “=
”.
The Code — a Tour
So, as we know what’s actually processed, we may have a look at the CRUNCH
-y routine.
This starts $C48D
and processes zero-terminated input data from the BASIC buffer, which is for BASIC 1.0 located at $000A
–$0059
(10–89, 80 bytes long). The initial index from $0000
into this will be available in $C9
, pointing to the first non-blank character after the line number (at least 0x0B
or decimal 11).
The BASIC input buffer also doubles as the output buffer, meaning, CRUNCH
converts the contents of the BASIC buffer in situ. (Due to tokenization, the output will be almost always shorter than the input and is guaranteed to not exceed it. Moreover, while reading starts at the first character after the line number, which is at least at 0x0B
, the ouput will follow behind, starting at 0x0A
.)
The X
register is generally used as an index or cursor for reading from the BASIC buffer, while the Y
register is used either as an index into the keyword list or as an output cursor.
Note: With JavaScript enabled, you can hover over a labeled target to highlight the related label.)
C48D LDX $C9 ;set read cursor to initial char in BASIC buffer C48F LDY #$04 ;initialize output cursor ($0005+4 = one before $000A) C491 STY $60 ;also set this as mode flag (statement/string/DATA) C493 iC493 LDA $00,X ;read next char C495 BPL iC49E ;sign-bit not set: it's ASCII, skip ahead… C497 CMP #$FF ;is it the token for pi ("π")? C499 BEQ iC4DC ;yes, branch to copy it… C49B INX ;ignore (graphics) & advance read cursor C49C BNE iC493 ;redo for next char, unless zero… (unconditional) C49E iC49E CMP #$20 ;process character; is it a blank? C4A0 BEQ iC4DC ;yes, branch to copy it… C4A2 STA $5B ;store it as a delimiter C4A4 CMP #$22 ;is it a quotation mark (`"`)? C4A6 BEQ iC500 ;yes, branch to copy the string… C4A8 BIT $60 ;check mode flag C4AA BVS iC4DC ;if bit #6 set ($49: DATA), branch to copy as-is… C4AC CMP #$3F ;is it a question mark ("?") C4AE BNE iC4B4 ;no, skip ahead… C4B0 LDA #$99 ;load token for PRINT C4B2 BNE iC4DC ;unconditional branch to save… C4B4 iC4B4 CMP #$30 ;compare to ASCII "0" C4B6 BCC iC4BC ;smaller, not a digit, skip ahead… C4B8 CMP #$3C ;compare it to "<" C4BA BCC iC4DC ;branch to copy "0"…"9", ":", ";" as-is… C4BC iC4BC STY $C0 ;keyword search; store a backup of the output cursor C4BE LDY #$00 ;load 0 C4C0 STY $5C ;reset keyword index C4C2 DEY ;prepare Y for pre-increment loop C4C3 STX $C9 ;store a backup of the read cursor position C4C5 DEX ;prepare X for pre-increment loop C4C6 iC4C6 INY ;increment index into the keyword list C4C7 iC4C7 INX ;advance read cursor C4C8 iC4C8 LDA $00,X ;read char C4CA CMP #$20 ;is it a blank? C4CC BEQ iC4C7 ;yes: skip and branch to read the next char… C4CE SEC ;prepare subtraction C4CF SBC $C092,Y ;subtract current keyword char for comparison C4D2 BEQ iC4C6 ;exact match: continue with next char… C4D4 CMP #$80 ;is the difference the sign-bit (end of word)? C4D6 BNE iC507 ;it's a mismatch: prepare for next keyword… C4D8 ORA $5C ;it’s a token: $80 OR keyword index (in $5C) C4DA iC4DA LDY $C0 ;restore ouput cursor from backup C4DC iC4DC INX ;save a byte; pre-increment read cursor C4DD INY ;advance output cursor C4DE STA $0005,Y ;store the processed byte (char or token) C4E1 LDA $0005,Y ;post-processing; load again to set flags C4E4 BEQ iC51A ;was it zero (EOL)? branch to finalize… C4E6 SEC ;prepare subtraction C4E7 SBC #$3A ;subtract ":" (statement separator) C4E9 BEQ iC4EF ;found end of BASIC statement, skip to set mode flag… C4EB CMP #$49 ;was it $83 ($3A+$49, token for DATA)? C4ED BNE iC4F1 ;no, skip ahead… C4EF iC4EF STA $60 ;mode-flag: $00,$04=statement, $49=DATA C4F1 iC4F1 SEC ;prepare subtraction C4F2 SBC #$55 ;was it $8F ($3A+$55, token for REM)? C4F4 BNE iC493 ;no: branch to read & process next input char… C4F6 STA $5B ;set the delimiter to zero (EOL) C4F8 iC4F8 LDA $00,X ;get next input char C4FA BEQ iC4DC ;zero (EOL), branch to store it as-is… C4FC CMP $5B ;compare it to delimiter byte ($00 or $22) C4FE BEQ iC4DC ;found delimiter, branch to store it… C500 iC500 INY ;simple copy (REM, string), increment output cursor C501 STA $0005,Y ;store current byte C504 INX ;advance read cursor C505 BNE iC4F8 ;branch to copy next byte, unless zero… (unconditional) C507 iC507 LDX $C9 ;restore read cursor from backup C509 INC $5C ;increment keyword index C50B iC50B INY ;search ahead in list for end of keyword C50C LDA $C091,Y ;read keyword byte C50F BPL iC50B ;last char (sign-bit set)? if not, get next byte… C511 LDA $C092,Y ;peek into next byte C514 BNE iC4C8 ;branch to start over and try the next keyword… C516 LDA $00,X ;end of list, no match: re-read the input character C518 BPL iC4DA ;unconditional branch to copy it as-is… C51A iC51A STA $0007,Y ;end of line; store zero byte (could be again) C51D LDA #$09 ;reset start position for read cursor to $09 C51F STA $C9 ;store it C521 RTS ;done
Well, this may be quite a bit to swollow at once. Let’s have look at this in decent bytes.
— (I am not sorry!) —
The first part is just a bit of initialization, which also introduces a few important concepts:
C48D A6 C9 LDX $C9 ;first char position from $0000 C48F A0 04 LDY #$04 ;initial write index from $0005 C491 84 60 STY $60
First, we set up our read cursor, reading it from location $C9
. This will be initially set (and reset) to 0x09
, but as we enter the routine, this will have been advanced to point to the very first non-blank character after the line number. (So we have already skipped over any leading white space.) Notably, this will be used as an index from $0000
.
The output/write cursor is initialized to 0x04
. This an offset from $0005
, so this will point to $0009
, just before the start of the BASIC buffer.
Finally, we also store 0x04
in $60
, which is used as mode flag. (0x04
is coincidentially the ASCII code for EOT
.) Over the run of the routine, this mode flag may also receive the values 0
or 0x49
. But we’ll only care about bit #6 of this value, which will only set in 0x49
, indicating that we’re reading a “DATA
” section.
C493 B5 00 iC493 LDA $00,X ;fetch char from buffer C495 10 07 BPL iC49E C497 C9 FF CMP #$FF ;`π`? C499 F0 41 BEQ iC4DC C49B E8 INX C49C D0 F5 BNE iC493
This is the head of the main read-process loop.
“LDA $00,X
” loads the next byte/character to inspect. First, we check the sign-bit. If it’s unset, this means we’re dealing with a regular ASCII character and skip ahead to the next block. Otherwise, we test for 0xFF
, the code for pi (π). If it is pi, we branch to write it to the output position as-is.
If we’re still going, the sign-bit is set and it’s not pi. Meaning, it must be an illegal (shifted or graphics) character, which is not valid input text. Hence, we advance the read cursor and branch unconditionally to the beginning of the loop for the next character, ignoring the invalid byte.
C49E C9 20 iC49E CMP #$20 ;blank? C4A0 F0 3A BEQ iC4DC C4A2 85 5B STA $5B C4A4 C9 22 CMP #$22 ;`"`? C4A6 F0 58 BEQ iC500 C4A8 24 60 BIT $60 C4AA 70 30 BVS iC4DC C4AC C9 3F CMP #$3F ;`?`? C4AE D0 04 BNE iC4B4 C4B0 A9 99 LDA #$99 C4B2 D0 28 BNE iC4DC C4B4 C9 30 iC4B4 CMP #$30 ;`0`? C4B6 90 04 BCC iC4BC C4B8 C9 3C CMP #$3C ;`<`? C4BA 90 20 BCC iC4DC
Here, we’re back for valid input characters. Time to handle some further special cases:
First, we check for a blank (0x20
), and if it is, we branch to $C4DC
to copy it as-is.
Then, we store the current character in $5B
to be used as a string delimiter later on. (This will only matter, if the current character is a quotation mark.)
Let’s check, if it actually is a quotation mark (0x22
): if so, we branch to $C500
to copy the entire string.
Now it‘s $60
’s heyday: the BIT
instruction transfers (besides other things) bit #6 from our mode flag into the V
flag of the processor status register. If it’s set, this indicates that we’re currently reading a DATA
section and we branch (again) to that part at $C4DC
to copy this as-is.
The next check is for 0x39
(“?
”): if it’s not a question mark, we skip ahead. But, if it is, this is the standard shorthand for PRINT
(which is not in our keyword list) and we load the corresponding token number (0x99
) and branch unconditionally (we just loaded the non-zero value 0x99
, therefor the zero flag is unset) to $C4DC
to write this byte.
Finally, we check for digits (0…9), “:
” (0x3A
) and “;
” (0x3B
), which can be copied at $C4DC
, as well.
Now, if we’re still processing, this means that it will be probably a letter or an operator. Time to check for keywords…
C4BC 84 C0 iC4BC STY $C0 C4BE A0 00 LDY #$00 C4C0 84 5C STY $5C ;keyword index C4C2 88 DEY C4C3 86 C9 STX $C9 C4C5 CA DEX
Let’s prepare for the keyword search: First, we have to backup our write cursor, since we will now use the Y
register as an index into our keyword list, and also, as we’re at it, we reset Y
to zero. Next, we reset the the counter in $5C
, which keeps track of the current keyword index, to zero, as well. Then, we decrement the read index into the keyword list (in Y
) to -1
, because our main search loop will start with an increment.
We also store a backup of our buffer read cursor (in X
) at $C9
(the same pointer, we initialized it from), since we’re going to probingly read ahead and will probably have to start over again. And, just as before, we decrement it to prepare for the coming pre-increment loop.
C4C6 C8 iC4C6 INY C4C7 E8 iC4C7 INX C4C8 B5 00 iC4C8 LDA $00,X C4CA C9 20 CMP #$20 ;blank? C4CC F0 F9 BEQ iC4C7 C4CE 38 SEC C4CF F9 92 C0 SBC $C092,Y C4D2 F0 F2 BEQ iC4C6 C4D4 C9 80 CMP #$80 ;sign-bit? C4D6 D0 2F BNE iC507 C4D8 05 5C ORA $5C ;keyword index
Welcome to the main attraction, this is why we came here: the inner loop of the keyword search!
We start by advancing both the index into the keyword list and the read cursor and read the next byte. If it’s a blank (0x20
), we branch back to get read the next character from the buffer.
That’s it: this is how the tokenizer routine skips specially over any white space in keywords!
If it’s a non-blank character, we’re engaging into some destructive tests:
First, we subtract the corresponding byte from the keyword list. If the result is zero, we have an exact match, meaning, we have still a valid candidate, but not yet reached the final character, and redo for the next pair.
Next, we check for the end of a keyword, as indicated by the difference amounting to exactly 0x80
. If the difference amounts to anything else, we have a mismatch and branch to $C507
, in order to prepare for another go with the next keyword candidate.
If, on the other hand, the difference is 0x80
, which is a set sign-bit, we successfully advanced to the very last character of the given keyword and matched it. By OR
-ing this remaining 0x80
with the current keyword index (as stored in $5C
) we get our token.
This should be worth a brief contemplation: in a byte-size subtraction, it doesn’t matter, which of the two operands has the sign-bit set and which one not: (0x80+x)-x ≡ x-(0x80+x) ≡ 0x80
.
Since there is no carry involved to propagate the sign, this is essentially an XOR
operation. (Which is also the raison d'être of the overflow flag: otherwise, we’d never know.)
Indeed, the same could have been achieved by an EOR
instruction — and even a bit more economically so. (That it is not so, hints at an origin in a more general codebase adn option to encode this in a different manner.)
Coincidentially (or, rather, very much so by design), a shifted PETSCII character is the same as its unshifted counterpart, with a set sign-bit. This is how BASIC abbreviations work in Commodore BASIC: simply by tricking the tokenizer routine into matching a keyword early. If we were successfully matching a candidate word up to here and the difference is 0x80
, the routine assumes that it reached the end of the keyword and matched the entire word: voila, here is your token!
On the other hand, this is also responsible for some awkwardness in this: e.g., if we enter “iN” or “inpU”, this matches “INPUT#
” rather than the much more frequent “INPUT
”, because the former is tested first (compare the order of the keyword list). And necessarily so, otherwise the tokenizer routine couldn’t match “INPUT#
”, at all. And the same applies to “PRINT#
”.
Skipping out-of-order ahead to $C507
, we get to where the loop is prepared for a flying start-over to check against the next keyword candidate in the list (which is where to branched to in case of a missmatch):
C507 A6 C9 iC507 LDX $C9 C509 E6 5C INC $5C ;keyword index C50B C8 iC50B INY C50C B9 91 C0 LDA $C091,Y C50F 10 FA BPL iC50B ;last keyword char? C511 B9 92 C0 LDA $C092,Y C514 D0 B2 BNE iC4C8 ;end of list? C516 B5 00 LDA $00,X C518 10 C0 BPL iC4DA ;write it (unconditional)
First, we restore our buffer read cursor to where we started trying to match a keyword, then, we increment our counter into the keyword list (as a base for a potential token ID).
Then we advance our index into the keyword list to the next end of word: starting with the usual pre-increment, we read the current byte again. Mind that we’re not reading from $C092
as before, but from $C091
. And we repeat this until we find a byte with a set sign-bit.
Having thus reached the end of the word, we peek ahead, this time reading from $C092
, to see, if this the zero-byte marking the end of the keyword list, or if there’s more to do. If there is, we branch to the entrance of our keyword search loop with X
and Y
already set up.
In case we exhausted the list and read a zero-byte, we just write the current input character as-is. For this, we read the current input character from the BASIC buffer fresh again and branch unconditionally (there is no way of an overflow of the index register for a buffer that ends at $0059
) to $C4DA
, where we will write this to the output.
So far, we’ve seen the greater workings of the loop. If we previously fell through with a valid BASIC token in the accumulator, we first restore the output cursor from its backup:
C4DA A4 C0 iC4DA LDY $C0
Then it’s time to actually store a processed byte, which is either a token or a raw character (this is the very address, we branched to earlier for various special cases):
C4DC E8 iC4DC INX C4DD C8 INY C4DE 99 05 00 STA $0005,Y C4E1 B9 05 00 LDA $0005,Y C4E4 F0 34 BEQ iC51A ;EOL?
After a now customary preincrement to advance both the read and write cursors to the respective next buffer position, we simply store the current byte — just to load it again, do do some post-processing tests. First, we check if we just stored a zero-byte (EOL
), indicative of having reached the of the line. If so, we branch to $C51A
to finalize.
(We may observe that both instructions for write and read-back are of a 3-byte, full address format, where a zero-byte address mode would have done perfectly. There’s no technical reason for this, since there’s no chance that we could exceed the indexing range for a buffer, which ends at $59
. There’s obvious opportunity for further optimzation. But, more importantlty, it’s another hint at this having been derived from a more general codebase.)
If still in business and there’s still more input to process, we engage in another round of destructive testing:
C4E6 38 SEC C4E7 E9 3A SBC #$3A ;`:` C4E9 F0 04 BEQ iC4EF C4EB C9 49 CMP #$49 ;$83 = DATA? C4ED D0 02 BNE iC4F1 C4EF 85 60 iC4EF STA $60 C4F1 38 iC4F1 SEC C4F2 E9 55 SBC #$55 ;$8F = REM? C4F4 D0 9D BNE iC493 ;to entrance of main loop
First, we subtract 0x3A
, the code for a colon (“:
”). If it is now zero, it’s a match and we skip ahead to store this as the current mode flag in 0x60
. Otherwise, we compare this to 0x49
.0x3A + 0x49 = 0x83
, which is — consulting our keyword-token table — the token for “DATA
”. If it is now 0x49
, it must have been a DATA
token before the subtraction, and we skip ahead to store this remainder as the new mode flag.
In any other case, we skip ahead and subtract 0x55
:0x3A + 0x55 = 0x8F
, which happens to be the token for “REM
”, and, if the subtraction yields a zero in the accumulator, we have a match.
If it’s not a REM
statement, we branch back to the entrance of our read-process loop at $C493
to fetch and process the next input character.
C4F6 85 5B STA $5B ;delimiter C4F8 B5 00 iC4F8 LDA $00,X C4FA F0 E0 BEQ iC4DC ;EOL? C4FC C5 5B CMP $5B ;delimiter met? C4FE F0 DC BEQ iC4DC C500 C8 iC500 INY C501 99 05 00 STA $0005,Y C504 E8 INX C505 D0 F1 BNE iC4F8 ;write it (unconditional)
If it was “REM
”, we clear the delimiter in 0x5B
, in order to read to the end of line (EOL) and to consume the remaining line entirely.
What follows is the loop that copies a continous string of characters, as for a REM
clause or a quoted string. We load the current character, and, if it’s not the end of the line, we compare it to the delimiter. If it is the end of line or the delimiter, we jump back to $C4F8
to write the character as usual (where also the EOL
zero-byte will be finally caught).
Otherwise, we advance the output cursor and write the byte. And, after a pre-increment of the input cursor, we unconditionally branch to the entrance of this copy loop. (Again, there is no way to overflow on X
for a buffer of 80 bytes length.)
Mind that this close copy loop applies to quoted strings and REM
statements only. Notably, REM
statements may contain shifted characters, while, due to a bug in the LIST
routine, these will be expanded to their full keyword equivalent in any listing, as if these were tokens.
C51A 99 07 00 iC51A STA $0007,Y C51D A9 09 LDA #$09 C51F 85 C9 STA $C9 C521 60 RTS
The last few lines are finalizing the transformation and we only get here, if we read a terminating zero-byte. First we write this zero-byte to the buffer (we may have done so previously, but better to be sure) and after resetting the index for the input cursor to 0x09
, we finally exit the routine.
— Phew! —
Closing Observations
By this, the BASIC line
10 PRINT A+5
will be transformed in the buffer from
0000: 4C 30 D1 00 00 00 7F 7F L0...... 0008: 0A 00 31 30 20 50 52 49 ..10 PRI 0010: 4E 54 20 41 2B 35 00 44 NT A+5.D
into the equivalent tokenized code:
0000: 4C 30 D1 00 00 00 7F 7F L0...... 0008: 0A 00 99 20 41 AA 35 00 ... A.5. 0010: 4E 00 20 41 2B 35 00 44 N. A+5.D
(Mind the stray zero-byte at $0011
, an artefact of the final better-sure-than-sorry write operation, which happened after the byte was already written and the pointers (cursors in X
and Y
) were pre-incremented again. We may also observe that this does no harm and that there’s also a termination in the correct memory location.)
If in direct mode, this code can be executed immediately from here, otherwise, it will be copied to user memory as (part of) a stored program.
We may also observe the 16-bit binary equivalent (0A 00
) of the already processed line number immediately before this, at locations $0008
and $0009
.
A few observations:
- Overall, this is a rather efficient, self-contained routine. There are only branch instructions and no jumps, saving a few of those precious ROM bytes.
- There are also a few inefficiencies, owed to the fact that this is from a more general codebase and wasn’t specially adapted to the BASIC buffer being located in the zero-page (as for the codebase, it could be located anywhere). Or the end-of-keyword matching that may have been performed by an
EOR
instruction more compactly. - There’s also a major inefficiency, namely with the handling of a terminating zero-byte. Not only does the routine write this in most cases repeatedly, it’s also only checked for after a byte (character or token) has been written and reloaded, which happens only after the attempt to match a keyword. In other words, as we load a zero-byte at
$C493
, we pass all the checks for special cases and enter the keyword matching routine at$C4BC
, just to turn a victory lap over all the 75 possible keywords. Only then we fall through to writing the byte at$C4E4
, where the zero-byte will be finally detected, but only after the write operation, and from here we finally jump to the end of the routine, just to write this byte again. - The implementation of the special skipping over any white space in keywords is rather unspectacular: just a simple “
CMP #$20
” followed by a “BEQ
” branch instruction (at$C4CA
and$C4CC
). - And there’s an actual bug. A subtle one and only concerning an edge-case, but still a bug!
The CRUNCH Bug
This bug is exclusive to Commodre BASIC, as it is related to PETSCII encoding. As we have seen, keyword abbreviations are really a side effect of how shifted characters are encoding in PETSCII, matching a character with a set sign-bit as stored in the keword list. The tokenizer routine is entirely agnostic of this.
This also means that we can enter a keyword terminating character. While this will be rejected in a leading position, it will be processed like any other byte, when we read ahead trying to match any keyword candidates. As far as the keyword search code is concerned, any such premature matching of the terminating difference of 0x80
is just a side effect of the subtraction working either way, regardless of which of the two compared bytes has the sign-bit set.
Now, what happens, if we enter a shifted a shifted character at the final position of a keyword? We’ll have a perfect character match with zero difference! Meaning, the routine will still go on, trying to match what must remaining characters of the keyword candidate, thus spilling over into next keyword candidate, but without incrementing the count of keywords which have been inspected already. Should this eventually match (be it on this next keyword, or on any consecuitive try), the keyword count will be lagging behind by one, resulting in an incorrect token.
We can demonstrate this by a little trick of magic:
My attractive assistant will now enter some nonsensical BASIC code containing two consecutive keywords, and by the magic touch of the RETURN key, one of these keywords will vanish and the line will be transformed into perfectly valid BASIC!
To improve the performance and to amuse our esteemed audience, we will switch for this to the lower-case/upper-case character set:

I hope, you’re suitably impressed!
Peeking into memory, we can prove that the magical transformation is complete and not a fluke:
addr code semantics 0430 3A 04 link: $043A 0432 C8 00 line# 200 0434 8D token GOSUB 0435 20 31 30 30 ascii « 100» 0439 00 -EOL- 043A 00 00 -EOT-
Now, how does this work?
Well, “gosuB
” is number 13 in the keyword list (token 0x8D
) and “returN
” comes immedtiatly after this, at index 14 (token 0x8E
).
As the keyword matching code runs away over the perfect match of the shifted “B
” in both the input and the keyword candidate (indicating that we’re still in the middle of a valid keyword), it eventually matches “return
”, but without having incremented the keyword count. Thus, the resulting token is 0x8D
, the token for GOSUB
, while the entire string has been consumed and the tokenized line amounts to just “200 GOSUB 100
”.
The input string “return
”, though, is gone, gone with the magic wind of a run-away error!
And, by, this, we‘ve seen what‘s to be seen, in tokenizing with Commodore BASIC 1.0.
Further Developments
As already indicated, this was not to last. The upgraded “New ROM” with BASIC 2.0 (the one introducing proper IEEE488 disk support), brought various changes and modifications, also doing away with skipping over white space in keywords. And “LET HEN
” was no more.
BASIC 2.0 “New ROM”
All that had to be done to mend the “LET HEN
” bug was the ommision of two instructions.
The BASIC input buffer is now located at $0200
to $0250
(moved to outside of the zero-page — we can see now why those write instructions hadn’t been zero-page instructions, before) and the there’s a related extra instruction near the end for resetting the read cursor (for use as a double-byte pointer at $77
/ $78
).
Otherwise the code is still the same, as can be seen in this comparison:
BASIC 1.0 “Old ROM” C48D LDX $C9 C48F LDY #$04 C491 STY $60 C493 iC493 LDA $00,X C495 BPL iC49E C497 CMP #$FF C499 BEQ iC4DC C49B INX C49C BNE iC493 C49E iC49E CMP #$20 C4A0 BEQ iC4DC C4A2 STA $5B C4A4 CMP #$22 C4A6 BEQ iC500 C4A8 BIT $60 C4AA BVS iC4DC C4AC CMP #$3F C4AE BNE iC4B4 C4B0 LDA #$99 C4B2 BNE iC4DC C4B4 iC4B4 CMP #$30 C4B6 BCC iC4BC C4B8 CMP #$3C C4BA BCC iC4DC C4BC iC4BC STY $C0 C4BE LDY #$00 C4C0 STY $5C C4C2 DEY C4C3 STX $C9 C4C5 DEX C4C6 iC4C6 INY C4C7 iC4C7 INX C4C8 iC4C8 LDA $00,X C4CA CMP #$20 C4CC BEQ iC4C7 C4CE SEC C4CF SBC $C092,Y C4D2 BEQ iC4C6 C4D4 CMP #$80 C4D6 BNE iC507 C4D8 ORA $5C C4DA iC4DA LDY $C0 C4DC iC4DC INX C4DD INY C4DE STA $0005,Y C4E1 LDA $0005,Y C4E4 BEQ iC51A C4E6 SEC C4E7 SBC #$3A C4E9 BEQ iC4EF C4EB CMP #$49 C4ED BNE iC4F1 C4EF iC4EF STA $60 C4F1 iC4F1 SEC C4F2 SBC #$55 C4F4 BNE iC493 C4F6 STA $5B C4F8 iC4F8 LDA $00,X C4FA BEQ iC4DC C4FC CMP $5B C4FE BEQ iC4DC C500 iC500 INY C501 STA $0005,Y C504 INX C505 BNE iC4F8 C507 iC507 LDX $C9 C509 INC $5C C50B iC50B INY C50C LDA $C091,Y C50F BPL iC50B C511 LDA $C092,Y C514 BNE iC4C8 C516 LDA $00,X C518 BPL iC4DA C51A iC51A STA $0007,Y C51D LDA #$09 C51F STA $C9 C521 RTS
BASIC 2.0 “New ROM” C495 LDX $77 C497 LDY #$04 C499 STY $09 C49B iC49B LDA $0200,X C49E BPL iC4A7 C4A0 CMP #$FF C4A2 BEQ iC4E2 C4A4 INX C4A5 BNE iC49B C4A7 iC4A7 CMP #$20 C4A9 BEQ iC4E2 C4AB STA $04 C4AD CMP #$22 C4AF BEQ iC507 C4B1 BIT $09 C4B3 BVS iC4E2 C4B5 CMP #$3F C4B7 BNE iC4BD C4B9 LDA #$99 C4BB BNE iC4E2 C4BD iC4BD CMP #$30 C4BF BCC iC4C5 C4C1 CMP #$3C C4C3 BCC iC4E2 C4C5 iC4C5 STY $6E C4C7 LDY #$00 C4C9 STY $05 C4CB DEY C4CC STX $77 C4CE DEX C4CF iC4CF INY C4D0 INX C4D1 iC4D1 LDA $0200,X C4D4 SEC C4D5 SBC $C092,Y C4D8 BEQ iC4CF C4DA CMP #$80 C4DC BNE iC50E C4DE ORA $05 C4E0 iC4E0 LDY $6E C4E2 iC4E2 INX C4E3 INY C4E4 STA $01FB,Y C4E7 LDA $01FB,Y C4EA BEQ iC522 C4EC SEC C4ED SBC #$3A C4EF BEQ iC4F5 C4F1 CMP #$49 C4F3 BNE iC4F7 C4F5 iC4F5 STA $09 C4F7 iC4F7 SEC C4F8 SBC #$55 C4FA BNE iC49B C4FC STA $04 C4FE iC4FE LDA $0200,X C501 BEQ iC4E2 C503 CMP $04 C505 BEQ iC4E2 C507 iC507 INY C508 STA $01FB,Y C50B INX C50C BNE iC4FE C50E iC50E LDX $77 C510 INC $05 C512 iC512 INY C513 LDA $C091,Y C516 BPL iC512 C518 LDA $C092,Y C51B BNE iC4D1 C51D LDA $0200,X C520 BPL iC4E0 C522 iC522 STA $01FD,Y C525 DEC $78 C527 LDA #$FF C529 STA $77 C52B RTS
Accordingly, “LE THEN
” is tokenized as expected:
### COMMODORE BASIC ### 7167 BYTES FREE READY. 10 IF LS = LE THEN GOTO 100 LIST 10 IF LS = LE THEN GOTO 100 READY.
And, as in memory:
addr code semantics 0401 17 04 link: $0417 0403 0A 00 line# 10 0405 8B token IF 0406 20 4C 53 20 ascii « LS » 040A B2 token = 040B 20 4C 45 20 ascii « LE » 040F A7 token THEN 0410 20 ascii « » 0411 89 token GOTO 0412 20 31 30 30 ascii « 100» 0416 00 -EOL- 0417 00 00 -EOT-
But now, there was a problem with two-word “GO TO
”, the traditional and most frequent use of white space skipping in keywords. For this, the keyword list was amended for another token, “GO
”, token ID 0xCB
:
code keyword token index ... 4C 45 46 54 A4 LEFT$ 0xC8 (72) 52 49 47 48 54 A4 RIGHT$ 0xC9 (73) 4D 49 44 A4 MID$ 0xCA (74) 47 CF GO 0xCB (75) 00 <end-of-list>
The new token “GO
” bears all the hallmarks of an afterthought: It works only in place of a plain “GOTO
” and relies on the already existing token “TO
” (as in “FOR … TO
”). As there is no token for “SUB
”, we can‘t use it for “GOSUB
” and a two-word “GO TO
” isn’t checked for in a “ON … GOTO …
” construct, either.
But, it’s for this new token that we can still write and successfully execute
100 GO TO 200
and even:
100 IF A > B THEN GO TO 200
But, we can’t use “GO TO
’ in place of “THEN
”, like “GOTO
”:
100 IF A > B GO TO 200 RUN ?SYNTAX ERROR IN 100 █
While more of a superficial fix, “GO
” makes for a rather baffling new effect of the CRUNCH
run-away bug:
(here for better readability in the lower-case/upper-case character set)

GO
”.What happens here? Well, when the routine runs away over the shifted “B
” in “gosubB
” and subsequently fails on matching the next keyword candidate, “RETURN
” as the remaining part, it continues the keyword search as usual, but with the keyword counter now lagging behind by one. As it finally finds “GO
” at the very end of the list, the resulting token is thus 0xCA
instead of the correct 0xCB
. — And 0xCA
is really the token for “MID$
”, found just before “GO
” in the keword list!
Subsequently, the remaining “suB
” is parsed as plain ASCII text, with the shifted letter “B
” rejected.
Thus, we end up with the following tokenized line of BASIC:
addr code semantics 0401 0D 04 link: $040D 0403 0A 00 line# 10 0405 CA token MID$ 0406 53 55 20 31 30 30 ascii «SU 100» 040C 00 -EOL- 040D 00 00 -EOT-
— Yay, this worked out greatly! —
Having seen the major attraction, we may reassueredly turn our attention to the next major iteration, which is
Commodore BASIC 4.0
This one came with new commands for disk control, some new PETSCII control characters (like for switching the character set or setting tabulators), and featured an enhanced editor with support for things like sub-windowing for proper business use on the 40xx/80xx series. Since this required another 4K of ROM, at 0xB000
– 0XBFFF
, this wasn’t available for the original PET 2001 (the one with the built-in Datasette), which lacked the required ROM slot. But it was available as an upgrade for the PET 2001-N and other non-CRTC PETs (but here, the code for the new editor features was disabled, while the BASIC ROMs where the same).
Here’s the new version in comparison to Commodore BASIC 2.0 (“New ROM”), it shared the system addresses (zero-page, etc.) with:
BASIC 2.0 “New ROM” C495 LDX $77 C497 LDY #$04 C499 STY $09 C49B iC49B LDA $0200,X C49E BPL iC4A7 C4A0 CMP #$FF C4A2 BEQ iC4E2 C4A4 INX C4A5 BNE iC49B C4A7 iC4A7 CMP #$20 C4A9 BEQ iC4E2 C4AB STA $04 C4AD CMP #$22 C4AF BEQ iC507 C4B1 BIT $09 C4B3 BVS iC4E2 C4B5 CMP #$3F C4B7 BNE iC4BD C4B9 LDA #$99 C4BB BNE iC4E2 C4BD iC4BD CMP #$30 C4BF BCC iC4C5 C4C1 CMP #$3C C4C3 BCC iC4E2 C4C5 iC4C5 STY $6E C4C7 LDY #$00 C4C9 STY $05 C4CB DEY C4CC STX $77 C4CE DEX C4CF iC4CF INY C4D0 INX C4D1 iC4D1 LDA $0200,X C4D4 SEC C4D5 SBC $C092,Y C4D8 BEQ iC4CF C4DA CMP #$80 C4DC BNE iC50E C4DE ORA $05 C4E0 iC4E0 LDY $6E C4E2 iC4E2 INX C4E3 INY C4E4 STA $01FB,Y C4E7 LDA $01FB,Y C4EA BEQ iC522 C4EC SEC C4ED SBC #$3A C4EF BEQ iC4F5 C4F1 CMP #$49 C4F3 BNE iC4F7 C4F5 iC4F5 STA $09 C4F7 iC4F7 SEC C4F8 SBC #$55 C4FA BNE iC49B C4FC STA $04 C4FE iC4FE LDA $0200,X C501 BEQ iC4E2 C503 CMP $04 C505 BEQ iC4E2 C507 iC507 INY C508 STA $01FB,Y C50B INX C50C BNE iC4FE C50E iC50E LDX $77 C510 INC $05 C512 iC512 INY C513 LDA $C091,Y C516 BPL iC512 C518 LDA $C092,Y C51B BNE iC4D1 C51D LDA $0200,X C520 BPL iC4E0 C522 iC522 STA $01FD,Y C525 DEC $78 C527 LDA #$FF C529 STA $77 C52B RTS
BASIC 4.0 B4FB LDX $77 B4FD LDY #$04 B4FF STY $09 B501 iB501 LDA $0200,X B504 BPL iB50D B506 CMP #$FF B508 BEQ iB554 B50A INX B50B BNE iB501 B50D iB50D CMP #$20 B50F BEQ iB554 B511 STA $04 B513 CMP #$22 B515 BEQ iB579 B517 BIT $09 B519 BVS iB554 B51B CMP #$3F B51D BNE iB523 B51F LDA #$99 B521 BNE iB554 B523 iB523 CMP #$30 B525 BCC iB52B B527 CMP #$3C B529 BCC iB554 B52B iB52B STY $6E B52D LDY #$00 B52F STY $05 B531 STX $77 B533 LDA #$B0 B535 STA $20 B537 LDA #$B2 B539 STA $1F B53B BNE iB544 B53D iB53D INX B53E INC $1F B540 BNE iB544 B542 INC $20 B544 iB544 LDA $0200,X B547 SEC B548 SBC ($1F),Y B54A BEQ iB53D B54C CMP #$80 B54E BNE iB580 B550 ORA $05 B552 iB552 LDY $6E B554 iB554 INX B555 INY B556 STA $01FB,Y B559 LDA $01FB,Y B55C BEQ iB599 B55E SEC B55F SBC #$3A B561 BEQ iB567 B563 CMP #$49 B565 BNE iB569 B567 iB567 STA $09 B569 iB569 SEC B56A SBC #$55 B56C BNE iB501 B56E STA $04 B570 iB570 LDA $0200,X B573 BEQ iB554 B575 CMP $04 B577 BEQ iB554 B579 iB579 INY B57A STA $01FB,Y B57D INX B57E BNE iB570 B580 iB580 LDX $77 B582 INC $05 B584 iB584 LDA ($1F),Y B586 PHP B587 INC $1F B589 BNE iB58D B58B INC $20 B58D iB58D PLP B58E BPL iB584 B590 LDA ($1F),Y B592 BNE iB544 B594 LDA $0200,X B597 BPL iB552 B599 iB599 STA $01FD,Y B59C DEC $78 B59E LDA #$FF B5A0 STA $77 B5A2 RTS
The changes are really just about how the keyword list is accessed. While it was previously done as in
SBC $C092,Y ;$C4D5 LDA $C091,Y ;$C513
the keyword list is now accessed in post-incremented indirect address mode via a pointer in 0x1F
/ 0x20
:
SBC ($1F),Y ;$B548 LDA ($1F),Y ;$B590
Where previously the buffer was transformed in-place, just to be copied to memory in another step, now, thanks to the use of this pointer, we may write the output to anywhere in memory, just as we please.
All other changes are related to this, setting and updating the pointer in 0x1F
and 0x20
accordingly.
(Moreover, the read cursor in X
is now generally adjusted first, before the new pointer is handled.)
But there’s also a curious peculiarity to this new code, found at $B584
:
B584 iB584 LDA ($1F),Y B586 PHP ;push A to stack B587 INC $1F B589 BNE iB58D B58B INC $20 B58D iB58D PLP ;pull A from stack
What are PHA
and PLA
meant to achieve? Neither BNE
nor INC
affect the accumulator!
It‘s more like someone familiar with an other machine, say, the DEC PDP-1, where the IDX
instruction, similar to INC
, leaves the incremented value in the accumulator as a side effect, had a hand in this. On the 6502, though, this is not the case and these two instructions are entirely luxurious, achieving nothing but burning a few processor cycles.
(This piece of code is run whenever we failed to match a keyword candidate and read ahead in the list, in order to advance to the beginning of the next keyword candidate. Luckily, the input of a BASIC line isn’t that time critical.)
Anyways, none of these changes affect the run-away bug, which is still what it ever was:
*** commodore basic 4.0 *** 31743 bytes free ready. 10 gosuB 100 20 gosuBreturn 100 list 10 mid$su 100 20 gosub 100 ready. █
Beyond the PET
BASIC V.2 for the VIC-20 and C64 is essentially derived from BASIC 2.0 for the PET. BASIC V.2 does away with the second cassette buffer, adds support for color (i.e., code maintaining the color RAM), implements IEEE-over serial (somewhat unfortunately so, as the clock synchronization didn’t work out as expected), adds support for RS-232C via the user port, and adds a few control characters (much like it’s found in BASIC 4.0 for the PET). But, otherwise it’s still the same.
Commodore 64 BASIC V.2, Kernal Rev. 2
Hence, it’s small wonder that the CRUNCH
routine for rev. 2 of the C64 ROM is still the same.
(We may safely assume it’s also the same for the VIC and the rarer rev. 1 Kernal. I’ve to admit, I didn’t even check.)
System addresses are now slightly different, otherwise, it’s still the same, as is the keyword/token list:
PET BASIC 2.0 “New ROM” C495 LDX $77 C497 LDY #$04 C499 STY $09 C49B iC49B LDA $0200,X C49E BPL iC4A7 C4A0 CMP #$FF C4A2 BEQ iC4E2 C4A4 INX C4A5 BNE iC49B C4A7 iC4A7 CMP #$20 C4A9 BEQ iC4E2 C4AB STA $04 C4AD CMP #$22 C4AF BEQ iC507 C4B1 BIT $09 C4B3 BVS iC4E2 C4B5 CMP #$3F C4B7 BNE iC4BD C4B9 LDA #$99 C4BB BNE iC4E2 C4BD iC4BD CMP #$30 C4BF BCC iC4C5 C4C1 CMP #$3C C4C3 BCC iC4E2 C4C5 iC4C5 STY $6E C4C7 LDY #$00 C4C9 STY $05 C4CB DEY C4CC STX $77 C4CE DEX C4CF iC4CF INY C4D0 INX C4D1 iC4D1 LDA $0200,X C4D4 SEC C4D5 SBC $C092,Y C4D8 BEQ iC4CF C4DA CMP #$80 C4DC BNE iC50E C4DE ORA $05 C4E0 iC4E0 LDY $6E C4E2 iC4E2 INX C4E3 INY C4E4 STA $01FB,Y C4E7 LDA $01FB,Y C4EA BEQ iC522 C4EC SEC C4ED SBC #$3A C4EF BEQ iC4F5 C4F1 CMP #$49 C4F3 BNE iC4F7 C4F5 iC4F5 STA $09 C4F7 iC4F7 SEC C4F8 SBC #$55 C4FA BNE iC49B C4FC STA $04 C4FE iC4FE LDA $0200,X C501 BEQ iC4E2 C503 CMP $04 C505 BEQ iC4E2 C507 iC507 INY C508 STA $01FB,Y C50B INX C50C BNE iC4FE C50E iC50E LDX $77 C510 INC $05 C512 iC512 INY C513 LDA $C091,Y C516 BPL iC512 C518 LDA $C092,Y C51B BNE iC4D1 C51D LDA $0200,X C520 BPL iC4E0 C522 iC522 STA $01FD,Y C525 DEC $78 C527 LDA #$FF C529 STA $77 C52B RTS
C64 BASIC V.2, Kernal Rev. 2 A57C LDX $7A A57E LDY #$04 A580 STY $0F A582 iA582 LDA $0200,X A585 BPL iA58E A587 CMP #$FF A589 BEQ iA5C9 A58B INX A58C BNE iA582 A58E iA58E CMP #$20 A590 BEQ iA5C9 A592 STA $08 A594 CMP #$22 A596 BEQ iA5EE A598 BIT $0F A59A BVS iA5C9 A59C CMP #$3F A59E BNE iA5A4 A5A0 LDA #$99 A5A2 BNE iA5C9 A5A4 iA5A4 CMP #$30 A5A6 BCC iA5AC A5A8 CMP #$3C A5AA BCC iA5C9 A5AC iA5AC STY $71 A5AE LDY #$00 A5B0 STY $0B A5B2 DEY A5B3 STX $7A A5B5 DEX A5B6 iA5B6 INY A5B7 INX A5B8 iA5B8 LDA $0200,X A5BB SEC A5BC SBC $A09E,Y A5BF BEQ iA5B6 A5C1 CMP #$80 A5C3 BNE iA5F5 A5C5 ORA $0B A5C7 iA5C7 LDY $71 A5C9 iA5C9 INX A5CA INY A5CB STA $01FB,Y A5CE LDA $01FB,Y A5D1 BEQ iA609 A5D3 SEC A5D4 SBC #$3A A5D6 BEQ iA5DC A5D8 CMP #$49 A5DA BNE iA5DE A5DC iA5DC STA $0F A5DE iA5DE SEC A5DF SBC #$55 A5E1 BNE iA582 A5E3 STA $08 A5E5 iA5E5 LDA $0200,X A5E8 BEQ iA5C9 A5EA CMP $08 A5EC BEQ iA5C9 A5EE iA5EE INY A5EF STA $01FB,Y A5F2 INX A5F3 BNE iA5E5 A5F5 iA5F5 LDX $7A A5F7 INC $0B A5F9 iA5F9 INY A5FA LDA $A09D,Y A5FD BPL iA5F9 A5FF LDA $A09E,Y A602 BNE iA5B8 A604 LDA $0200,X A607 BPL iA5C7 A609 iA609 STA $01FD,Y A60C DEC $7B A60E LDA #$FF A610 STA $7A A612 RTS
As may be expected from this, the code is also functionally the same.
Commodore 64 BASIC V.2, Kernal Rev. 3
In Kernal Rev. 3, the last incarnation of this code, which is also found in the C128, the tokenizing is still the same, while its address has moved by 4 bytes in ROM:
C64 Kernal Rev. 2 A57C LDX $7A A57E LDY #$04 A580 STY $0F A582 iA582 LDA $0200,X A585 BPL iA58E A587 CMP #$FF A589 BEQ iA5C9 A58B INX A58C BNE iA582 A58E iA58E CMP #$20 A590 BEQ iA5C9 A592 STA $08 A594 CMP #$22 A596 BEQ iA5EE A598 BIT $0F A59A BVS iA5C9 A59C CMP #$3F A59E BNE iA5A4 A5A0 LDA #$99 A5A2 BNE iA5C9 A5A4 iA5A4 CMP #$30 A5A6 BCC iA5AC A5A8 CMP #$3C A5AA BCC iA5C9 A5AC iA5AC STY $71 A5AE LDY #$00 A5B0 STY $0B A5B2 DEY A5B3 STX $7A A5B5 DEX A5B6 iA5B6 INY A5B7 INX A5B8 iA5B8 LDA $0200,X A5BB SEC A5BC SBC $A09E,Y A5BF BEQ iA5B6 A5C1 CMP #$80 A5C3 BNE iA5F5 A5C5 ORA $0B A5C7 iA5C7 LDY $71 A5C9 iA5C9 INX A5CA INY A5CB STA $01FB,Y A5CE LDA $01FB,Y A5D1 BEQ iA609 A5D3 SEC A5D4 SBC #$3A A5D6 BEQ iA5DC A5D8 CMP #$49 A5DA BNE iA5DE A5DC iA5DC STA $0F A5DE iA5DE SEC A5DF SBC #$55 A5E1 BNE iA582 A5E3 STA $08 A5E5 iA5E5 LDA $0200,X A5E8 BEQ iA5C9 A5EA CMP $08 A5EC BEQ iA5C9 A5EE iA5EE INY A5EF STA $01FB,Y A5F2 INX A5F3 BNE iA5E5 A5F5 iA5F5 LDX $7A A5F7 INC $0B A5F9 iA5F9 INY A5FA LDA $A09D,Y A5FD BPL iA5F9 A5FF LDA $A09E,Y A602 BNE iA5B8 A604 LDA $0200,X A607 BPL iA5C7 A609 iA609 STA $01FD,Y A60C DEC $7B A60E LDA #$FF A610 STA $7A A612 RTS
C64 Kernal Rev. 3 A578 LDX $7A A57A LDY #$04 A57C STY $0F A57E iA57E LDA $0200,X A581 BPL iA58A A583 CMP #$FF A585 BEQ iA5C5 A587 INX A588 BNE iA57E A58A iA58A CMP #$20 A58C BEQ iA5C5 A58E STA $08 A590 CMP #$22 A592 BEQ iA5EA A594 BIT $0F A596 BVS iA5C5 A598 CMP #$3F A59A BNE iA5A0 A59C LDA #$99 A59E BNE iA5C5 A5A0 iA5A0 CMP #$30 A5A2 BCC iA5A8 A5A4 CMP #$3C A5A6 BCC iA5C5 A5A8 iA5A8 STY $71 A5AA LDY #$00 A5AC STY $0B A5AE DEY A5AF STX $7A A5B1 DEX A5B2 iA5B2 INY A5B3 INX A5B4 iA5B4 LDA $0200,X A5B7 SEC A5B8 SBC $A09E,Y A5BB BEQ iA5B2 A5BD CMP #$80 A5BF BNE iA5F1 A5C1 ORA $0B A5C3 iA5C3 LDY $71 A5C5 iA5C5 INX A5C6 INY A5C7 STA $01FB,Y A5CA LDA $01FB,Y A5CD BEQ iA605 A5CF SEC A5D0 SBC #$3A A5D2 BEQ iA5D8 A5D4 CMP #$49 A5D6 BNE iA5DA A5D8 iA5D8 STA $0F A5DA iA5DA SEC A5DB SBC #$55 A5DD BNE iA57E A5DF STA $08 A5E1 iA5E1 LDA $0200,X A5E4 BEQ iA5C5 A5E6 CMP $08 A5E8 BEQ iA5C5 A5EA iA5EA INY A5EB STA $01FB,Y A5EE INX A5EF BNE iA5E1 A5F1 iA5F1 LDX $7A A5F3 INC $0B A5F5 iA5F5 INY A5F6 LDA $A09D,Y A5F9 BPL iA5F5 A5FB LDA $A09E,Y A5FE BNE iA5B4 A600 LDA $0200,X A603 BPL iA5C3 A605 iA605 STA $01FD,Y A608 DEC $7B A60A LDA #$FF A60C STA $7A A60E RTS
Therefor, all the observation, we made for PET BASIC 1.0 (less the two instructions for white space skipping in keywords) and BASIC 2.0 are still valid, half a decade later.
BTW, you can easily discern the Kernal revision of a C64 by inspecting the contents of ROM location $FF80
(as in “PEEK (65408)
”):
0xAA
(dec.170
) … Rev. 1 (the address contains just the usual fill byte)0x00
(dec.0
) … Rev. 20x03
(dec.3
) … Rev. 30x43
(dec.67
) … SX-64 (similar to Rev. 3, but there are no tape routines and screen colors differ)0x64
(dec.100
) … CBM 4064 / Educator 64
And that’s it, for today. — With some of the questions actually answered, we drop the curtain (in lesser dismay.)
Norbert Landsteiner,
Vienna, 2025-07-05