Parsing and Tokenizing

I’m fairly happy with my parsing and tokenizing code now. I wanted to give a little breakdown of how it works.

The over-all goal here is to take a command from the user in the form:


where C is a command such as “E” for Examine, “D” for Deposit, etc., and store it in memory, tokenized and converted from ASCII to binary.

I wanted to give the user flexibility. For example, numbers do not need to be zero-padded on entry. You should be able to type E 1F and have the monitor know you mean E 001F, or D 2FF 1A and know that you mean D 02FF 1A.

I wanted whitespace between tokens to be ignored. Typing "E 1FF" should be the same as typing "E    1FF "

And finally, I wanted to support multiple forms of commands with the same tokenizing code. The Examine command can take either one or two 16-bit addresses as arguments—for example, E 100 1FF should dump all memory contents from 0100 to 01FF. But the Deposit command takes one 16-bit address and between one and sixteen 8-bit values, for example D 1234 1A 2B 3C to deposit three bytes starting at address 1234

So I decided I’d reserve 20 bytes of memory in page 0 to hold the tokenized, converted data.

  • Byte 0 stores the number of arguments entered, not including the command itself
  • Byte 1 stores the command, for example “E” for Examine or “D” for Deposit.
  • Bytes 2 and 3 store the first argument, which is always a 16-bit address value.
  • Bytes 4 and 5 store the second argument, which is sometimes a full 16-bits, and sometimes only 8-bits occupying byte 4.
  • Bytes 6 through 20 are always 8-bit values.

Each implemented command then knows how to use this parsed data to fulfill its operation.

[An aside: Using page zero for this is controversial in my mind — 21 bytes is a lot of space to use, and page zero is precious, so I will move it to page 2 at a later date. But it will work the same, just a change in addressing mode from Zero Page,X to Absolute,X would be required.]

How It’s Implemented

The implementation is fairly straightforward. Here’s a very rough flowchart with some details elided:


In general, the idea is to start at the beginning of IBUF, the input buffer, and scan until the start of a token is found. This location is then stored in TKST. Next, we continue scanning until the end of the token is found. The end is stored in TKND. Once it is, we walk the token backward, one character at a time, converting it from a hexadecimal ASCII representation into a number. Once we’ve reached the start of the token (or we’ve done 4 characters, whichever comes first), we know we’re done with the token. We jump back to TKND and start the process over again.

We do this until either:

  • The buffer is exhausted, or
  • We’ve scanned 17 arguments, total

whichever comes first. At that point, we fall through and start executing whatever command has been decoded.

The code is below.