This document describes the internals of the P65 system. Read this if you want to see the guts of how the system works, or if you're curious about how to implement a system with a lot of complex data structures in Perl. It is NOT necessary to read this if you just want to write code.
P65 doesn't act directly on the source - instead, it works with an intermediate representation (IR). To build the IR, it reads the files in a line at a time. The lexer breaks it up into tokens, and then the parser turns the line into one or more IR nodes.
A lexer token is a reference to a list of one or two elements, e.g., ["NUM", 23]. The first element is a string indicating what type it is, the second is the value. The lexer types are:
To distinguish opcodes from labels, the system keeps a string of all opcodes surrounded by dashes. Since dashes are illegal in identifiers, searching this string for "-str-" will succeed if and only if str is a real opcode.
Once a line has been broken up into tokens, this is passed to the parser. The parser uses a recursive descent technique to parse lines. Since the entire line has already been lexed into tokens, we can look ahead any amount on the line (attempting to look past end of line gives us an "EOL" token). Constructing a grammar for 6502 instructions is pretty straightforward. As it goes along, it builds up an IR list. Once the IR list has been made, the files are never touched again.
Pragmas are handled by having a hash table that connects pragma names (strings) to function references. This uniform interface makes adding new pragmas very easy; just define a new function and plug it into the pragma hashtable.
The IR is a list of lists. Perl doesn't support this directly, so it's really a list of references to lists. Each element in the IR list (a node) has the same first three elements. First comes the line number, then the current file name, then a string indicating what type of node it is. Any arguments that node takes come after that.
The arguments in nodes are either expressions from the parser (like <lbl+3) or simple strings (usually a label name of some kind). More complex arguments are represented as a four element list, with one element for prefix, type ("num" or "label"), value (integer or string as appropriate), then offset (an integer).
The meat of the IR, however, is the node types and arguments thereof. The basic components are:
.alias commands,
it's the argument listed..ascii and .incbin statements just
represent binary data, and thus are also represented as BYTE
nodes..word directive..org instruction. The
value is the address..checkpc instruction. The value is the address
bound..advance instruction. The value is the target
address.All other language features in P65 are implemented in terms of these.
The command .space var size becomes two IR
nodes:
LABEL var ^ SETPC ^+size
To implement anonymous labels, a running count is kept of how many such labels have already been created. The first label is named *1, the next *2, etc. Labels of the form ++ or - are computed based on current label count. Once the parse phase is over, temporary labels all have unique names and are treated as normal labels.
When simulating segmented memory, the parser keeps track of the currently active segment (defaults to "text"), and a hash table of how many times that segment has shown up. On a segment switch, it places a label of the form *oldsegment*n with a value of ^ (where n is the number of times it's shown up) and then sets the PC to the value of *newsegment*n (if such a label exists) or 0 (if it doesn't and this is a new segment). Essentially, it uses the label mechanism to keep track of where the PC for that segment should be, and only uses one variable for holding an actual active PC. Once the parse phase is complete, the memory model has been flattened and there is no need to care about segments anymore.
Once the IR has been constructed, the assembler can really get to work. First it checks the IR to ensure that all label references are valid. Then, it computes the values of labels and uses that to determine optimal instruction encoding. Finally, it produces the binary output, checking the properties whose verification must be delayed until the end. If an error occurs in any pass, further passes are aborted.
Passes are implemented using a walker. This steps through the IR list, and consults a dispatch table associated with the pass. Each node type is an entry in the dispatch table, and points to a function that handles that type of node. Unknown nodes may either flag errors or be ignored. Very simple passes that only focus on small numbers of operations usually ignore unknown node types.
It is the responsibility of each node to update the program counter appropriately. The walker always starts by setting the program counter to zero. (Remember, by this point segments have been flattened, so one program counter is all it needs to track.
Labels are maintained in a global hash table. These maps strings (the label names) to integers (the program counter values thereof).
The passes run in this order.
Once the assembly pass is completed, the list of bytes in memory is our desired output, and if there are no errors, this is written out.