Functions of an Assembler
- Remove comments.
- Replace named constants and labels.
- Identify data or variable space to be located in memory.
- Assign addresses to instructions and data.
- Convert textual instructions to binary.
Implementing an Assembler
Single-Pass Assembler
For very basic assembly, it is possible to convert instructions to binary line-by-line.
MOV R1, #17 ; → 0xE3A01011
MOV R2, #20 ; → 0xE3A02014
ADD R0, R1, R2 ; → 0xE0810002
Two-Pass Assembler
Forward references cannot be resolved in a single pass, so two passes are required. The first pass:
- Assigns an address to each instruction and data word.
- Creates a Symbol Table to map labels to addresses.
- Ensures that there are no conflicting names.
The second pass then uses this data to:
- Replace labels with their values.
- Convert instructions to binary.
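The two passes above can be sketched in Python as follows. This is a minimal illustration rather than a full assembler: it assumes every instruction occupies one 4-byte word, that a label is any leading word that is not a mnemonic, and that the instruction set is the tiny hypothetical MNEMONICS set. The names first_pass and second_pass are chosen for this example only.

```python
MNEMONICS = {"MOV", "ADD", "B"}  # hypothetical, deliberately tiny instruction set


def first_pass(lines, origin=0, word_size=4):
    """Assign an address to each statement and build the symbol table."""
    symbols = {}
    statements = []  # (address, mnemonic, operands)
    address = origin
    for line in lines:
        # Drop the comment, then split off the first word of the line.
        fields = line.split(";", 1)[0].split(None, 1)
        if not fields:
            continue
        if fields[0].upper() not in MNEMONICS:
            # A leading word that is not a mnemonic is treated as a label.
            label = fields[0]
            if label in symbols:
                raise ValueError(f"duplicate label: {label}")
            symbols[label] = address
            fields = fields[1].split(None, 1) if len(fields) > 1 else []
        if fields:
            mnemonic = fields[0].upper()
            operands = [op.strip() for op in fields[1].split(",")] if len(fields) > 1 else []
            statements.append((address, mnemonic, operands))
            address += word_size
    return symbols, statements


def second_pass(symbols, statements):
    """Replace each label operand with its address from the symbol table."""
    return [
        (address, mnemonic, [symbols.get(op, op) for op in operands])
        for address, mnemonic, operands in statements
    ]


source = [
    "start MOV R1, #17 ; first instruction",
    "      B   end     ; forward reference to a label defined later",
    "      ADD R0, R1, R2",
    "end   MOV R2, #20",
]
symbols, statements = first_pass(source)
resolved = second_pass(symbols, statements)
```

Here the branch to end is seen before end is defined, which is why the symbol table has to be complete before any operand is resolved. Converting the resolved statements to binary is left out of this sketch.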
Assembling Complex Pseudo-Instructions
If the addresses are sufficiently close, then ADRL R0, data can be implemented as a single instruction using an offset from the PC:
ADD R0, PC, #(data-[PC])
If this is not possible, then it will have to be implemented using multiple instructions. This will cause the addresses of future symbols (including data) to change. Updating other instructions accordingly would require further passes.
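Whether the single-instruction form is possible depends on whether the offset can be encoded as an immediate operand. The sketch below assumes the classic 32-bit ARM rule that a data-processing immediate is an 8-bit value rotated right by an even number of bits, and that a negative offset is handled with SUB instead of ADD. The function names are hypothetical, and the offset is taken as already calculated.

```python
def fits_arm_immediate(value):
    """True if value is an 8-bit constant rotated right by an even amount."""
    value &= 0xFFFFFFFF
    for rotation in range(0, 32, 2):
        # Rotating left by `rotation` undoes a rotate-right by the same amount.
        undone = ((value << rotation) | (value >> (32 - rotation))) & 0xFFFFFFFF
        if undone < 0x100:
            return True
    return False


def adr_as_single_instruction(offset, rd="R0"):
    """Return the one-instruction form, or None if several instructions are needed."""
    op = "ADD" if offset >= 0 else "SUB"
    if fits_arm_immediate(abs(offset)):
        return f"{op} {rd}, PC, #{abs(offset)}"
    return None
```

For example, adr_as_single_instruction(64) gives a single ADD, while an offset such as 0x10101 has set bits too far apart for one immediate, so the function returns None and the assembler would have to emit more than one instruction.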
Conversion of Instructions to Binary
Parsing a single instruction and producing binary output follows the same stages as a compiler, albeit in a much simpler form.
Lexical Analysis
The instruction is first broken into a list of individual tokens. For example, start ADD R0, R1, #5 may be broken down as follows:
| identifier | identifier | identifier | comma | identifier | comma | hash | number |
|---|---|---|---|---|---|---|---|
| start | ADD | R0 | , | R1 | , | # | 5 |
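A tokeniser producing the token types in the table can be sketched with Python's re module. The token names follow the table; the patterns themselves (decimal numbers only, identifiers made of letters, digits and underscores) are assumptions made for this example.

```python
import re

# Token types in the order they are tried; "skip" matches whitespace to discard.
TOKEN_SPEC = [
    ("number", r"\d+"),
    ("identifier", r"[A-Za-z_][A-Za-z0-9_]*"),
    ("comma", r","),
    ("hash", r"#"),
    ("skip", r"\s+"),
]
TOKEN_RE = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenize(line):
    """Break one line of assembly into (type, text) tokens."""
    tokens = []
    position = 0
    while position < len(line):
        match = TOKEN_RE.match(line, position)
        if match is None:
            raise ValueError(f"unexpected character {line[position]!r} at column {position}")
        if match.lastgroup != "skip":
            tokens.append((match.lastgroup, match.group()))
        position = match.end()
    return tokens


tokens = tokenize("start ADD R0, R1, #5")
```

At this stage start, ADD, R0 and R1 are all plain identifiers; deciding which is a label, an instruction or a register is left to the next stage.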
Syntactical Analysis
It is determined whether each identifier refers to a label, instruction, register, literal, etc. It is then determined whether the sequence of tokens forms a complete, legal instruction according to the grammar of the language.
Semantic Analysis
This stage ensures that the correct number of operands is supplied and that they are of the right format for the given instruction, and resolves arguments where possible.
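The syntactic and semantic stages can be sketched together in Python. The input is a list of (type, text) tokens of the kind produced by lexical analysis. The register set, the OPERAND_FORMATS table and the function names parse and check are all hypothetical and cover only MOV and ADD.

```python
REGISTERS = {f"R{n}" for n in range(16)} | {"PC", "SP", "LR"}
# Hypothetical table: the operand kinds each instruction accepts.
OPERAND_FORMATS = {
    "MOV": [("register", "register"), ("register", "immediate")],
    "ADD": [("register", "register", "register"), ("register", "register", "immediate")],
}


def parse(tokens):
    """Syntactic analysis: classify identifiers and group them into a statement."""
    tokens = list(tokens)
    label = None
    # A leading identifier that is not a known instruction is taken to be a label.
    if tokens and tokens[0][0] == "identifier" and tokens[0][1].upper() not in OPERAND_FORMATS:
        label = tokens.pop(0)[1]
    if not tokens or tokens[0][0] != "identifier":
        raise SyntaxError("expected an instruction mnemonic")
    mnemonic = tokens.pop(0)[1].upper()
    operands = []
    while tokens:
        kind, text = tokens.pop(0)
        if kind == "hash":
            if not tokens or tokens[0][0] != "number":
                raise SyntaxError("expected a number after '#'")
            operands.append(("immediate", int(tokens.pop(0)[1])))
        elif kind == "identifier" and text.upper() in REGISTERS:
            operands.append(("register", text.upper()))
        elif kind == "identifier":
            operands.append(("label", text))
        else:
            raise SyntaxError(f"unexpected token {text!r}")
        if tokens:
            # Operands must be separated by commas, with no trailing comma.
            if tokens.pop(0)[0] != "comma" or not tokens:
                raise SyntaxError("expected ',' followed by another operand")
    return label, mnemonic, operands


def check(mnemonic, operands):
    """Semantic analysis: operand count and kinds must match an accepted format."""
    if mnemonic not in OPERAND_FORMATS:
        raise ValueError(f"unknown instruction: {mnemonic}")
    kinds = tuple(kind for kind, _ in operands)
    if kinds not in OPERAND_FORMATS[mnemonic]:
        raise ValueError(f"{mnemonic} does not accept operands {kinds}")


example_tokens = [
    ("identifier", "start"), ("identifier", "ADD"), ("identifier", "R0"), ("comma", ","),
    ("identifier", "R1"), ("comma", ","), ("hash", "#"), ("number", "5"),
]
label, mnemonic, operands = parse(example_tokens)
check(mnemonic, operands)  # raises ValueError if the operands are not acceptable
```

Keeping the accepted formats in a table means that supporting another instruction is a matter of adding an entry rather than writing new checking code.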
Code Generation
The outputs of the previous stages are used to generate the binary code for each instruction. Each section of the instruction maps to part of the binary output.
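As an illustration of fields mapping to bits, the sketch below packs the fields of a 32-bit ARM data-processing instruction into one word. It assumes the instruction always executes (condition AL), does not set the flags, and uses either an unrotated 8-bit immediate or an unshifted register as its second operand. The function name and argument layout are hypothetical.

```python
OPCODES = {"ADD": 0b0100, "MOV": 0b1101}


def encode_data_processing(mnemonic, rd, rn, operand2, immediate):
    """Pack the instruction fields into a 32-bit word."""
    limit = 0xFF if immediate else 0xF
    if not 0 <= operand2 <= limit:
        raise ValueError("operand 2 out of range for this simplified encoder")
    word = 0xE << 28                  # bits 31-28: condition AL (always)
    word |= int(immediate) << 25      # bit 25: operand 2 is an immediate
    word |= OPCODES[mnemonic] << 21   # bits 24-21: operation
    word |= rn << 16                  # bits 19-16: first operand register
    word |= rd << 12                  # bits 15-12: destination register
    word |= operand2                  # bits 11-0: immediate value or register number
    return word


mov_r1 = encode_data_processing("MOV", rd=1, rn=0, operand2=17, immediate=True)
mov_r2 = encode_data_processing("MOV", rd=2, rn=0, operand2=20, immediate=True)
add_r0 = encode_data_processing("ADD", rd=0, rn=1, operand2=2, immediate=False)
print(f"{mov_r1:#010X} {mov_r2:#010X} {add_r0:#010X}")
```

The three words reproduce the encodings listed for MOV R1, #17, MOV R2, #20 and ADD R0, R1, R2 in the single-pass example.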
Relative Addressing
The address at which the program will be loaded is not usually known at assembly time, so it is not practical to use absolute addresses. To overcome this, addresses are expressed as offsets from the PC (e.g. label becomes [PC, #(label-[PC])]).
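The short Python sketch below shows why a PC-relative offset does not depend on the load address. It assumes the classic ARM behaviour in which reading the PC gives the address of the current instruction plus 8; the function names and the example addresses are hypothetical.

```python
def pc_relative_offset(label_address, instruction_address, pc_ahead=8):
    """Offset that must be added to the PC to reach label_address."""
    return label_address - (instruction_address + pc_ahead)


def offset_when_loaded_at(load_address):
    """The same program placed at a different base address."""
    instruction_address = load_address + 0x10  # instruction 0x10 bytes into the program
    label_address = load_address + 0x40        # label 0x40 bytes into the program
    return pc_relative_offset(label_address, instruction_address)


offset_at_zero = offset_when_loaded_at(0x0000)
offset_at_8000 = offset_when_loaded_at(0x8000)
```

Both calls give the same offset, because moving the program shifts the label and the PC by the same amount and the difference between them is unchanged.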