Functions of an Assembler

  • Remove comments.
  • Replace named constants and labels with their values.
  • Identify data or variable space to be located in memory.
  • Assign addresses to instructions and data.
  • Convert textual instructions to binary.

Implementing an Assembler

Single-Pass Assembler

For very basic assembly code with no forward references, each instruction can be converted to binary line by line.

MOV R1, #17 ;    → 0xE3A01011
MOV R2, #20 ;    → 0xE3A02014
ADD R0, R1, R2 ; → 0xE0810002
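The line-by-line conversion can be sketched as below. This is a minimal illustration, not a full assembler: assemble_line and operand2 are hypothetical names, only MOV and ADD are handled, the condition is always "always" (0xE), and immediates are assumed to fit in 8 bits with no rotation.

```python
# Opcode field values for the two instructions used in the example above.
OPCODES = {"MOV": 0b1101, "ADD": 0b0100}

def operand2(text):
    """Return (immediate_flag, 12-bit field) for '#n' or 'Rn' (simplified)."""
    if text.startswith("#"):
        value = int(text[1:])
        if not 0 <= value <= 255:
            raise ValueError("this sketch only handles 8-bit immediates")
        return 1, value
    return 0, int(text[1:])  # register number, no shift applied

def assemble_line(line):
    """Convert one line to a 32-bit word without looking at any other line."""
    code = line.split(";", 1)[0]               # remove the comment
    mnemonic, rest = code.split(None, 1)
    args = [a.strip() for a in rest.split(",")]
    rd = int(args[0][1:])
    rn = int(args[1][1:]) if len(args) == 3 else 0   # MOV has no Rn
    imm, op2 = operand2(args[-1])
    return ((0xE << 28) | (imm << 25) | (OPCODES[mnemonic.upper()] << 21)
            | (rn << 16) | (rd << 12) | op2)

program = ["MOV R1, #17 ;", "MOV R2, #20 ;", "ADD R0, R1, R2 ;"]
words = [assemble_line(line) for line in program]
```

Each word depends only on its own line, which is why a single pass is enough here.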

Two-Pass Assembler

Forward references (uses of a label before the line that defines it) cannot be resolved in a single pass. To handle them, two passes are required. The first pass:

  • Assigns an address to each instruction and data word.
  • Creates a Symbol Table to map labels to addresses.
  • Ensures that there are no conflicting names.

The second pass then uses this data to:

  • Replace labels with their values.
  • Convert instructions to binary.
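The two passes can be sketched as follows. This is a simplified illustration under stated assumptions: first_pass, second_pass and MNEMONICS are hypothetical names, every instruction or data word occupies 4 bytes, a leading token that is not a known mnemonic is treated as a label, and the final conversion to binary is left out.

```python
# Hypothetical set of known mnemonics; anything else at the start of a line is a label.
MNEMONICS = {"MOV", "ADD", "B", "DEFW"}

def first_pass(lines, origin=0, word_size=4):
    """Assign addresses, build the symbol table and reject duplicate labels."""
    symbols, placed = {}, []
    address = origin
    for line in lines:
        tokens = line.split(";", 1)[0].replace(",", " ").split()
        if tokens and tokens[0].upper() not in MNEMONICS:
            label = tokens.pop(0)
            if label in symbols:
                raise ValueError(f"duplicate label: {label}")
            symbols[label] = address
        if tokens:
            placed.append((address, tokens[0].upper(), tokens[1:]))
            address += word_size
    return symbols, placed

def second_pass(symbols, placed):
    """Replace label operands with their addresses from the symbol table."""
    return [(address, mnemonic, [symbols.get(op, op) for op in operands])
            for address, mnemonic, operands in placed]

source = [
    "B end               ; forward reference",
    "loop ADD R0, R0, R1",
    "end MOV R2, #0",
]
symbols, placed = first_pass(source)
resolved = second_pass(symbols, placed)
```

When the first pass reaches B end it does not yet know where end is. By the second pass the symbol table is complete, so the operand can be replaced.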

Assembling Complex Pseudo-Instructions

If the addresses are sufficiently close, then ADRL R0, data can be implemented as a single instruction using an offset from the PC:

ADD R0, PC, #(data-[PC])

If this is not possible, it has to be implemented using multiple instructions. This changes the addresses of every symbol that follows (including data), and updating the other instructions accordingly would require further passes.
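The need for further passes can be sketched as a loop that repeats address assignment until no ADRL has to grow. This is an illustration only, not how any particular assembler works: the item format and the names fits_immediate and relax are hypothetical. It assumes the classic 32-bit ARM rules that an immediate is an 8-bit value rotated right by an even amount, and that the PC reads 8 bytes ahead of the current instruction.

```python
def fits_immediate(value):
    """True if value is an 8-bit constant rotated right by an even amount."""
    value &= 0xFFFFFFFF
    for rot in range(0, 32, 2):
        # Rotating left by rot undoes a rotate right by rot.
        rotated = ((value << rot) | (value >> (32 - rot))) & 0xFFFFFFFF
        if rotated < 256:
            return True
    return False

def relax(items, word_size=4):
    """items: (label or None, kind, arg) where kind is 'ADRL' (arg = target
    label) or 'WORDS' (arg = number of ordinary words). Returns the final
    symbol table, the size of each item in words and the number of passes."""
    sizes = [arg if kind == "WORDS" else 1 for _, kind, arg in items]
    passes = 0
    while True:
        passes += 1
        addresses, symbols, address = [], {}, 0
        for (label, _, _), size in zip(items, sizes):
            addresses.append(address)
            if label:
                symbols[label] = address
            address += size * word_size
        changed = False
        for i, (_, kind, arg) in enumerate(items):
            if kind == "ADRL" and sizes[i] == 1:
                offset = symbols[arg] - (addresses[i] + 8)
                # A negative offset would use SUB, so only the magnitude matters.
                if not fits_immediate(abs(offset)):
                    sizes[i] = 2        # grows, so later addresses move
                    changed = True
        if not changed:
            return symbols, sizes, passes

far = [(None, "ADRL", "data"), (None, "WORDS", 258), ("data", "WORDS", 1)]
symbols, sizes, passes = relax(far)
```

Sizes only ever grow in this sketch, so the loop always terminates.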

Conversion of Instructions to Binary

Parsing a single instruction and producing a binary output follows the same stages as a compiler, albeit in a much simpler form.

Lexical Analysis

The instruction is first broken into a list of individual tokens. For example, start ADD R0, R1, #5 may be broken down as follows:

Type:  identifier  identifier  identifier  comma  identifier  comma  hash  number
Text:  start       ADD         R0          ,      R1          ,      #     5
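A tokeniser producing this breakdown might look like the following sketch. The token names match the example above. TOKEN_SPEC and tokenize are hypothetical names, and the patterns are an assumption about what counts as an identifier or a number.

```python
import re

# One named pattern per token type; whitespace is matched but discarded.
TOKEN_SPEC = [
    ("identifier", r"[A-Za-z_][A-Za-z0-9_]*"),
    ("number", r"\d+"),
    ("comma", r","),
    ("hash", r"#"),
    ("skip", r"\s+"),
]
TOKEN_RE = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def tokenize(line):
    """Split one line of assembly into (type, text) pairs."""
    tokens, pos = [], 0
    while pos < len(line):
        match = TOKEN_RE.match(line, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {line[pos]!r} at column {pos}")
        if match.lastgroup != "skip":
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens

tokens = tokenize("start ADD R0, R1, #5")
```

At this stage start, ADD and R0 are all plain identifiers. Telling them apart is left to the next stage.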

Syntactical Analysis

Each identifier is classified as a label, instruction, register, literal etc. It is then determined whether the sequence of tokens forms a complete, legal instruction according to the grammar of the language.
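As a rough sketch, both steps can be written as a classifier followed by a small hand-written parser. The grammar assumed here is [label] mnemonic operand {, operand}, where an operand is a register or # followed by a number. The names MNEMONICS, classify and parse, and the restriction to registers R0 to R15, are assumptions made for illustration.

```python
import re

MNEMONICS = {"ADD", "SUB", "MOV"}                  # hypothetical subset
REGISTER_RE = re.compile(r"R(\d|1[0-5])", re.IGNORECASE)

def classify(tokens):
    """Decide whether each identifier is a mnemonic, a register or a label."""
    result = []
    for kind, text in tokens:
        if kind == "identifier":
            if text.upper() in MNEMONICS:
                kind = "mnemonic"
            elif REGISTER_RE.fullmatch(text):
                kind = "register"
            else:
                kind = "label"
        result.append((kind, text))
    return result

def parse(tokens):
    """Check the token sequence against the grammar; return its parts."""
    toks = classify(tokens)
    pos, label = 0, None
    if toks and toks[0][0] == "label":
        label, pos = toks[0][1], 1
    if pos >= len(toks) or toks[pos][0] != "mnemonic":
        raise SyntaxError("expected an instruction mnemonic")
    mnemonic = toks[pos][1].upper()
    pos += 1
    operands = []
    while True:
        if pos < len(toks) and toks[pos][0] == "register":
            operands.append(("register", int(toks[pos][1][1:])))
            pos += 1
        elif (pos + 1 < len(toks) and toks[pos][0] == "hash"
              and toks[pos + 1][0] == "number"):
            operands.append(("immediate", int(toks[pos + 1][1])))
            pos += 2
        else:
            raise SyntaxError("expected an operand")
        if pos == len(toks):
            return label, mnemonic, operands
        if toks[pos][0] != "comma":
            raise SyntaxError("expected a comma")
        pos += 1

tokens = [("identifier", "start"), ("identifier", "ADD"), ("identifier", "R0"),
          ("comma", ","), ("identifier", "R1"), ("comma", ","),
          ("hash", "#"), ("number", "5")]
parsed = parse(tokens)
```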

Semantic Analysis

This stage ensures that the correct number of operands is supplied and that each one is in the right format for the given instruction. It also resolves arguments where possible.
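These checks might be sketched as a table of operand signatures. The table, the name analyse, the ("kind", value) operand format and the 0 to 255 immediate range are all simplifying assumptions. Named constants that are already known are resolved, while other symbols are left for a later pass.

```python
REG = {"register"}
REG_OR_IMM = {"register", "immediate"}

# Hypothetical signatures: the operand kinds allowed in each position.
SIGNATURES = {
    "MOV": (REG, REG_OR_IMM),
    "ADD": (REG, REG, REG_OR_IMM),
    "SUB": (REG, REG, REG_OR_IMM),
}

def analyse(mnemonic, operands, constants=None):
    """Check operand count and format; resolve known named constants."""
    constants = constants or {}
    if mnemonic not in SIGNATURES:
        raise ValueError(f"unknown instruction: {mnemonic}")
    signature = SIGNATURES[mnemonic]
    if len(operands) != len(signature):
        raise ValueError(
            f"{mnemonic} expects {len(signature)} operands, got {len(operands)}")
    resolved = []
    for position, ((kind, value), allowed) in enumerate(zip(operands, signature), 1):
        if kind == "symbol":
            if value in constants:
                kind, value = "immediate", constants[value]   # resolve now
            elif "immediate" in allowed:
                resolved.append((kind, value))                # resolve later
                continue
        if kind not in allowed:
            raise ValueError(f"operand {position} of {mnemonic} has the wrong format")
        if kind == "immediate" and not 0 <= value <= 255:
            raise ValueError(f"immediate {value} is out of range in this sketch")
        resolved.append((kind, value))
    return resolved

checked = analyse("MOV", [("register", 1), ("symbol", "LIMIT")], {"LIMIT": 17})
```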

Code Generation

The outputs of the previous stages are used to generate the binary code for each instruction. Each section of the instruction maps to part of the binary output.
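The mapping from instruction sections to bits can be illustrated by packing a list of fields, most significant first. The sketch below assumes the classic 32-bit ARM data-processing layout (cond, 00, I, opcode, S, Rn, Rd, operand2). pack_fields is a hypothetical helper, and the field values are written out by hand rather than produced by the earlier stages.

```python
def pack_fields(fields):
    """Concatenate (name, width, value) fields into one 32-bit word."""
    word, total = 0, 0
    for name, width, value in fields:
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name} does not fit in {width} bits")
        word = (word << width) | value
        total += width
    if total != 32:
        raise ValueError(f"fields cover {total} bits, expected 32")
    return word

# MOV R1, #17
mov_r1_17 = pack_fields([
    ("cond", 4, 0b1110),      # always execute
    ("zero", 2, 0b00),        # data-processing instruction class
    ("I", 1, 1),              # operand2 is an immediate
    ("opcode", 4, 0b1101),    # MOV
    ("S", 1, 0),              # do not update the flags
    ("Rn", 4, 0),             # unused by MOV
    ("Rd", 4, 1),             # destination R1
    ("operand2", 12, 17),     # immediate value, no rotation
])
```

Here hex(mov_r1_17) gives 0xe3a01011, the encoding of MOV R1, #17.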

Relative Addressing

The address at which the program will be loaded is not usually known at assembly time, so it is not practical to use absolute addresses. To overcome this, addresses are referenced as offsets from the PC (e.g. label becomes [PC, #(label-[PC])]).
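A short sketch shows why the offset does not depend on the load address. The function names and the example addresses are made up for illustration, and PC_AHEAD assumes the classic 32-bit ARM behaviour where the PC reads 8 bytes past the current instruction.

```python
PC_AHEAD = 8  # assumption: PC value seen by an instruction = its address + 8

def pc_offset(label_address, instruction_address):
    """Offset stored in the instruction in place of an absolute address."""
    return label_address - (instruction_address + PC_AHEAD)

def effective_address(offset, instruction_address):
    """Address the processor computes at run time from PC + offset."""
    return instruction_address + PC_AHEAD + offset

def demo(load_address):
    # Hypothetical program: the instruction sits 0x10 bytes into the image
    # and the label it refers to sits 0x40 bytes into the image.
    instruction = load_address + 0x10
    label = load_address + 0x40
    offset = pc_offset(label, instruction)
    return offset, effective_address(offset, instruction)

at_zero = demo(0x0000)
at_8000 = demo(0x8000)
```

Both calls produce the same offset (40), so the same binary works at either load address, while the effective address follows the program to 0x40 or 0x8040.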