5-2 - Compilation and linking
*****************************
Compilation and link editing are the two main stages in the process
of translating a High Level Language (HLL) program to machine code
that can be executed on a CPU.
A small glossary
----------------
MACHINE INSTRUCTIONS Simple commands that the hardware can execute directly
GENERAL REGISTERS (or just REGISTERS) are small very fast memory units
built into the CPU, registers are used in almost all
CPU operations.
ADDRESSING MODES The methods in which a memory address can be specified
in a machine instruction.
PROGRAM COUNTER Special register keeping track of the memory location
of the next machine instruction to be executed.
CISC ARCHITECTURE Characterized by: full set of register-to-memory
instructions, variable length instructions, many
addressing modes, instructions typically decoded
and interpreted by a microprogram.
RISC ARCHITECTURE Characterized by: simple fixed-length instructions
(load/store), few addressing modes, instructions
typically decoded and executed directly by hardware.
ENTRY POINT Memory address where execution of a program/routine
starts.
CONTROL TRANSFER Stopping sequential execution of program code, and
jumping to a specified code location, performed by
changing the contents of the program counter.
JUMP Unconditional control transfer.
BRANCH Conditional control transfer.
MEMORY REFERENCE Accessing a specified place in memory that stores
some variable or machine instructions.
Machine instructions
--------------------
The CPU is an electronic machine, which constantly fetches
MACHINE INSTRUCTIONS from the main memory and executes them.
Machine instructions are simple commands that the hardware can
execute directly, typical (and basic) instructions:
1) Load a word from a certain location in main memory to a
certain GENERAL REGISTER
2) Perform some simple arithmetic operation (add, multiply
compare, etc) on data kept in a certain pair of registers
3) Store the contents of a certain register to a certain
location in main memory
4) Conditional jump (branch) or unconditional jump to a
pre-determined location in the program (this is really
just a memory location, as programs are loaded into
memory before execution).
Computer CPUs, either hardwired or using microcode are very limited,
every instruction (or micro-instruction) must be implemented by
dedicated digital logic electronic circuits, and so can't be too
complex.
Modern CPUs are usually designed according to the RISC (Reduced
Instruction Set Computer) paradigm, RISC CPUs can execute only a
small set of instructions, and so are simpler and faster. Of course
the RISC instructions are choosed so any needed operation can be
performed by a suitable sequence of them.
The older CISC paradigm (Complex Instruction Set Computers), takes
the opposite view, the CPU instruction set is made very large, every
instruction can be performed from every general register, in any
addressing mode (orthogonality), but even CISCs can execute only
relatively simple machine instructions.
A common opinion is that modern RISC CPUs are faster because it is
more efficient to build complex operations from smaller building
blocks, and the CISC wealth of instructions and modes is just slowing
the CPU and isn't useful to HLL compilers.
Other people maintain that RISCs are now faster than CISCs, not
because of the inherent superiority of their paradigm, but because
RISCs became fashionable and were built using modern technologies
(DEC NVAX chip is a modern fast CISC).
Addressing modes
----------------
Usually when we are loading/storing a register or jumping/branching
we need to specify a memory address, usually some (or all) of these
addressing modes are available:
Addressing mode Location of operand
----------------- ----------------------------------------------
Immediate/Literal In the machine instruction itself
Direct/Absolute Memory location whose address is specified in
the instruction
Indirect Memory location whose address is in a memory
location whose address is specified in the
machine instruction.
Register In a general register
Register-indirect Memory location whose address is in a register
whose name is specified in the instruction
Register-deferred Same as register-indirect
Autoincrement Register-indirect followed by an automatic increment
of the register contents by the operand size
Autodecrement Register-indirect preceded by an automatic decrement
of the register contents by the operand size
Displacement Memory location whose address is the sum of some
register's contents and a value specified in
the machine instruction
Relative Just like displacement mode, but the register
used is the program counter (with updated value)
Indexed Memory location whose address is the sum of some
register's contents and another register's
contents multiplied by a small constant that
equals a data item size
Indexed-indirect Memory location whose address is in the memory
location whose address is the sum of some
register's contents and another register's
contents multiplied by a small constant that
equals a data item size
Indexed-deferred Same as indexed-indirect
Remarks:
Displacement mode is sometimes called indexed mode, in that
case what we called indexed mode is probably unavailable.
The program counter is similar to other registers, but
changing its contents means performing a jump/branch.
The immediate/literal addressing mode is clearly not
suitable for memory variables.
FORTRAN programs and the underlying machine
-------------------------------------------
You can't really understand programming and how compilers works,
without examining the way the HLL source code is translated into
machine code.
Most compilers offer an option to generate an assembly listing of
the translated program, this is often the only way to check and
study how the compiler handles subtle points, e.g. if it does
shortcut-evaluations, handles properly DO-loop termination etc.
A small example program:
PROGRAM EXMPL
INTEGER INT
READ (*,*) INT
INT = INT + 1
WRITE (*,*) INT
END
And this is the translation to the classical assembly
language of the VAX, with a little editing to make
it more readable:
.TITLE EXMPL
.IDENT 01
0000 .PSECT $CODE
! PROGRAM EXMPL
0000 EXMPL::
0000 .WORD ^M
0002 MOVAL $LOCAL, R11
! READ (*,*) INT
0009 MNEGL #4, -(SP)
000C CALLS #1, FOR$READ_SL
0013 PUSHAL INT(R11)
0015 CALLS #1, FOR$IO_L_R
001C CALLS #0, FOR$IO_END
! INT = INT + 1
0023 ADDL3 #1, INT(R11), R12
! WRITE (*,*) INT
0027 MNEGL #1, -(SP)
002A CALLS #1, FOR$WRITE_SL
0031 PUSHL INT%R12
0033 CALLS #1, FOR$IO_L_V
003A CALLS #0, FOR$IO_END
! END
0041 MOVL #1, R0
0044 RET
.END
Symbolic memory locations
-------------------------
HLL programmers don't have to work directly with all these bewildering
addressing modes, the compiler uses them in a user-transparent way to
create the HLL language constructs.
HLLs (and even assembly) implements the following 'code mechanisms':
1) Explicit/Implicit statement labels, used as targets
for control transfer.
2) Memory variables - memory locations of suitable size,
that can be referenced by symbolic names instead of
memory addresses.
The most direct way to handle statement labels and memory variables
is to compute and keep the offsets (e.g. from the program beginning)
of the relevant machine instructions while translating the HLL to
machine code.
The two 'mechanisms' are actually implemented using suitable addressing
modes, and the above-mentioned 'offset bookeeping'.
The relocation problem
----------------------
A program usually contains many references to memory locations,
either as operands or as targets for jumps/branches.
It is convenient to have the compiler generate machine code that
can be easily combined with other pieces of machine code and form
one program, it is also convenient that a program will be able to
run no matter where in the main memory it was loaded.
What we are looking for is a way to write machine code that don't
use explicit memory addresses, that way the code will be able to
execute no matter where in memory it was loaded. That property is
called RELOCATABILITY.
To achieve that, all memory references contained in the program
must be either invariant to the memory location in which they will
be loaded, or some kind of an address translation process must
take place.
Possible methods to achieve location invariance are referring to
memory variables using relative mode, and using displacement mode
with some register (called base register) pointing to the beginning
of the loaded program.
The relative mode method works because when an instruction starts
executing the program counter points to it, if we make memory
references relative to the program counter they will be independant
of the place where the program was loaded.
In the displacement mode method we just have to load the base register
with the address of the loaded program beginning.
The relocation problem between modules
--------------------------------------
Using the relative/displacement addressing modes solves most of the
relocation problems in a single compilation unit.
In FORTRAN ....
There are some cases we will have to translate special memory references
by adding to them the memory address of the beginning of the program.
Symbol resolution
-----------------
Programs starts execution when the program counter is loaded with the
address of the main entry point,
Return to contents page