Interpreting Compilers and Compiling Interpreters

By yours truly

Consider this C++ snippet
int main() {
    return 57;
}
                    
Let's compile this and execute it
$ g++ -o hello hello.cpp
$ ./hello
$ echo $? # shows return value of last command
  57
                    

How did our code in "English" turn into 0s and 1s?

What happened in between?

What does it mean to compile?

In simple terms, a Compiler is a translator that takes code written in a high-level language (return blablabla) and converts it into a low-level language that the computer can execute (01010101...)

This is a one-time process, and the output is a file that can be executed multiple times

But this isn't the only translator in the market

Enter the Interpreter

Instead of translating the entire code at once, an Interpreter translates the code line-by-line, executing it as it goes

with some trade-offs...

Compiler vs. Interpreter

Compiler                                 | Interpreter
-----------------------------------------|------------------------------------------
Compiles the whole code before execution | Executes code line-by-line
Faster execution                         | Slower execution
Displays all errors after compilation    | Stops at the first error
No source code needed after compilation  | Requires source code every time
Used in production (C, C++, C#)          | Used in development (Python, Ruby, Perl)

Interpretation Flow

  1. Lexing – Convert code into tokens.
  2. Parsing – Generate AST.
  3. Semantic Analysis – Check correctness.
  4. Execution:
    • Interpreter – Executes code line by line.
    • Bytecode – Runs on VM (e.g., CPython, JVM).
    • JIT – Converts hot code to machine code.

Compilers

The compiler converts the code into assembly language, which the assembler then converts into machine code
What does the assembly code for our C++ program look like? We can run the following command to generate it

g++ -S -o hello.s hello.cpp
C++ Code
int main() {
    return 57;
}
Assembly
	.file	"hello.cpp"
	.text
	.globl	main
	.type	main, @function
main:
.LFB0:
	.cfi_startproc
	endbr64
	; and so on...
            

A brief look at assembly


Assembly language is a low-level programming language that is a human-readable version of machine code

It uses mnemonics to represent machine instructions (like MOV, ADD, SUB)

It requires manual handling of CPU registers and memory addresses

Each assembly language only works for a specific CPU architecture

x86-64 for Intel

ARM for Apple M-series

Compilers must generate assembly code that is compatible with the target architecture and OS

g++ will give a.out in Linux, a.exe in Windows

Each OS exposes a different set of syscalls

Headers change, from <windows.h> to <unistd.h>

and so do file system conventions, from "C:\\Users\\" to "/home/user/"

Data can be stored in memory and can be accessed using memory addresses

Reading and storing data in this way is slow

A much faster alternative is registers, the internal storage locations of the processor

Registers are the processor's personal notepad for storing information temporarily, though few in number

Examples - RAX, ESP, DS

System Calls are the API at the interface between user space and kernel space

Every system call has a unique number associated with it

This number is stored in the EAX or RAX register, with arguments passed in other general-purpose registers and on the stack as needed

The syscall instruction is used to trigger a system call

Let's take a look at the (simplified) assembly equivalent of our C++ snippet

global _start
_start:
   mov rax, 60
   mov rdi, 57
   syscall
            

Line by line:

global _start tells the assembler to expose _start as a global label so that the linker can recognise it as the entry point of the program.

_start: defines the _start label, the entry point of our program.

mov rax, 60 moves the value 60 into the register rax. In Linux x86-64, 60 is the syscall number for exit.

mov rdi, 57 moves the value 57 into the register rdi. The exit syscall takes its first argument, the exit status of the program, in rdi.

syscall triggers the system call. Since rax contains 60, this executes the exit(57) syscall, terminating the program with exit code 57.

Let's compile and run this assembly code

The assembler, nasm, translates the assembly code into machine code and generates an object file

nasm -f elf64 -o hello.o hello.s

The linker, ld, takes the object file and links it with any necessary libraries to create an executable file

ld -o hello hello.o
Questions about how variables are stored in memory and how mathematical operations happen still remain unanswered, but we will get back to them!

Compilation Flow

  1. Lexing – Convert code into tokens.
  2. Parsing – Generate AST.
  3. Semantic Analysis – Check correctness.
  4. IR Generation – Convert AST to IR.
  5. Optimization – Improve performance.
  6. Code Generation – Convert IR to assembly.
  7. Assembly to Machine Code:
    • Assembler – Assembly → Object code.
    • Linker – Creates executable.
  8. Execution – CPU runs machine code.
Symbol Table is a data structure used by compilers to store information about the variables, functions, and other identifiers in the program. This information is gathered during the analysis phase and used during the code generation phase.

During Lexing, entries for identifiers are added to the symbol table

During Parsing, information about the type of the identifier is added

During Semantic analysis, the symbol table is used to check for undeclared variables, type mismatches, etc.

During IR Generation, Symbol Table is used to generate intermediate representation of the code

During Optimization, the symbol table is used to track the usage of variables and functions

During Code Generation, the symbol table is used to generate assembly code using the addresses of the variables and functions

Lexing is the process of converting the code into tokens while removing comments and whitespace (spaces, tabs, newlines, etc.)

Types of Tokens in a Compiler

  • Keywords (e.g., if, else, while)
  • Identifiers (e.g., variableName, functionName)
  • Operators (e.g., +, -, *, /)
  • Literals (e.g., 42, "Hello", true)
  • Punctuation (e.g., ;, ,, {}, ())

Tokenization of C Code

Code

int main()
{
    int x, a=5, b=7;
    x = 4*a + b;
    printf("x is %d", x);
}
      

Keywords: int, int

Identifiers: main, x, a, b, x, a, b, printf, x

Operators: =, =, =, *, +

Literals: 5, 7, 4, "x is %d"

Punctuation: (, ), {, ,, ,, ;, ;, (, ,, ), ;, }

Parsing is the process of converting the tokens into a tree-like structure called the Abstract Syntax Tree (AST)

The AST is a representation of the code that is easier to work with

It is used to check the correctness of the code and generate the intermediate representation

This is done using Context-Free Grammars

The grammar defines the syntax of the language

The grammar consists of terminals and non-terminals, production rules, and a start symbol

Parse Tree, Code, and AST for return(57);

Code

return(57);
      
Parse Tree

            stmt
             │
         returnStmt
   ┌──────┬──┴───┬──────┬─────┐
RETURN  LPAREN  expr  RPAREN  SEMI
                 │
               NUMBER
                 │
                 57
      
AST

  ReturnStmt
     │
    57
      

Parse Tree, Code, and AST for
x = a + b * 7;

Code

x = a + b * 7;
      
Parse Tree
           stmt
            │
      assignmentStmt
   ┌────┬───┴────┬─────┐
   ID   =      expr   SEMI
   │        ┌───┼───┐
   x      expr  +  expr
           │    ┌───┼────┐
           ID expr  *  NUMBER
           │   │        │
           a   ID       7
               │
               b
AST

  Assign(x)
      │
      +
     / \
    a   *
       / \
      b   7
      
Semantic Analysis is the process of checking the correctness of the code

It does type checking, scope checking, name resolution, control flow analysis and type inference

It uses the AST and the symbol table to perform these checks

Intermediate Representation (IR) is a representation of the code that is easier to work with than the AST and is used to perform optimizations

It allows for machine-independent optimizations

Three Address Code (TAC) is a form of IR that uses at most three operands per instruction, with only one operator
Code
x = (a + b) * c;
                            
Three Address Code
t1 = a + b
t2 = t1 * c
x = t2
                            

Each name has a pointer to an entry in the symbol table

Another Example

TAC greatly simplifies nested control flow such as loops and if-else statements


While a < b
  If c < d Then
    x = (y + z) * w
  Else
    x = (y - z) * w
  End If
End While
                            
L1:  IfFalse a < b goto L3
     IfFalse c < d goto L2
     t1 = y + z
     t2 = w * t1
     x = t2
     goto L1
L2:  t1 = y - z
     t2 = w * t1
     x = t2
     goto L1
L3:  ...
Before conversion to machine code, the TAC might be partitioned into Basic Blocks

A basic block is a sequence of instructions with a single entry point and a single exit point; its instructions always execute together, from first to last

This allows for better optimization

L1:  IfFalse a < b goto L3
     IfFalse c < d goto L2
     t1 = y + z
     t2 = w * t1
     x = t2
     goto L1
L2:  t1 = y - z
     t2 = w * t1
     x = t2
     goto L1
L3:  ...
B1  IfFalse a < b goto B5
B2  IfFalse c < d goto B4
B3  t1 = y + z
    t2 = w * t1
    x = t2
    goto B1
B4  t1 = y - z
    t2 = w * t1
    x = t2
    goto B1
B5  ...
      

Note that flow between different blocks is stored in a graph called the Control Flow Graph

Each node is a basic block and each edge is a possible flow of control

Instructions inside each node are stored in a linked list, so their order can be changed with just pointer manipulation when we optimise this IR

Basic Block Optimisation

This involves removing redundant instructions, dead code, and constant folding

The observable behaviour of the program should not change after these optimizations

... and it needs to be quick

Common Subexpression Elimination eliminates redundant calculations

This is done by using a directed acyclic graph (DAG) to store the results of the calculations

t1 = a + b
t2 = a + b
# becomes 
t1 = a + b
t2 = t1
Check whether a new expression corresponds to a pre-existing node with the same children and operator

If it does, that part of the tree can be reused and we can copy the value of the pre-existing node


a = b + c
b = a - d
d = a - d

#becomes

a = b + c
b = a - d
d = b

      d     b
       \   /
       ( - )
       /   \
      a     d
       \   /
       ( + )
       /   \
      b     c
                            
Copy Propagation and Constant Folding involve replacing variables with their known values and evaluating constant expressions

p = 5               p = 5
q = p + 3           q = 8
r = q + 2           r = 10
p = q * r           p = 80
                    
Dead Code Elimination involves removing code that doesn't affect the program's output

This is done by checking if the value of a variable is used after it is assigned


int foo(void) {
  int a = 24;
  int b = 25; /* Assignment to dead variable */
  int c;
  c = a * 4;
  return c;
  b = 24; /* Unreachable code */
  return 0;
}
                    

Removing dead code can make the program smaller and faster

Algebraic Transformations involve simplifying expressions using algebraic rules


a = b + 0           a = b
b = c * 1           b = c
c = d - 0           c = d
t = x**2            t = x * x
t = x * 2           t = x + x
t = x * 8           t = x << 3
                    

Object Code Generation

This is the process of converting IR to assembly code; it involves:

  • Mapping IR instructions to machine instructions
  • Register allocation and assignment
  • Machine-specific optimizations to take advantage of the target architecture
Mappings for each instruction can be direct or complex depending on the architecture and data types (int, float, double ...)

Basic arithmetic and bitwise operations are usually direct (provided the registers are loaded properly)

For example, the IR instruction t1 = a + b can map to a mov followed by a single add in x86-64 (mov eax, a then add eax, b), since x86-64 add takes two operands

Complex operations like division are split into multiple instructions (as we will soon see...)

Register Allocation is the process of assigning variables to registers

Registers are faster than memory, so the goal is to keep as many variables in registers as possible

If there are more variables than registers, some variables will be stored in memory, this is called spilling

Machine Specific Optimizations are optimizations that take advantage of the target architecture


Instruction Scheduling is the process of reordering instructions to minimize the number of cycles

This is done to take advantage of pipelining and out-of-order execution

mov eax, [mem]  ; Load from memory
add eax, 1      ; Add 1
sub ebx, 6      ; Subtract 6
mov [mem], eax  ; Store to memory
                    

The above code can be reordered to take advantage of pipelining

mov eax, [mem]  ; Load from memory
sub ebx, 6      ; Subtract 6, perform this while eax loads
add eax, 1      ; Add 1
mov [mem], eax  ; Store to memory
                    

An independent instruction can execute while the memory load is still in flight

Vectorisation is the process of converting scalar operations into vector operations
mov eax, [arr1]  ; Load from memory
mov ebx, [arr2]  ; Load from memory
add eax, ebx    ; Add
mov [arr3], eax ; Store to memory
                    

The operations are repeated for each element in the array

The above code can be vectorised to take advantage of SIMD (Single Instruction, Multiple Data) instructions

movdqu xmm0, [arr1]   ; Load 4 packed integers from arr1
movdqu xmm1, [arr2]   ; Load 4 packed integers from arr2
paddd xmm0, xmm1      ; Add packed 32-bit integers
movdqu [arr3], xmm0   ; Store result to arr3
                    
Loop Unrolling is the process of duplicating the loop body to reduce the overhead of the loop

This is done to reduce the number of branches and improve instruction cache usage

The loop body is duplicated and the loop counter is incremented by the number of unrolled iterations

Inlining is the process of replacing a function call with the function body

This is done to reduce the overhead of the function call

Division v/s GCC

Division is a very computationally expensive operation

So, GCC does black magic (math) to convert division into multiplication and bit shifts

A simple function which divides by 2006 looks something like this

divideby2006(unsigned int):
        mov     eax, edi
        imul    rax, rax, 1096222959
        shr     rax, 41
        ret
n/2006 = ( n * 1096222959 ) >> 41 (How?)