Road to Compiler: An exciting journey to make a compiler from scratch.

João Augusto Perin
16 min read · Jul 22, 2020

You know… there comes a time in every programmer's life when your ultra-frameworked, library-based web applications won't cut it anymore, and you start to think more and more about what you should do next.

As intriguing as it is to go out and make whatever you want, it is also comfortable to stay in the same place.

Then one of these days I just jumped out of bed and decided I wanted to make something great. I wandered around the internet for a couple of hours and found some nice projects, but then came the idea, from Bob Nystrom's outstanding book: Crafting Interpreters.

That was when I first started to document my progress and what I'd learned on Notion, and that has now become my first series on my Medium blog! 😃

So, are you ready?

P.S.: Most of my code is basically identical to the code in the book, aside from some comments and variable/method names. The main difference in this series is that I'll summarize most of what I learned from reading the book and explain it in my own words (simpler words than the author's).

How Do Languages Actually Work?

The process a programming language goes through is similar to climbing over a mountain. It has its challenges all the way up, and then all the way down. Let's take a high-level approach.

1. Scanning

The first step is scanning, also known as lexing, or (if you’re trying to impress someone) lexical analysis. They all mean pretty much the same thing.

A scanner (or “lexer”) takes in the linear stream of characters and chunks them together into a series of something closer to “words”. In programming languages, each of these words is called a token.

Some tokens are single characters, like ( and ,. Others may be several characters long, like numbers (123), string literals (“hi!”), and identifiers (min).

💡 “Lexical” comes from the Greek root “lex”, meaning “word”.

Some characters in a source file don’t actually mean anything.

Whitespace is often insignificant and comments, by definition, are ignored by the language. The scanner usually discards these, leaving a clean sequence of meaningful tokens.

2. Parsing

The next step is parsing. This is where our syntax gets a grammar — the ability to compose larger expressions and statements out of smaller parts.

A parser takes the flat sequence of tokens and builds a tree structure that mirrors the nested nature of the grammar. These trees have a couple of different names — “parse tree” or “abstract syntax tree” — depending on how close to the bare syntactic structure of the source language they are. In practice, language hackers usually call them “syntax trees”, “ASTs”, or often just “trees”.
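To make that less abstract, here is a minimal sketch (hypothetical classes, just for illustration, not the ones we will actually build for jLox) of what syntax tree nodes for an expression like a + b * c could look like in Java:

// A stripped-down family of expression nodes, purely illustrative.
abstract class Expr {}

// A leaf node: a literal value or a variable name.
class Literal extends Expr {
  final Object value;
  Literal(Object value) { this.value = value; }
}

// An interior node: an operator with two sub-expressions.
class Binary extends Expr {
  final Expr left;
  final String operator;
  final Expr right;
  Binary(Expr left, String operator, Expr right) {
    this.left = left;
    this.operator = operator;
    this.right = right;
  }
}

Since * binds tighter than +, the parser would build new Binary(a, "+", new Binary(b, "*", c)): the nesting of the tree encodes the precedence that was only implicit in the flat token stream.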

💡 Parsing has a long, rich history in computer science that is closely tied to the artificial intelligence community. Many of the techniques used today to parse programming languages were originally conceived to parse human languages by AI researchers who were trying to get computers to talk to us.

3. Static Analysis

The first two stages are pretty similar across all implementations. Now, the individual characteristics of each language start coming into play. At this point, we know the syntactic structure of the code — things like operator precedence and expression nesting — but we don’t know much more than that.

In an expression like a + b, we know we are adding a and b, but we don’t know what those names refer to. Are they local variables? Global? Where are they defined?

The first bit of analysis that most languages do is called binding or resolution. For each identifier we find out where that name is defined and wire the two together. This is where scope comes into play — the region of source code where a certain name can be used to refer to a certain declaration.

If the language is statically typed, this is when we type check. Once we know where a and b are declared, we can also figure out their types. Then if those types don’t support being added to each other, we report a type error.

If the language is dynamically typed, the type check will occur only at runtime.
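As a rough illustration of binding (a hypothetical sketch, not jLox code), resolution can be as simple as walking a chain of scopes from the innermost outward until the name is found:

import java.util.HashMap;
import java.util.Map;

// A toy scope chain: each scope knows the scope that encloses it.
class Scope {
  final Scope enclosing; // null for the global scope
  final Map<String, String> declarations = new HashMap<>();

  Scope(Scope enclosing) { this.enclosing = enclosing; }

  // Walk outward until we find the scope where the name was declared.
  String resolve(String name) {
    for (Scope scope = this; scope != null; scope = scope.enclosing) {
      String type = scope.declarations.get(name);
      if (type != null) return type;
    }
    return null; // unresolved: report an "undefined variable" error
  }
}

In a statically typed language, the value stored for each declaration could be its type, which is exactly what the type checker needs for an expression like a + b.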

4. Intermediate Representation

You can think of the compiler as a pipeline where each stage’s job is to organize the code in a way that makes the next stage simpler to implement. The front end of the pipeline is specific to the source language the user is programming in. The back end is concerned with the final architecture that the code will run on.

In the middle, the code may be stored in some intermediate representation (or “IR”) that isn’t tightly tied to either the source or destination forms (hence “intermediate”). Instead, the IR acts as an interface between these two languages.
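For a taste of what an IR can look like (purely illustrative, our little interpreter won't use one), many compilers lower nested expressions into a flat "three-address code" where each instruction performs at most one operation on named temporaries:

// Source expression:        a + b * c
// One possible three-address IR:
t1 = b * c
t2 = a + t1

Any front end that can produce this form can be paired with any back end that can consume it, which is the whole point of the middle layer.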

5. Optimization

Once we understand what the user’s program means, we are free to swap it out with a different program that has the same semantics but implements them more efficiently — we can optimize it.

A simple example is constant folding: if some expression always evaluates to the exact same value, we can do the evaluation at compile time and replace the code for the expression with its result. If the user typed in:

pennyArea = 3.14159 * (0.75 / 2) * (0.75 / 2);

We can do all of that arithmetic in the compiler and change the code to:

pennyArea = 0.4417860938;
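Using the sketchy Binary and Literal nodes from the parsing section above, a naive constant folder (again, just a sketch, not part of jLox) could walk the tree bottom-up and collapse any operation whose operands are both literals:

// Recursively fold Binary nodes whose operands are numeric literals.
static Expr fold(Expr expr) {
  if (!(expr instanceof Binary)) return expr;

  Binary bin = (Binary) expr;
  Expr left = fold(bin.left);
  Expr right = fold(bin.right);

  if (left instanceof Literal && right instanceof Literal) {
    // Assumes the literal values are boxed Doubles.
    double a = (double) ((Literal) left).value;
    double b = (double) ((Literal) right).value;
    switch (bin.operator) {
      case "+": return new Literal(a + b);
      case "-": return new Literal(a - b);
      case "*": return new Literal(a * b);
      case "/": return new Literal(a / b);
    }
  }

  return new Binary(left, bin.operator, right);
}

Run on the pennyArea expression above, every operand is a literal, so the whole tree collapses into the single Literal 0.4417860938.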

6. Code Generation

We have applied all of the optimizations we can think of to the user's program. The last step is converting it to a form the machine can actually run. In other words, generating code, where "code" refers to the kind of primitive assembly-like instructions a CPU runs and not the kind of "source code" a human might want to read.

If your compiler produces bytecode, your work isn’t over once that’s done. Since there is no chip that speaks that bytecode, it’s your job to translate. Again, you have two options. You can write a little mini-compiler for each target architecture that converts the bytecode to native code for that machine. You still have to do work for each chip you support, but this last stage is pretty simple and you get to reuse the rest of the compiler pipeline across all of the machines you support. You’re basically using your bytecode as an intermediate representation.
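To make "bytecode" a bit more concrete, here is a toy stack-based dispatch loop (hypothetical opcodes, far simpler than any real VM) showing how an interpreter executes such instructions one at a time:

// Two made-up opcodes: push a constant, add the top two stack values.
static final int OP_CONSTANT = 0;
static final int OP_ADD = 1;

static double run(int[] code, double[] constants) {
  double[] stack = new double[256];
  int top = 0;
  for (int ip = 0; ip < code.length; ip++) {
    switch (code[ip]) {
      case OP_CONSTANT:
        stack[top++] = constants[code[++ip]]; // operand: constant index
        break;
      case OP_ADD:
        double b = stack[--top];
        stack[top - 1] = stack[top - 1] + b;
        break;
    }
  }
  return stack[top - 1];
}

For example, run(new int[] {OP_CONSTANT, 0, OP_CONSTANT, 1, OP_ADD}, new double[] {1.0, 2.0}) evaluates 1 + 2 and returns 3.0.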

In both cases, for all but the basest of low-level languages, we usually need some services that our language provides while the program is running. For example, if the language automatically manages memory, we need a garbage collector going in order to reclaim unused bits. If our language supports “instance of” tests so you can see what kind of object you have, then we need some representation to keep track of the type of each object during execution.

All of this stuff is going at runtime, so it’s called, well, the “runtime”. In a fully compiled language, the code implementing the runtime gets inserted directly into the resulting executable. In, say, Go, each compiled application has its own copy of Go’s runtime directly embedded in it. If the language is run inside an interpreter or VM, then the runtime lives there. This is how most implementations of languages like Java, Python, and JavaScript work.

💡 Lexemes vs. Tokens: lexemes are all of the small pieces of information we find when scanning the source code, every word or symbol that's legal in the actual grammar of the language. Tokens, on the other hand, are lexemes bundled together with extra data, such as their token type (like WHILE or IF in the case of Lox) and their location in the source.

The Scanner

After all of these concepts overflowing our heads, it's time to truly build the scanner. Take a quick glance at all the code we need, and then I'll go through it step by step, explaining everything we've done so far.

package com.joaoaugustoperin.jLox;

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/*
 * A static import isn't the best practice in the world, but here
 * it kinda makes sense. If you don't know about static importing,
 * I suggest you look it up. To summarize, this keeps me from writing
 * TokenType [dot] something every time I want some static content from
 * that class. And since TokenType is an enum, I would do this many
 * times across this file. Pretty handy, huh?
 */
import static com.joaoaugustoperin.jLox.TokenType.*;

class Scanner {
  private final String source;
  private final List<Token> tokens = new ArrayList<>();
  private static final Map<String, TokenType> keywords;

  private int start = 0;
  private int current = 0;
  private int line = 1;

  static {
    keywords = new HashMap<>();
    keywords.put("and", AND);
    keywords.put("class", CLASS);
    keywords.put("else", ELSE);
    keywords.put("false", FALSE);
    keywords.put("for", FOR);
    keywords.put("fun", FUN);
    keywords.put("if", IF);
    keywords.put("nil", NIL);
    keywords.put("or", OR);
    keywords.put("print", PRINT);
    keywords.put("return", RETURN);
    keywords.put("super", SUPER);
    keywords.put("this", THIS);
    keywords.put("true", TRUE);
    keywords.put("var", VAR);
    keywords.put("while", WHILE);
  }

  Scanner(String source) {
    this.source = source;
  }

  List<Token> scanTokens() {
    while (!isAtEnd()) {
      // We are at the beginning of the next lexeme.
      start = current;
      scanToken();
    }

    tokens.add(new Token(EOF, "", null, line));
    return tokens;
  }

  private void scanToken() {
    char c = advance();
    switch (c) {
      case '(': addToken(LEFT_PAREN); break;
      case ')': addToken(RIGHT_PAREN); break;
      case '{': addToken(LEFT_BRACE); break;
      case '}': addToken(RIGHT_BRACE); break;
      case ',': addToken(COMMA); break;
      case '.': addToken(DOT); break;
      case '-': addToken(MINUS); break;
      case '+': addToken(PLUS); break;
      case ';': addToken(SEMICOLON); break;
      case '*': addToken(STAR); break;

      case '!': addToken(match('=') ? BANG_EQUAL : BANG); break;
      case '=': addToken(match('=') ? EQUAL_EQUAL : EQUAL); break;
      case '<': addToken(match('=') ? LESS_EQUAL : LESS); break;
      case '>': addToken(match('=') ? GREATER_EQUAL : GREATER); break;

      case '/':
        if (match('/')) {
          // A comment goes until the end of the line.
          while (peek() != '\n' && !isAtEnd()) advance();
        } else {
          addToken(SLASH);
        }
        break;

      case ' ':
      case '\r':
      case '\t':
        // Ignore whitespace.
        break;

      case '\n':
        line++;
        break;

      case '"': string(); break;

      default:
        if (isDigit(c)) {
          number();
        } else if (isAlpha(c)) {
          identifier();
        } else {
          Lox.error(line, "Unexpected character.");
        }
        break;
    }
  }

  private void identifier() {
    while (isAlphaNumeric(peek())) advance();

    // See if the identifier is a reserved word.
    String text = source.substring(start, current);
    TokenType type = keywords.get(text);
    if (type == null) type = IDENTIFIER;
    addToken(type);
  }

  private void number() {
    while (isDigit(peek())) advance();

    // Look for a fractional part.
    if (peek() == '.' && isDigit(peekNext())) {
      // Consume the "."
      advance();
      while (isDigit(peek())) advance();
    }

    addToken(NUMBER,
        Double.parseDouble(source.substring(start, current)));
  }

  private void string() {
    while (peek() != '"' && !isAtEnd()) {
      if (peek() == '\n') line++;
      advance();
    }

    // Unterminated string.
    if (isAtEnd()) {
      Lox.error(line, "Unterminated string.");
      return;
    }

    // The closing ".
    advance();

    // Trim the surrounding quotes.
    String value = source.substring(start + 1, current - 1);
    addToken(STRING, value);
  }

  private boolean match(char expected) {
    if (isAtEnd()) return false;
    if (source.charAt(current) != expected) return false;

    current++;
    return true;
  }

  private char peek() {
    if (isAtEnd()) return '\0';
    return source.charAt(current);
  }

  private char peekNext() {
    if (current + 1 >= source.length()) return '\0';
    return source.charAt(current + 1);
  }

  private boolean isAlpha(char c) {
    return (c >= 'a' && c <= 'z') ||
           (c >= 'A' && c <= 'Z') ||
           c == '_';
  }

  private boolean isAlphaNumeric(char c) {
    return isAlpha(c) || isDigit(c);
  }

  private boolean isDigit(char c) {
    return c >= '0' && c <= '9';
  }

  private boolean isAtEnd() {
    return current >= source.length();
  }

  private char advance() {
    current++;
    return source.charAt(current - 1);
  }

  private void addToken(TokenType type) {
    addToken(type, null);
  }

  private void addToken(TokenType type, Object literal) {
    String text = source.substring(start, current);
    tokens.add(new Token(type, text, literal, line));
  }
}

I know, there’s so much thing going on right here, but don’t panic, at least, not yet. Grab a coffee and here we go!

Lexemes & Tokens

In my last post here, we talked a little bit about the difference between these two concepts, but now we need to make it crystal clear.

Take this simple line of Lox code:

var language = "lox";

Here, var is the keyword for declaring a variable. That three-character sequence “v-a-r” means something. But if we yank three letters out of the middle of language, like “g-u-a”, those don’t mean anything on their own.

That’s what lexical analysis is about. Our job is to scan through the list of characters and group them together into the smallest sequences that still represent something. Each of these blobs of characters is called a lexeme. In that example line of code, the lexemes are:

However, in the process of grouping character sequences into lexemes, we also stumble upon some other useful information. When we take the lexeme and bundle it together with that other data, the result is a token.

The parser could categorize tokens from the raw lexeme by comparing the strings, but after all, string comparison ends up looking at individual characters, and isn’t that the scanner’s job?

That’s when we DEFINE THE TOKEN TYPES. It isn’t like a rule, or something you NEED to do, in order to make a compiler, but it IS needed in order to make a GREAT compiler.😉

To do so, let’s make a Java Enum:

package com.joaoaugustoperin.jLox;

enum TokenType {
  // Single-character tokens.
  LEFT_PAREN, RIGHT_PAREN, LEFT_BRACE, RIGHT_BRACE,
  COMMA, DOT, MINUS, PLUS, SEMICOLON, SLASH, STAR,

  // One or two character tokens.
  BANG, BANG_EQUAL,
  EQUAL, EQUAL_EQUAL,
  GREATER, GREATER_EQUAL,
  LESS, LESS_EQUAL,

  // Literals.
  IDENTIFIER, STRING, NUMBER,

  // Keywords.
  AND, CLASS, ELSE, FALSE, FUN, FOR, IF, NIL, OR,
  PRINT, RETURN, SUPER, THIS, TRUE, VAR, WHILE,

  EOF
}

A token needs its own representation, so let's make one through a Java class:

package com.joaoaugustoperin.jLox;

class Token {
  final TokenType type;
  final String lexeme;
  final Object literal;
  final int line;

  Token(TokenType type, String lexeme, Object literal, int line) {
    this.type = type;
    this.lexeme = lexeme;
    this.literal = literal;
    this.line = line;
  }

  public String toString() {
    return type + " " + lexeme + " " + literal;
  }
}

Now we get to the trickiest part of building the scanner: the Scanner class itself, with its properties and methods. But let's go step by step. Breathe in, and here we go!

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import static com.joaoaugustoperin.jLox.TokenType.*;

class Scanner {
  private final String source;
  private final List<Token> tokens = new ArrayList<>();

  Scanner(String source) {
    this.source = source;
  }
}

To begin with, we implement the Scanner class in its simplest form. It:

  • Takes the entire source code (as a String).
  • Creates a List to hold every Token we find in our source code.

💡

A static import isn't the best practice in the world, but here it kinda makes sense. If you don't know about static importing, I suggest you look it up. To summarize, this keeps me from writing TokenType [dot] something every time I want some static content from that class. And since TokenType is an enum, I would do this many times across this file. Pretty handy, huh?

And here begins the engine of our Scanner, the method that will control everything… the scanTokens method:

private int start = 0;
private int current = 0;
private int line = 1;

List<Token> scanTokens() {
  while (!isAtEnd()) {
    // We are at the beginning of the next lexeme.
    start = current;
    scanToken();
  }

  tokens.add(new Token(EOF, "", null, line));
  return tokens;
}

We create three more fields in the Scanner class to keep track of where we are in the source code.

The scanner works its way through the source code, adding tokens, until it runs out of characters. When it's done, it appends one final "end of file" (EOF) token. That isn't strictly needed, but it makes our parser a little cleaner.

The start and current fields are offsets in the string — the first character in the current lexeme being scanned, and the character we’re currently considering.

💡 For instance, consider those two variables. If we were scanning the word "sun" at the very beginning of a file, start would stay pointing at the 's' (offset 0), while current would advance past the 'u' and the 'n', stopping at offset 3, one past the end of the lexeme.

The other field tracks what source line current is on so we can produce tokens that know their location.

Now we define a little helper function that tells us whether we've consumed all the characters of the source file:

private boolean isAtEnd() {
return current >= source.length();
}

Recognizing Lexemes

That’s where it starts to get messy, be aware!

Each turn of the loop, we scan a single token. This is the real heart of the scanner. We’ll start simple. Imagine if every lexeme was only a single character long. All you need to do is consume the next character and pick a token type for it. Several lexemes are only a single character in Lox, so let’s start with those:

private void scanToken() {
  char c = advance();
  switch (c) {
    case '(': addToken(LEFT_PAREN); break;
    case ')': addToken(RIGHT_PAREN); break;
    case '{': addToken(LEFT_BRACE); break;
    case '}': addToken(RIGHT_BRACE); break;
    case ',': addToken(COMMA); break;
    case '.': addToken(DOT); break;
    case '-': addToken(MINUS); break;
    case '+': addToken(PLUS); break;
    case ';': addToken(SEMICOLON); break;
    case '*': addToken(STAR); break;
  }
}

The variable c holds the current character passing through the scanner; then, via a Java switch statement, we hand the matching token type to the addToken helper function below:

private void addToken(TokenType type) {
  addToken(type, null);
}

private void addToken(TokenType type, Object literal) {
  String text = source.substring(start, current);
  tokens.add(new Token(type, text, literal, line));
}

Another helper function we use right here:

private char advance() {
  current++;
  return source.charAt(current - 1);
}

It isn’t complicated, advance to the next character and return the value of the time it is called (that goes directly into that c variable of the switch statement).

Before we get too far in, let’s take a moment to think about errors at the lexical level. What happens if a user throws a source file containing some characters Lox doesn’t use, like @#^ at our interpreter? Right now, those characters get silently added to the next token. That ain’t right.

Ok, as the author told us, here we go:

default:
  Lox.error(line, "Unexpected character.");
  break;

Note that the erroneous character was still consumed by the earlier call to advance(), meaning it has already been put into the c variable. That's important so that we don't get stuck in an infinite loop (while still logging the error correctly). And we keep scanning even after that: it isn't the scanner's responsibility to stop at syntax errors, it just digests characters and alerts the user that some illegal characters were inserted.

We have single-character lexemes covered, but that doesn’t cover all of Lox’s operators. What about !? It’s a single character, right? Sometimes, yes, but not when it’s followed by an =. In that case, it should be a != lexeme. Likewise, <, >, and = can all be followed by =.

case '!': addToken(match('=') ? BANG_EQUAL : BANG); break;
case '=': addToken(match('=') ? EQUAL_EQUAL : EQUAL); break;
case '<': addToken(match('=') ? LESS_EQUAL : LESS); break;
case '>': addToken(match('=') ? GREATER_EQUAL : GREATER); break;

As you can see, we use another helper function called match:

private boolean match(char expected) {
  if (isAtEnd()) return false;
  if (source.charAt(current) != expected) return false;

  current++;
  return true;
}

It basically takes an expected character and returns true if the actual current character matches it, consuming it in the process. On the other hand, it returns false if we've reached the end of the file or if the current character is different from the expected one.

We’re still missing one operator, /. That one needs a little special handling because comments begin with a slash too. So here it goes:

case '/':
  if (match('/')) {
    // A comment goes until the end of the line.
    while (peek() != '\n' && !isAtEnd()) advance();
  } else {
    addToken(SLASH);
  }
  break;

As you probably noticed, here we go with another helper function called peek, and here it is:

private char peek() {
  if (isAtEnd()) return '\0';
  return source.charAt(current);
}

It is exactly like advance, except it doesn’t consume the character, meaning it doesn’t increase the current counter.

Now’s a good time to skip over those other meaningless characters, newlines and whitespace, too:

case ' ':
case '\r':
case '\t':
  // Ignore whitespace.
  break;

case '\n':
  line++;
  break;

Now that we’re comfortable with longer lexemes, we’re ready to tackle literals. We’ll do strings first, since they always begin with a specific character, :

case '"': string(); break;

That will obviously call the method string(), that’s defined right here:

private void string() {
  while (peek() != '"' && !isAtEnd()) {
    if (peek() == '\n') line++;
    advance();
  }

  // Unterminated string.
  if (isAtEnd()) {
    Lox.error(line, "Unterminated string.");
    return;
  }

  // The closing ".
  advance();

  // Trim the surrounding quotes.
  String value = source.substring(start + 1, current - 1);
  addToken(STRING, value);
}

The code explains itself, but the comments above the key lines are there to help you a little bit.

Now…to the numbers!

To recognize the beginning of a number lexeme, we look for any digit. It’s kind of tedious to add cases for every decimal digit, so we’ll stuff it in the default case instead:

default:
  if (isDigit(c)) {
    number();
  } else {
    Lox.error(line, "Unexpected character.");
  }
  break;

And of course, here we go with the two helper methods, isDigit() and number(), as well as another helper function we'll need inside number(): peekNext().

private boolean isDigit(char c) {
  return c >= '0' && c <= '9';
}

private char peekNext() {
  if (current + 1 >= source.length()) return '\0';
  return source.charAt(current + 1);
}

private void number() {
  while (isDigit(peek())) advance();

  // Look for a fractional part.
  if (peek() == '.' && isDigit(peekNext())) {
    // Consume the "."
    advance();
    while (isDigit(peek())) advance();
  }

  addToken(NUMBER,
      Double.parseDouble(source.substring(start, current)));
}

Here comes a little boring part: You might think we could match keywords like or in the same way we handle multiple-character operators like <=:

case 'o':
  if (peek() == 'r') {
    addToken(OR);
  }
  break;

Consider what would happen if a user named a variable orchid: the scanner would see the o and r and emit an OR keyword token, mangling the rest of the identifier. For further information, I advise you to read this section in the source book for this project.

Instead, we assume any lexeme starting with a letter or underscore is an identifier:

default:
  if (isDigit(c)) {
    number();
  } else if (isAlpha(c)) {
    identifier();
  } else {
    Lox.error(line, "Unexpected character.");
  }
  break;

That will call:

private void identifier() {
  while (isAlphaNumeric(peek())) advance();

  // See if the identifier is a reserved word.
  String text = source.substring(start, current);
  TokenType type = keywords.get(text);
  if (type == null) type = IDENTIFIER;
  addToken(type);
}

And use these helpers:

private boolean isAlpha(char c) {
  return (c >= 'a' && c <= 'z') ||
         (c >= 'A' && c <= 'Z') ||
         c == '_';
}

private boolean isAlphaNumeric(char c) {
  return isAlpha(c) || isDigit(c);
}

So now, we will define and use keywords in our Scanner class:

private static final Map<String, TokenType> keywords;

static {
  keywords = new HashMap<>();
  keywords.put("and", AND);
  keywords.put("class", CLASS);
  keywords.put("else", ELSE);
  keywords.put("false", FALSE);
  keywords.put("for", FOR);
  keywords.put("fun", FUN);
  keywords.put("if", IF);
  keywords.put("nil", NIL);
  keywords.put("or", OR);
  keywords.put("print", PRINT);
  keywords.put("return", RETURN);
  keywords.put("super", SUPER);
  keywords.put("this", THIS);
  keywords.put("true", TRUE);
  keywords.put("var", VAR);
  keywords.put("while", WHILE);
}

And that’s it! We finally made our Scanner class! Obviously it ins’t the most precise and bug-free of all, but it actually does it work, it does what it mean to do.😄

So, congratulations to us! 👏👏👏
