Writing a compiler from scratch

Disclaimer: This is my first story. This means it may be non-informative, dull or both. Feedback is appreciated! (You can just email me. :)

Update: Per Vognsen currently has a project called Bitwise wherein compiler construction is widely discussed. For more in-depth background on compiler development, I really recommend watching some of his videos.

Writing my own compiler is something that’s always been on my bucket list (the same holds for writing a kernel). To check this one off, I have recently done my university’s Compiler construction course. The gist of this course is of course very straightforward: “In this course you are going to build a compiler.”

We were given a language called SPL to compile, which has pairs, global variables, lists and polymorphic functions (and more).

Off we go (lexer and parser) ¶

Our teachers had advised us to use a functional programming language, so that we could use combinators and pattern matching. I was not very familiar with Haskell and Clean, so we decided to use the Rust programming language for our project, because it had pattern matching and looked like a language that would be fairly easy to learn.

The first part was easy: Build a lexer. I was determined to always provide the user with helpful error messages, so we constructed our tokens like this:

pub enum Token {
    // Literal terminator symbols
    Eq, Semicolon, DoubleColon, LeftParen, RightParen, LeftBrace,
    RightBrace, LeftBracket, RightBracket, Comma, Dot, Arrow,

    // Keywords
    Var, Void, Int, Bool, Char, If, Else, While, Return, False,
    True, Hd, Tl, Fst, Snd,

    // (rest omitted)
}

pub struct TokenAndSpan(pub Token, pub Span);

This way we could always produce error messages that contained the code where it went wrong. After tokenizing the input (which was trivially easy) we went on to building the parser. We first build a suitable AST structure. We then carefully adapted our grammar so that it would be fully deterministic, and so that all the operators had the right associativity. Our tactic was to make it so that our grammar was a one-on-one representation of our parser.

This turned out to work very well. We did not need any kind of Shunting-yard algorithm which would add an extra layer of complexity to our parser. Just recursive descent.

This resulted in a parser.rs file of 1114 lines. Seems like a lot? Yes, but this code was so straightforward that it could be considered boilerplate. For what it’s worth, we could probably have generated the code from the grammar. The objective was to write a parser however, not a parser generator, so we could not care less.

The advanced-ness of our TokenAndSpans added a lot of extra lines, but they paid off. Our parser threw better errors than gcc’s (at that time):

error: expected ';' or an expression
4:5     } else {
        ^

Our parser was fast. Complete. No errors. — So far so good.

Type checking ¶

This is where stuff gets tricky.

At this stage you go through the AST and of every value you determine the exact type. We decorated every value in our AST with a type field, which kinda looked like this:

pub enum Type {
    /// Type variable, a.k.a. this type is not yet known. After type checking
    /// stage, no value should be of this type
    Variable(usize),

    /// Basic types
    Int, Char,

    /// List of values of `<Type>`
    List(Box<Type>),

    // (rest omitted)

Usually, this process is cut in two steps:

First you look for every value name in your [HIR]. If a value is a literal, you know the type. If it’s not, you assign a type variable to the value. This means that you now know that all values have some type, although you don’t know what it is yet.
Now it is time to fill in all the types. This is done by going through the code and inferring the type. When you find x = 3, you know that x is an Int. If further on you see that y = x : [], you know that y is a List(Int). Combining a pair of types is called type unification. This can also fail, of course. When it does, you will get a (sometimes useful) error.

We found out that doing this correctly was hard. Implementing type inference is easy when you are dealing with primitives. But our language has polymorphic funtions. This means that we must also be able to infer function types. For example:

// `insert(x, xs)`  inserts an element `x` into a list `xs`
insert(x, xs) :: a [a] -> [a] {
    // function body omitted
}

Our problem was that this may ultimately result in multiple different functions, where one may act on Ints, while another may act on Bools. Looking back at when we wrote the code, we could have easily implemeted monomorphization, but here was a lot of time pressure and in no time our infer.rs file looked like old cat vomit. We did not finish this compiler stage in time.

With regret we ditched the support for polymorphic funtions, and went on with writing the code generation stage.

LLVM codegen ¶

This stage our goal was to take our type-checked code and convert it to LLVM intermediate respresentation, and let llc compile it to platform specific bytecode.

After our defeat with the previous phase we were quite nervous to start. Lucky for us, coding this stage turned out to be quite easy. All constructs in the AST could be literally translated into corresponding LLVM IR.

In less than 1000 lines of code all features were translated, including dynamic allocation for lists. Garbage collection was not part of the course, so we did not implement this in any way. All malloc‘d list elements would never be free‘d, but (of course) we did not care.

LLVM is a slow compiler, but the bytecode that was generated was fast. I wrote a very naive factorization program in a couple of languages, including SPL, and did some benchmarks:

Compiler	Execution time
Pypy (version 2.2.1, `-O`)	8.2 s
Haskell (version 7.10.3, `-O2`)	8.4 s
Clean (version 2.4)	7.3 s
`rsplt` (LLVM version 3.6, `-O2`)	2.5 s

I measured this as accurately as possible. But like all benchmarks, this one is probably still very innaccurate. I recorded the best time of 5 runs on my laptop, which yielded in these execution times.

Our codegen was fast. We especially liked the fact that our SPL program was a lot faster than the equivalent Clean program.

I was actually pretty curious as to why we were faster that the Clean program. So I went as far as profiling the object code with callgrind. As far as I remember at the moment, the assembly code of both programs was almost identical. The hot code in the Clean binary contained about two more instructions than the rsplt binary, so there was actually no real difference.

We finished the project in time and our final presentation was quite fun. We had actually built a compiler from scratch. In the end it turned out to be easier than expected, but this was mainly because the work was split in easy tasks. Also, our teachers were pretty competent.

Lessons learned ¶

Of all the courses that I have done while studying during my undergraduate degree, this has most certainly been one of the most insightful ones. Not only did I learn a lot about compiler internals, it was also a good experience maintaining a large project (with ugly code) for five months.

Don’t be too ambitious ¶

When starting the project, we already thought of a lot of nice features that our compiler should have: Colored output. Pretty error reporting. Full polymorphic functions with monomorphization.

If you are coding something for the first time, try to get your shit working first. In the end, you will see that you could have done everything better. During your first rewrite you can go implement your pretty error reporting or your super-clever garbage collection.

Hands-on knowledge matters ¶

In the beginning of the course we were to choose a language to write our compiler in. We were urged to use a functional programming language (they recommended their own language Clean). We were stubborn of course. Programming in a functional language is just harder than programming in an imperative language.

In the end this turned out not to be a problem. The parsing and type-checking stage may have been a bit harder to implement, but we found that the language does not really matter.

What mattered more was the way we wrote our code. A good rule of thumb, when writing complicated code, would be “Think before you type”. Most code is easy, some isn’t. So when you are going to implement your parser consider beforehand: “Am I going to use parser combinators?” “Should I first rewrite my language’s grammar?”

Also use a structured workflow. Use git. It does not really matter how well you know Haskell if you have to download your newest version from Dropbox for your demo.

New programming language? Hard project? Choose one! ¶

We wanted to use a typed language, so Python was out of the picture. We did not choose C or C++ (which was used by friends of ours in the past), because we did not want to spend most of our time doing memory management. We did not choose Java, well… because it’s Java. So in the end we settled on Rust, because it seemed like a language that we should probably learn anyway (the same holds for Go). It has pattern matching and a very nice macro system (useful for making parser combinators), plus it has a very nice testing framework and an easy package manager. One downside was that the language was pretty new to both of us.

This would not have been a problem if the task at hand was easy. Like building some kind of web server, or a simple game. That is easy. Building a compiler for the first time is not easy. Halfway through our project we found out that we should not have used Rust. It would have saved us tons of time, allowing us to work on our compiler, instead of debugging all those cannot move out of borrowed content errors. We should have used a language that we both already knew, like Python (screw strong typing). Or we should have done an easy project. Trying both was just a bad idea.

You didn’t really understand the errors ¶

Spoiler: You probably still don’t understand then now!

Which brings us to the third major lesson: compiler errors are terrible. We get trained to know what to fix when we see error: cannot unify types [v9], v13. We only know that there was some kind of type error. But we sure as hell don’t know what “unification” means.

A fun thing about building your own compiler is that you yourself encounter the same errors in your input. We actually learned what type inference is and what unification actually means. Now I recognize these exact same errors when compiling code in other languages.

However, compiling code is complex. Often compilers fail to generate reasonable error messages that are both complete and concise. (The Elm language does this very well, by the way.)

This last skill is the main reason why I urge you to implement your own compiler some time. Aside from the point that you should have written one at least once in your life.

Conclusion ¶

I got a 7 (out of 10) for the course. A reasonable mark, for my standards. A fun course. I don’t mind having put a lot more time in this course than in other courses. Building a compiler is challenging and a lot of fun. I am now less intimidated by compiler code in the wild.

I would rather do this course than learning about all types of access control models. I am still waiting for my university to set up a Kernel construction course.