Let's implement a language - Part 1 - The Lexer
This month marks the beginning of our journey into compiler development territory. For our first steps, we’ll start with an overview of modern compiler architecture and the development of a lexer capable of handling the TosLang programming language.
As you read along, you might notice that some details are missing from the programming examples. This is intended since some parts of what we are going to implement are quite trivial and repetitive and I didn’t want to obscure the examples by using macros or template metaprogramming. For the full code listing of the TosLang compiler, you just need to go here.
Now, without further ado, let’s dive into compiler development.
What’s in a compiler?
For most programmer a compiler is simply a program that transform the code they have written into an executable. How it achieves this however is a total mystery. In reality, there’s really nothing mysterious about the work a compiler does. It’s just that it’s just such a big piece of software involving so many aspects of software engineering that people often shy away from delving deeper and prefer to let someone else deal with it.
Let’s try to eschew the overcomplicated parts and present a high-level overview of a traditional static compiler. The most popular design for this kind of compiler is the three phase design as illustrated below.
I won’t go into every details here (as as everything will be explained as we implement the compiler) but what you have to know about each part is this:
- Frontend: Responsible for making sure the program respect the language specification. Can also perform some language-specific optimizations.
- Optimizer: Performs analysis and transformation on an intermediate representation of the program to optimize it. Since it works on an intermediate representation, the optimizations done here depends neither on the programming language not the target architecture.
- Backend: Performs optimizations based on the architecture the program is being compiled for and generate
The TosLang Compiler
Since TosLang is a statically typed language, its compiler will follow the three phase design. However, before even considering to implement one phase, we first need to define what the interface to the TosLang compiler is going to be. As we know, from a regular user point of view, what a compiler does, regardless of the language, is take a file name and use this information to produce an executable file through what may as well be black magic. Thus, as a starting point, I propose the following as a first draft for the TosLang compiler:
What this’ll do is take a TosLang program filename and generate an executable with a arbitrarily chosen name (let’s say a.out). Next, let’s implement the Compiler::Run function:
Notice that I’m using
std::istreambuf_iterator to fetch the content of the file and build my buffer. The reason behind this choice is that we want the raw unformatted string representing the program and we want it fast which is exactly what we’re going to get by using this iterator.
After the buffer has been filled, the next thing the method does is build a sequence of tokens out of the buffer. The next section offers insight as to why this must be done.
What does a lexer do anyway?
Let’s zoom in on the on the compiler design picture.
As I mentionned a compiler frontend has the responsibility of making sure the program respects the specification of the language it is written with. To do so, it must ask and answer the three following questions: What is being written? Does it respect the language’s grammar? And does any make sense?
The role of the lexer is to answer the first of these questions. So for each element of the program the lexer has to identify what elements of the programming language are being used. Is is a integer literal? Is it a keyword? If so, which is it? Each time it finds an answer, a token will be associated with the program element (keyword, identifier, operator and so on). Thus, when it has gone over all the program, it will have a sequence of tokens to give off to the next phase of the frontend i.e. the parser.
The first part of the implementation of a programming language is to define what tokens will there be to identify. For TosLang, the tokens will be as follows:
Next, let’s define the interface of the lexer.
I know this doesn’t require a class for the moment. But trust me, it will grow sooner than later.
Now, let’s dive into the lexer.
In the end, our lexer will be able to take a TosLang program like the one below:
and produce the following output:
[FUNCTION], [IDENTIFIER], [LEFT_PAREN], [IDENTIFIER], [COLON], [TYPE], [RIGHT_PAREN], [ARROW], [TYPE], [LEFT_BRACE], [PRINT], [STRING_LITERAL], [SEMI-COLON], [RETURN], [SEMI-COLON], [RIGHT_BRACE]
This about sums up the first step toward developing a full-fledged compiler for the TosLang programming language. While this doesn’t look like much for the moment, this development is crucial for what is coming up later.
Also, before I’m being reproached of being a poor developper, I am fully aware that there’s currently zero error handling done in the lexer. So if someone wanted to put a letter in the middle of a number (like this: 123b456), this would immediately result in a wrong output. However, fear not my young padawans, we will deal with error handling later down the road. This example only purpose was to demonstrate how a simple lexer could be implemented.
Coming up next month: Designing an AST