Continuing our journey toward a fully working compiler, we will now build the parser that will allow us to produce an AST out of the program we receive as input.
Ever closer to being done with the frontend
However, before diving into the parser, some things must be taken care of.
Refactoring the lexer
The lexer that we previously built is fine as a first draft that (mostly) follows common (and old!) language-building design. However, such a design might not scale well when working with sizeable programs written in (verbose) programming languages.
Definitely not thinking about a language in particular
The thing is that lexing such programs all at once will rapidly consume way too much memory. And this is only for producing tokens. Just imagine if we also want to produce information about the source location of a token or the identifier associated with a token. All of this adds up to a hefty price in memory consumption.
The thing to realize about the lexer and the parser is that they’re not two out-of-touch entities that only communicate when the lexer hands off a token sequence to the parser. In reality, they’re two cooperating entities that work together to produce an AST out of the text of a program. This is why they’re so often used as a motivating example for what coroutines might be useful for.
Thus, in the real world, lexers more often than not offer a service to fetch tokens from a program one at a time rather than all at once. For those of you who know a bit about Flex and Bison, this should feel pretty familiar since this is how the lexers and parsers they generate operate.
With all that said, let’s modernize our lexer. First, let’s go over the Lexer class definition.
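Something along these lines should do; this is only a minimal sketch, and aside from mCurrentNumber, mCurrentIdentifier and mLoc, which are discussed below, the names are illustrative placeholders rather than the exact ones used in TosLang:

```cpp
#include <locale>
#include <string>

class Lexer
{
public:
    enum class Token
    {
        // Keywords, literals, operators...
        VAR, FUNCTION, IDENTIFIER, NUMBER, STRING_LITERAL,
        COLON, COMMA, EQUAL, SEMI_COLON, LEFT_PAREN, RIGHT_PAREN, TYPE,
        // ...plus some bookkeeping values
        TOK_EOF, UNKNOWN
    };

public:
    // Makes the lexer ready to walk over the given program text
    void Init(const std::string& srcBuffer)
    {
        mBuffer = srcBuffer;
        mBufferIt = mBuffer.cbegin();
        mCurrentLine = 1;
    }

    // Hands out tokens one at a time instead of lexing the whole program upfront
    Token GetNextToken();

    // Accessors the parser uses to recover the data tied to the last token
    int GetCurrentNumber() const { return mCurrentNumber; }
    const std::string& GetCurrentIdentifier() const { return mCurrentIdentifier; }

private:
    // These three let the lexer walk over the program's text
    std::string mBuffer;                    // The program's text
    std::string::const_iterator mBufferIt;  // Current position in the text
    int mCurrentLine;                       // Rough source location tracking

    int mCurrentNumber;              // Value of the last number literal lexed
    std::string mCurrentIdentifier;  // Name of the last identifier lexed
    std::locale mLoc;                // Locale used to classify characters
};
```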
Compared to the previous version of the Lexer class definition, this one contains way more data members. The first three are there to enable the Lexer to go over a program's text. The next two (mCurrentNumber and mCurrentIdentifier) will allow us to transmit some information that the parser might require. For example, what is the name of the function in a call expression or what is the value of a number literal. This all ties into the fact that in the Lexer-Parser relationship, the Lexer is the one that has to deal directly with the program's textual representation while the Parser toils away building the AST. Finally, the last one (mLoc) is there because we need to take the program's locale into account when lexing it.
The other major difference from the previous definition is the function for getting tokens out of a program. As you can see from the function signature, we ditched the token stream approach and now the function will only give out one token at a time. Its implementation stays mostly the same, only now we don't loop over all the program's text in one go.
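As a rough illustration, here's what such a one-token-at-a-time entry point could look like; the helper logic and keyword spellings shown are, again, only a sketch and not the actual TosLang implementation:

```cpp
// Continuation of the Lexer sketch above
Lexer::Token Lexer::GetNextToken()
{
    // Skip whitespace, using the program's locale to classify characters
    while (mBufferIt != mBuffer.cend() && std::isspace(*mBufferIt, mLoc))
        ++mBufferIt;

    // Tell the parser there is nothing left to lex
    if (mBufferIt == mBuffer.cend())
        return Token::TOK_EOF;

    if (std::isdigit(*mBufferIt, mLoc))
    {
        // Lex a number literal and stash its value for the parser
        std::string number;
        while (mBufferIt != mBuffer.cend() && std::isdigit(*mBufferIt, mLoc))
            number += *mBufferIt++;
        mCurrentNumber = std::stoi(number);
        return Token::NUMBER;
    }

    if (std::isalpha(*mBufferIt, mLoc))
    {
        // Lex an identifier or a keyword and stash its spelling for the parser
        mCurrentIdentifier.clear();
        while (mBufferIt != mBuffer.cend() && std::isalnum(*mBufferIt, mLoc))
            mCurrentIdentifier += *mBufferIt++;

        if (mCurrentIdentifier == "var")
            return Token::VAR;
        if (mCurrentIdentifier == "fn")
            return Token::FUNCTION;
        return Token::IDENTIFIER;
    }

    // Operators, punctuation, string literals, etc. would be handled here
    ++mBufferIt;
    return Token::UNKNOWN;
}
```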
With these few changes, we should be able to scale a bit better now. Also, this new version of the Lexer could easily accommodate a lookahead feature if TosLang's grammar ever evolves in such a way that lookahead is needed to disambiguate some of its rules.
For a post about the parser, there sure is a lack of it so far! Let's fix this. As the interface to our Parser, I suggest the following:
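Here is a minimal sketch of that interface, reusing the Lexer sketched above; the AST node types and most method names are illustrative stand-ins for the real ones:

```cpp
#include <memory>
#include <string>

class ASTNode;  // Base class of every AST node
class Decl;     // Declaration nodes
class Expr;     // Expression nodes

class Parser
{
public:
    // The only service offered to clients: produce an AST from a source file
    std::unique_ptr<ASTNode> ParseProgram(const std::string& filename);

private:
    // One parsing function per grammar rule, each handing ownership of the
    // node it builds back to its caller through a std::unique_ptr
    std::unique_ptr<ASTNode> ParseProgramDecl();
    std::unique_ptr<Decl>    ParseVarDecl();
    std::unique_ptr<Decl>    ParseFunctionDecl();
    std::unique_ptr<Expr>    ParseExpr();
    std::unique_ptr<Expr>    ParseStringExpr();

private:
    Lexer mLexer;                // Lazily supplies the parser with tokens
    Lexer::Token mCurrentToken;  // Cached token currently being matched
};
```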
This implementation ties into what I previously wrote about the roles in the Lexer-Parser relationship. To wit, the mLexer member variable will be there to analyze the program’s text and lazily supply the Parser with tokens. The mCurrentToken is there as a caching mechanism because the Parser will put it through quite the grinder when trying to match TosLang’s grammar rules.
The proposed interface also echoes what I wrote last month about the relations between nodes in the AST. For those of you who don't remember, the AST that we're about to build uses std::unique_ptr to both ease memory management and make clear the intent that a parent node is the unique owner of its child nodes. Consequently, it's only natural that the Parser that is going to create this AST be made of functions producing std::unique_ptrs to nodes that can then be moved into a parent node.
The last thing to mention about this interface is that the only service this class will offer to its clients is to produce an AST from a file. This is merely for educational purposes; I just want the Parser to be standalone for the moment. Now then, let's have a look at the implementation.
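A rough sketch of it could go as follows; the ".tos" extension check and the bare-bones error handling are assumptions made for illustration, and the now-standard &lt;filesystem&gt; header stands in for whatever implementation your toolchain ships:

```cpp
#include <filesystem>
#include <fstream>
#include <sstream>

std::unique_ptr<ASTNode> Parser::ParseProgram(const std::string& filename)
{
    // Make sure we were given something that at least looks like a TosLang file
    if (std::filesystem::path{ filename }.extension() != ".tos")
        return nullptr;

    std::ifstream stream{ filename };
    if (!stream)
        return nullptr;

    // Hand the program's text over to the lexer...
    std::stringstream buffer;
    buffer << stream.rdbuf();
    mLexer.Init(buffer.str());

    // ...and jump into the actual parsing, starting at the grammar's root rule
    return ParseProgramDecl();
}
```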
Basically, this method just makes sure that everything is set up for the parsing and then jumps into ParseProgramDecl to start the actual parsing. Remember that, following TosLang's grammar, the root of any program is a node of the ProgramDecl type.
Before continuing, however, I realize some may be stuck on the line where I invoke a standard function to ask about a file's extension. No, you're not dreaming. This is a C++17 feature that's coming to a standard library near you (if your preferred standard library implementation doesn't already offer it).
Now let’s see how we parse a ProgramDecl:
Here, in lieu of comments I chose to describe part of the function with grammar production rules. This was done to make clear that implementing a parser, in its most basic form, is just a translation from these rules to code. More often than not, a language’s grammar rules can be divided into three categories for parsing. Let’s examine them with examples from TosLang.
1. Rules producing a terminal node
These rules produce a terminal node like a number or a string literal. In that case, we can just parse them and produce the corresponding node without much hassle. As an example, here's how to parse a string literal in TosLang.
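A sketch of it could be as short as this; StringExpr and the accessor used are illustrative, the idea being simply that the lexer already stashed the literal's characters for us:

```cpp
std::unique_ptr<Expr> Parser::ParseStringExpr()
{
    // string_expr
    //    : STRING_LITERAL
    //    ;
    // The lexer already isolated the literal's characters for us
    auto stringExpr = std::make_unique<StringExpr>(mLexer.GetCurrentIdentifier());
    mCurrentToken = mLexer.GetNextToken();  // Consume the literal
    return stringExpr;
}
```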
It’s so simple, the lexer took care of everything!
2. Rules producing an intermediate node
These rules concern elements of the language such as declarations (either of functions or variables) or non-terminal expressions like arithmetic expressions. These will require that we dig down into each element that makes up the intermediate node, parsing them before dealing with the intermediate node itself. This digging may continue further if one or more elements of the intermediate node are themselves intermediate nodes. An example of such a rule in TosLang is the rule concerning variable declarations.
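A sketch of it, assuming a declaration syntax along the lines of "var &lt;identifier&gt; : &lt;type&gt; = &lt;expr&gt;;", could look like this; VarDecl, its setters and the token names are illustrative:

```cpp
std::unique_ptr<Decl> Parser::ParseVarDecl()
{
    // var_decl
    //    : VAR IDENTIFIER COLON TYPE ( EQUAL expr )? SEMI_COLON
    //    ;
    mCurrentToken = mLexer.GetNextToken();  // Consume 'var'
    if (mCurrentToken != Lexer::Token::IDENTIFIER)
        return nullptr;
    auto varDecl = std::make_unique<VarDecl>(mLexer.GetCurrentIdentifier());

    mCurrentToken = mLexer.GetNextToken();  // Consume the identifier
    if (mCurrentToken != Lexer::Token::COLON)
        return nullptr;

    mCurrentToken = mLexer.GetNextToken();  // Consume ':'
    if (mCurrentToken != Lexer::Token::TYPE)
        return nullptr;
    varDecl->SetType(mLexer.GetCurrentIdentifier());

    // Dig down into the initialization expression if there is one
    mCurrentToken = mLexer.GetNextToken();  // Consume the type
    if (mCurrentToken == Lexer::Token::EQUAL)
    {
        mCurrentToken = mLexer.GetNextToken();  // Consume '='
        varDecl->SetInitExpr(ParseExpr());
    }

    if (mCurrentToken != Lexer::Token::SEMI_COLON)
        return nullptr;
    mCurrentToken = mLexer.GetNextToken();  // Consume ';'

    return varDecl;
}
```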
3. Rules with recursion
There exist rules within a grammar that specify that a sequence of nodes of the same type will be produced. These take the form of recursive rules, such as the one in ParseProgramDecl that specifies that a ProgramDecl is made of several Decls. Such rules can be translated into code by turning them into while loops. As an example, let's look at what happens when parsing the parameters of a function declaration.
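Sketched out, the loop inside a hypothetical ParseFunctionDecl could look like this; ParseParamVarDecl and the token names are illustrative:

```cpp
// Somewhere inside a hypothetical ParseFunctionDecl, after consuming '('...
// params
//    : param_decl COMMA params
//    | param_decl
//    | /* empty */
//    ;
// The recursion in the rule above simply becomes a while loop in code.
std::vector<std::unique_ptr<Decl>> params;   // Requires <vector>
while (mCurrentToken != Lexer::Token::RIGHT_PAREN)
{
    // Each parameter is essentially a small variable declaration
    auto param = ParseParamVarDecl();
    if (param == nullptr)
        return nullptr;
    params.push_back(std::move(param));

    // Parameters are separated by commas
    if (mCurrentToken == Lexer::Token::COMMA)
        mCurrentToken = mLexer.GetNextToken();
}
mCurrentToken = mLexer.GetNextToken();       // Consume ')'
```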
Following these guidelines, one is able to implement what is commonly referred to in formal parlance as a recursive descent parser. In other words, what we described is a parser that starts at the grammar's root rule and recursively descends through its productions, generating terminal nodes at the leaves and assembling them into parent nodes on the way back up to produce the AST.
Finally, take note that the implementation presented doesn’t handle errors in any way. This is a topic for another post.
For more details on the implementation of TosLang’s Parser, you can go check the full source code here.
Coming up next month: Starting semantic analysis with something quite curious…