Parsing and code generation

In this post we're going to take a close look at how to put together a basic parser and how to generate real code from the Abstract Syntax Tree (AST).


# Overview

A couple of posts ago, we talked about the basics behind the creation of a programming language, but we didn't look at any real code how to actually do that. This time, we're going to get our hands dirty and write a real parser and a code generator for a very small language subset.

# The parser

 For simplicity, our parser will be able to understand only variable declarations, numbers, arithmetic operators, nested expressions and the say keyword which will print something to the screen.

The first thing before writing a parser, it's the design of the AST that it should produce. The AST have a recursive definition with sub-branches that resembles the top of the tree. A very simple design for the AST, is the following:
  • main
    • the top of the AST
  • self
    • an object or an expression
  • call
    • a list of calls for the self object
    • op
      • an operator
    • arg
      • an expression
...where an object represents the actual value, such as the value of a number or the content of a string, but encapsulated inside an object of a specific type, while an expression it's a recursive definition that resembles the top of the tree.

Now that we have a basic design for our AST, we can begin designing the parser. We first need to create our main function, which will take a given string as input, returning the AST in the end. Secondly, we need a function to parse the objects and the expressions, which may call the main function when it encounters a nested expression.

For a simple expression, such as 1+2, the parser returns the following AST:
{
    main => [
        {
         self => bless({value => 1}, "Number"),
         call => [{op => "+", arg => {self => bless({value => 2}, "Number")}}],
        }
    ]
}

We have the main key which is the top of the AST, containing one statement which is composed of a self object, and a call to the + operator that takes a second argument.

The code of the parser is not included here, but it can be found at the end of the post.

# Code generation

To generate code from the AST, we have to create two functions that will traverse the tree and translate the objects and the expressions into an equivalent code that we want to generate.

Same as in the parser, he have a main function which will iterate over the statements, calling the second function to do the real work, which may call back the main function when it encounters a nested expression, resulting in a mutual recursion.

A very basic concept for a code generator looks like this:
# The main function
sub generate {
    my ($ast) = @_;

    my @statements;
    foreach my $statement (@{$ast->{main}}) {
        push @statements, generate_expr($statement);
    }

    return join(";\n", @statements);
}

# The secondary function
sub generate_expr {
    my ($expr) = @_;

    my $code = '';
    my $obj  = $expr->{self};

    my $ref = ref($obj);
    if ($ref eq 'HASH') {
        $code = '(' . generate($obj) . ')';
    }
    elsif ($ref eq 'Number') {
        $code = $obj->{value};
    }

    foreach my $call (@{$expr->{call}}) {
        $code .= $call->{op};
        $code .= generate_expr($call->{arg});
    }
    return $code;
}

If we call the generate() function with the AST for the expression: 3 * (1 + 2)

my $ast = {
  main => [{
   self => bless({value => 3}, "Number"),
   call => [{op  => "*", arg => {
    self => {main => [
      {
        self => bless({value => 1}, "Number"),
        call => [{op => "+", arg => {self => bless({value => 2}, "Number")}}],
      }
    ]}
   }}]
 }]
};

print(generate($ast), "\n");

will generate back the following result: 3*(1+2)

# Conclusion

Parsing and code generation lie at the heart of programming language implementations. However, the concepts are useful in other domains as well. For example, if we want to apply some transformations to a structured text, it's much easier and safer to parse the text into an AST and generate a new text, than applying the transformations to the plain text directly.

# Implementation

Bellow we have the implementation for a very basic parser and a rudimentary code generator, that is able to parse the following code:
var x = 42
var y = (81 / 3)
say (x^2 * (3+y) - 1)
...and produce an equivalent Perl code:
my $x = 42;
my $y = (81 / 3);
print(($x ** 2 * (3 + $y) - 1), "\n")