Creating a programming language

In this post we're going to look over the main things required in designing and implementing a programming language.

# Overview

Before writing any code and before designing any language construct, we need to think about the core of the language that we want to create. The core of the language is the most important part which defines the language itself.

We also have to choose a programming paradigm for our language from the following list:
  • imperative
  • declarative
  • functional
  • object-oriented
  • procedural
  • logic
  • symbolic

A language can have more than one paradigm. For example, it can be imperative and object-oriented at the same time. This are called multi-paradigm languages.

After this two criteria are met, we can start thinking about a syntax for our language and begin carefully to design language constructs and implement them.

In order to implement our programming language, we need to write a parser which creates an abstract syntax tree (AST) from our source code. The abstract syntax tree is extremely important and we need to think carefully about its structure, which must be able to include all the aspects of our language, such as namespaces, classes, methods, functions, expressions, etc...

After we have our AST, we can either interpret it by walking it with an interpreter, or translate it into another language. If our language is statically typed, we can generate C or Assembly code from the AST, and later compile it into machine code.

# The core of the language

The core of the language is the mechanism which connects all the pieces together and makes the language work.

Ideally, everything in a programming language should be consistent with all the other parts, and all the language constructs should work with each other in harmony.

If we choose an object-oriented paradigm, then everything in our language should be an object. Calling a method on an object, should return another object on which we can call another method, and so on.

Example of addition in object-oriented style:
  • 1.add(2).add(3)
  • 1.add(2.add(3))

The evaluation is from left to right, calling one method at a time.

Applying this rules, we have a core for our language. This means that any operation will follow this basic rules and each language feature that we add later must obey this rules as well.

# The syntax

The syntax of our language is also important. We have to carefully design it in order for it to be easy to write, read and understand.

It's also important to think about the flexibility and expressiveness of the syntax, allowing programmers to express themselves when they write a program in our language, and not to enforce them to use only a specific style.

While our language may be object-oriented, writing the "add" method for addition, instead of the "+" operator, may be very frustrating, so we might want to think about offering other ways for doing that.

Also, there are people which may prefer prefix notation, instead of infix notation, while others prefer postfix notation.

Prefix notation:
  • (+ 1 2 3)
  • sum(1, 2, 3)

Infix notation:
  • 1 + 2 + 3
  • 1.add(2).add(3)

Postfix notation:
  • (1 2 3 +)
  • (1, 2, 3)sum

If we decide to allow different notations, we might want to translate them correctly into method calls at parse-time, in order to integrate them well with the core of the language.

Another aspect is the design of classes, functions and control-flow constructs. Before doing that, we need to choose a structure of blocks for our language.

We can choose between different styles, such as:

The first style is the most popular one and it's borrowed from the syntax of the C programming language.

If we include functions in our language, we also have to choose a style for defining them, which we can borrow from a language that we like, or design something new exclusively for our language.

Some examples for a curly-bracket language, are:
  • def foo() { ... }
  • func foo() { ... }
  • sub foo() { ... }
  • -> foo() { ... }
  • foo() = { ... }

Loops have different forms and styles as well, depending mainly on the programming paradigm that we've chosen. For example, in functional programming, we may not have any form of loops at all, because we can do anything recursively. (However, anything that can be done recursively, can also be implemented iteratively, and vice versa.)

Some forms of iteration are:
  • for (;;) { ... }
  • loop(n) { ...  }
  • n.times { ... }
  • { ... } * n

Choosing the right syntax for our language, may be a difficult task, but it's a very important aspect in designing a programming language. The syntax influences the amount of time required by someone to learn our language, and the amount of time required by someone to read and understand the code that was written in our language.

# The parser

The parser is the program which understands the syntax of a language and creates the abstract syntax tree.

I will not go into the details of creating a parser, but I'm going to talk a little bit about the AST the parser generates. As specified earlier, the structure of the AST must be carefully designed in order to include all the specifications of our language.

To illustrate how the AST might look like, let's consider the following code:
  • 1.add(2).add(3)

By parsing the code, the parser generates an AST that looks something like this:

var AST = Hash(
    self => Number(1),
    call => [
              Hash(method => "add", arg => [Hash(self => Number(2))]),
              Hash(method => "add", arg => [Hash(self => Number(3))]),
The generated AST has a self object and an array of method calls, each method taking an argument of type number, mirroring exactly the meaning of the code that was parsed.

Before generating code from the AST or before interpreting it, we might want to optimize it by applying various optimization techniques, such as constant folding.

After the optimization step, we can traverse the AST and start generating code.  On the other hand, if we choose to interpret the AST, instead of generating low-level code, we can simply execute each method-call that we encounter while traversing the AST.

Here is an example of a very basic interpreter:

func execute(expr) {
    var self_obj = expr{"self"}
    for call in (expr{"call"}) {
         self_obj = self_obj.(call{"method"})(call{"arg"}.map{execute(_)}...)
    return self_obj

say execute(AST)
It starts by assigning the value of expr{"self"} to self_obj then it starts calling each method from the expr{"call"} array, recursively evaluating each argument, replacing the value of self_obj with the result returned by each method call. In the end, the caller retrieves the last value of self_obj, which should be Number(6).

(The above example, is in fact valid Sidef code and works exactly as described).

The code generation is a whole topic on its own and it's too complex to cover it in this post, but the main idea is to traverse the AST and generate equivalent low-level code for each expression encountered along the way.

# Conclusion

In most cases, the design and the implementation of a programming language are closely linked to each other. Personally, I prefer to think about the implementation first, then I will begin to design something that fits the implementation of the current system. This trick makes the implementation process much simpler and also makes the language to integrate more nicely together.

To learn more about programming languages, I highly recommend the following books:
  • "Understanding Programming Languages" (by M. Ben-Ari)
  • "Programming Language Pragmatics" (by Michael L. Scott)
  • "Compilers: Principles, Techniques, and Tools"
  • "Structure and Interpretation of Computer Programs"

If you're fairly familiar with the Perl programming language, you can start playing with Sidef, which is a new programming language implemented completely in Perl. It explores the features of object-oriented programming, providing interesting ways for computation. The source-code is entirely open-source and can be modified freely.

See also: