Parsing your own language with ANTLR4

Have you ever wanted to write your own programming language? Let’s say that you have, because that’s my excuse to show you how ANTLR4 works.

We’ll take the example of a super-simple functional language where you can call methods with strings:

print(concat("Hello ", "World"))

We’ll call our language “C3PO”. It sounds like a good name.

First things first. How do you define the structure of a language?

Introducing ANTLR4 grammars

ANTLR4 is, you guessed it, the fourth version of ANTLR. ANTLR stands for ANother Tool for Language Recognition. Because why not.

ANTLR allows you to define the “grammar” of your language. Just like in English, a grammar explains what structure is allowed (and what isn’t). Unlike English, however, an ANTLR grammar is strictly formal, so a computer can easily understand it. Let me show you what it looks like!

// Our grammar is called C3PO.
grammar C3PO;

// We define expression to be either a method call or a string.
expression
    : methodCall
    | STRING
    ;

// We define methodCall to be a method name followed by an opening
// paren, an optional list of arguments, and a closing paren.
methodCall
    : methodName '(' methodCallArguments ')'
    ;

// We define methodName to be a name.
methodName
    : NAME
    ;

// We define methodCallArguments to be a list of expressions
// separated by commas.
methodCallArguments
    : // No arguments
    | expression (',' expression)*  // Some arguments
    ;

// NAME represents any variable or method name.
// The pattern below means "starts with a letter, followed by
// any number of alphanumeric characters".
NAME
    : [a-zA-Z][a-zA-Z0-9]*
    ;

// STRING represents a string value, for example "abc".
// Note that for simplicity, we don't allow escaping double quotes.
STRING
    : '"' ~('"')* '"'
    ;

// WS represents whitespace. The "-> skip" command tells the lexer to discard it.
WS
    : [ \t\u000C\r\n]+ -> skip
    ;

Each of these blocks (methodCall, methodCallArguments, expression, NAME, STRING) is called a rule. For now, don’t worry about the difference between lower and uppercase rules.

Of course, our programming language is missing key features such as support for numbers. Let’s not worry about that; you can add it yourself later.

Testing the grammar

Coming back to our example, let’s see how our code matches the grammar we’ve defined above.

First, set up ANTLR4 following the official instructions, which also define the antlr4 and grun aliases used below. Then run the following commands:

$ cd c3po
# Make sure that you have the grammar above saved as C3PO.g4
$ antlr4 C3PO.g4
# When successful, you will see a bunch of .java files.
$ javac C3PO*.java -cp /path/to/antlr-complete.jar
# When successful, you will see a bunch of .class files.
# Now evaluate your code using the grammar.
$ echo "print(concat(\"Hello \", \"World\"))" | \
  grun C3PO expression -tree
(expression
  (methodCall
    (methodName print)
    (
      (methodCallArguments
        (expression
          (methodCall
            (methodName concat)
            (
              (methodCallArguments
                (expression "Hello "),
                (expression "World")
              )
            )
          )
        )
      )
    )
  )
)

Our code parsed successfully! (grun actually prints the tree on a single line; I’ve indented it above for readability.)
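If you’re curious how a grammar turns into a tree, here is a rough hand-written sketch in TypeScript with one function per lowercase rule. To be clear, this is not the code ANTLR generates, and the names tokenize and parseToTree are made up for this illustration. It also silently drops any character it doesn’t recognize, which a real lexer would report.

```typescript
// Hand-written sketch for illustration only, NOT the parser ANTLR generates.
// Simplification: characters that match no token are silently dropped.
function tokenize(input: string): string[] {
  return input.match(/[a-zA-Z][a-zA-Z0-9]*|"[^"]*"|[(),]/g) ?? [];
}

function parseToTree(input: string): string {
  const tokens = tokenize(input);
  let pos = 0;

  // Consume the next token, optionally checking that it is the expected literal.
  function next(expected?: string): string {
    const token: string | undefined = tokens[pos++];
    if (token === undefined || (expected !== undefined && token !== expected)) {
      throw new Error(`Expected ${expected ?? "a token"} but found ${token ?? "<EOF>"}`);
    }
    return token;
  }

  // expression : methodCall | STRING ;
  function expression(): string {
    const token: string | undefined = tokens[pos];
    if (token !== undefined && token.startsWith('"')) {
      return `(expression ${next()})`;
    }
    return `(expression ${methodCall()})`;
  }

  // methodCall : methodName '(' methodCallArguments ')' ;
  function methodCall(): string {
    const name = next();
    if (!/^[a-zA-Z]/.test(name)) {
      throw new Error(`Expected a NAME but found ${name}`);
    }
    next("(");
    const args = methodCallArguments();
    next(")");
    return `(methodCall (methodName ${name}) ( ${args} ))`;
  }

  // methodCallArguments : /* no arguments */ | expression (',' expression)* ;
  function methodCallArguments(): string {
    const parts: string[] = [];
    if (tokens[pos] !== ")") {
      parts.push(expression());
      while (tokens[pos] === ",") {
        parts.push(next(","));
        parts.push(expression());
      }
    }
    return parts.length > 0
      ? `(methodCallArguments ${parts.join(" ")})`
      : "methodCallArguments";
  }

  return expression();
}

console.log(parseToTree('print(concat("Hello ", "World"))'));
```

Run on our example, this should print a tree in the same parenthesized style that grun uses, all on one line. Notice how each function is almost a direct transcription of the rule in its comment: that is the kind of code a parser generator saves you from writing by hand.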

Now, just for fun, let’s try some incorrect code:

$ echo "1 + 2"  | grun C3PO expression -tree
line 1:0 token recognition error at: '1'
line 1:2 token recognition error at: '+'
line 1:4 token recognition error at: '2'
line 2:0 no viable alternative at input '<EOF>'
expression

Why did this fail? Because we didn’t define any grammar rules for numbers or for the addition sign, so 1 + 2 is illegal in our language.
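The “token recognition error” lines come from the lexer, the stage that chops raw text into tokens using the uppercase rules (NAME, STRING, WS) plus the quoted literals like '('. Here is a minimal sketch of that stage in TypeScript, assuming single-line input. It is my own simplification rather than ANTLR’s algorithm: the lex function and RULES table are hypothetical names, and it takes the first rule that matches, whereas ANTLR prefers the longest match. For this grammar the two behave the same, since no two rules can start on the same character.

```typescript
type Token = { type: string; text: string };

// Each lexer rule as a sticky regular expression (the "y" flag anchors the
// match at lastIndex). PUNCTUATION stands in for the literals '(' ',' ')'.
const RULES = [
  { type: "NAME", pattern: /[a-zA-Z][a-zA-Z0-9]*/y, skip: false },
  { type: "STRING", pattern: /"[^"]*"/y, skip: false },
  { type: "PUNCTUATION", pattern: /[(),]/y, skip: false },
  { type: "WS", pattern: /[ \t\f\r\n]+/y, skip: true },
];

function lex(input: string): { tokens: Token[]; errors: string[] } {
  const tokens: Token[] = [];
  const errors: string[] = [];
  let pos = 0;
  while (pos < input.length) {
    let matched = false;
    for (const rule of RULES) {
      rule.pattern.lastIndex = pos;
      const match = rule.pattern.exec(input);
      if (match === null) {
        continue;
      }
      // Skipped rules (whitespace) consume input but produce no token.
      if (!rule.skip) {
        tokens.push({ type: rule.type, text: match[0] });
      }
      pos += match[0].length;
      matched = true;
      break;
    }
    if (!matched) {
      // No rule can start here: report the character and move past it.
      errors.push(`column ${pos}: token recognition error at: '${input[pos]}'`);
      pos += 1;
    }
  }
  return { tokens, errors };
}

console.log(lex("1 + 2"));
console.log(lex('print(concat("Hello ", "World"))'));
```

For 1 + 2, the sketch reports errors at columns 0, 2 and 4 and produces no tokens at all, which is why the parser then complains that it found nothing but the end of the input.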

How do I use this from code?

You probably don’t want to run a shell command whenever you need to parse code. Ideally, you want an API to access each node in the parsed tree.

It turns out ANTLR4 lets you generate parser code in a variety of languages: Java, C#, Python, Go, C++, Swift, JavaScript and even TypeScript!

In TypeScript, for example, here is what it could look like (after setting up antlr4ts):

import { ANTLRInputStream, CommonTokenStream } from "antlr4ts";
import * as lexer from "./C3POLexer";
import * as parser from "./C3POParser";

let code = `print(concat("Hello ", "World"))`;

// This is the scary part that you don't need to worry about: wrap the code
// in a character stream, run the lexer over it, and feed its tokens to the parser.
let inputStream = new ANTLRInputStream(code);
let l = new lexer.C3POLexer(inputStream);
let tokenStream = new CommonTokenStream(l);
let p = new parser.C3POParser(tokenStream);

// Parse and execute the code.
let result = p.expression();
evaluateExpression(result);

type ExpressionValue = string | null;

function evaluateExpression(e: parser.ExpressionContext): ExpressionValue {
  const methodCall = e.methodCall();
  if (methodCall) {
    return evaluateMethodCall(methodCall);
  }
  // Our grammar is super simple, it's always a string between double quotes.
  // Strip the surrounding quotes.
  return e.text.slice(1, -1);
}

function evaluateMethodCall(m: parser.MethodCallContext): ExpressionValue {
  let methodName = m.methodName().text;
  let methodArguments = m
    .methodCallArguments()
    .expression()
    .map((expression) => evaluateExpression(expression));
  switch (methodName) {
    case "print":
      console.log(...methodArguments);
      return null;
    case "concat":
      return methodArguments.join("");
    default:
      throw new Error("Unknown method " + methodName);
  }
}
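As you add more methods, the switch statement above keeps growing. One alternative, sketched here under a few assumptions, is a lookup table of built-in functions. To keep the sketch runnable without antlr4ts, it evaluates a hypothetical plain-data tree (Expr) rather than the generated context classes, and it collects printed lines in an array instead of calling console.log. None of these names come from ANTLR.

```typescript
// Hypothetical plain-data tree, standing in for the generated context classes.
type Expr =
  | { kind: "string"; value: string }
  | { kind: "call"; name: string; args: Expr[] };

type Value = string | null;

// Returns an evaluate function whose "print" appends to the given array,
// which makes the result easy to inspect.
function makeEvaluator(output: string[]): (e: Expr) => Value {
  // A Map (rather than a plain object) avoids accidentally resolving
  // inherited names such as "toString" as methods of our language.
  const builtins = new Map<string, (args: Value[]) => Value>([
    ["print", (args) => {
      output.push(args.join(" "));
      return null;
    }],
    ["concat", (args) => args.join("")],
  ]);

  function evaluate(e: Expr): Value {
    if (e.kind === "string") {
      return e.value;
    }
    const builtin = builtins.get(e.name);
    if (builtin === undefined) {
      throw new Error("Unknown method " + e.name);
    }
    return builtin(e.args.map(evaluate));
  }

  return evaluate;
}

// print(concat("Hello ", "World")) written out as a tree by hand.
const program: Expr = {
  kind: "call",
  name: "print",
  args: [
    {
      kind: "call",
      name: "concat",
      args: [
        { kind: "string", value: "Hello " },
        { kind: "string", value: "World" },
      ],
    },
  ],
};

const printed: string[] = [];
makeEvaluator(printed)(program);
console.log(printed);
```

With this shape, adding a new method to the language means adding one entry to the table, and the evaluation logic itself never changes.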

A sample project is available at https://github.com/fwouts/sample-antlr4-typescript if you’d like to see this in action.


Now that you’ve seen how easy it is to parse your own language, you might wonder: what about existing programming languages? Can I parse them too? The answer is YES. In fact, the grammar you need is probably already defined in the grammars-v4 repo.
