Creating your own template engine in JavaScript: part 1

Monday, March 5, 2012 Posted by Ruslan Matveev
Today I'm going to show you how to create your own template engine in JavaScript. The first question is: why? There are dozens of template engines already available, so why do we need another one? Here is my list of reasons:

  • There are many template engines, but most of them provide no API to extend them (so you cannot add your own functions, your own constructs and so on...)
  • There are JavaScript template engines that compile your template into plain JavaScript. On one hand that's good, but on the other hand we cannot reuse (port) that code in other languages (for example Java), so I've decided to make mine easily portable.
  • I used to use the Smarty template engine for PHP, so I wanted my template engine's syntax to look close to it.
So with all these reasons and a little bit of knowledge of how to build such things, I decided in the end to do it myself. Almost every template engine consists of several important parts, and our template engine is no exception:

  • Tokenizer - breaks the original template, represented as a string, into a list of tokens. Individual letters, digits and spaces don't mean much to us when we treat them separately, but identifiers (which consist of letters and digits), numeric values (integers and floats) and whitespace (which is ignored most of the time) - that is exactly what the parser needs.
  • Parser - takes the token stream produced by the tokenizer and tries to recognize language constructs in it. For example, ["var" identifier "=" number] represents an assignment, ["for" identifier "in" identifier] may represent a loop initializer, and so on. So the parser is another very important part of our template engine.
  • Evaluator - the heart of our template engine. It takes the abstract syntax tree (AST) produced by the parser, evaluates it and produces the result. For example, for an AST representing an addition, {"op": "+", "left": 10, "right": 10}, the evaluator will produce 20 as the result (see the sketch right after this list). It's the simplest, but also the most important part of the system.
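
To make the evaluator part a little more concrete, here is a minimal sketch of how such an AST could be evaluated. This is not the evaluator we will build later in this series; the evaluate function and the node shape are just assumptions for the sake of this example:

// A toy evaluator: walks an AST node and returns its value.
// Node shape assumed here: {"op": "+", "left": ..., "right": ...} or a plain number.
function evaluate(node) {
    if (typeof node === 'number') return node;
    var left = evaluate(node.left);
    var right = evaluate(node.right);
    switch (node.op) {
        case '+': return left + right;
        case '-': return left - right;
        case '*': return left * right;
        case '/': return left / right;
    }
    throw new Error('unknown operator: ' + node.op);
}

// evaluate({"op": "+", "left": 10, "right": 10}) -> 20
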
But first things first, and in this part I'll explain everything about tokenizing (also known as lexical analysis). We'll build our tokenizer on top of the JavaScript regular expression engine. How can regular expressions help us turn the template source into a list of tokens? Look, it's very simple (hello Gerbert), let's take this regular expression for example: ([0-9]+)|(for\b)|([a-zA-Z]+)|(\.)|(,)|(;)|([\x09\x0A\x0D\x20]+). This regular expression defines several tokens (there is also a short note, right after the list, about escaping it when you build it from a JavaScript string):

  • Integer numbers [0-9]+
  • Reserved word "for"
  • Identifiers [a-zA-Z]+
  • Punctuators ".", ",", ";"
  • Whitespace characters [\x09\x0A\x0D\x20]+
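
One practical detail: when you build this regular expression from a JavaScript string literal, the backslashes have to be escaped. A quick sanity check might look like this (the variable names tokenPattern and tokenRegExp are just for this example):

// The combined token expression, with backslashes escaped for the string literal.
var tokenPattern = '([0-9]+)|(for\\b)|([a-zA-Z]+)|(\\.)|(,)|(;)|([\\x09\\x0A\\x0D\\x20]+)';
var tokenRegExp = new RegExp(tokenPattern);

console.log(tokenRegExp.exec('for')); // group 2 (reserved word "for") is the one that is filled in
console.log(tokenRegExp.exec('123')); // group 1 (integer number) is the one that is filled in
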
Now, how do we turn this into a tokenizer? Here is the trick: when you execute a regular expression match in JavaScript, you can work out the number of the group that was successfully matched. Here is what I mean:

var regExp = new RegExp('(1)|(2)|(3)', 'g');
var match = regExp.exec('2');
if (!match) {
    alert('no match');
} else {
    // match[0] is the whole match; capturing groups start at index 1
    for (var c = 1; c < match.length; c++) {
        if (!match[c]) continue; // this group did not participate in the match
        alert('matching group is: ' + (c - 1));
    }
}

When you run this code snippet in your favorite browser, it should produce an alert box with "matching group is: 1" written in it. This means we can determine which regular expression group matches the beginning of our input string ("2"). The next step is to show how we can loop through the tokens:

var input = '123';
var regExp = new RegExp('(1)|(2)|(3)', 'g');
// with the 'g' flag, each exec() call continues from where the previous match ended
var match = regExp.exec(input);
while (match) {
    for (var c = 1; c < match.length; c++) {
        if (!match[c]) continue;
        alert('matching group is: ' + (c - 1));
        break;
    }
    match = regExp.exec(input);
}

When you run this code snippet in the browser, it will produce three alert boxes (one for each of the matched tokens). Play with it for a while to understand the idea. In the meantime, here is a more advanced example that incorporates the technique shown above into a very simple JavaScript Tokenizer class, which will serve as the base for the next parts of this story:

// Holds token definitions and turns an input string into a stream of tokens.
function Tokenizer() {
    this.input = '';
    this.tokens = {};
    this.tokenExpr = null;
    this.tokenNames = [];
}

// Registers a token: its name and the regular expression (as a string) that matches it.
Tokenizer.prototype.addToken = function(name, expression) {
    this.tokens[name] = expression;
};

// Compiles all registered tokens into one big alternation and stores the input.
Tokenizer.prototype.tokenize = function(input) {
    this.input = input;
    this.tokenNames = []; // reset, so calling tokenize() twice doesn't duplicate names
    var tokenExpr = [];
    for (var tokenName in this.tokens) {
        this.tokenNames.push(tokenName);
        tokenExpr.push('(' + this.tokens[tokenName] + ')');
    }
    this.tokenExpr = new RegExp(tokenExpr.join('|'), 'g');
};

// Returns the next token as {name, pos, data}, or null when the input is exhausted.
Tokenizer.prototype.getToken = function() {
    var match = this.tokenExpr.exec(this.input);
    if (!match) return null;
    for (var c = 1; c < match.length; c++) {
        if (!match[c]) continue;
        return {
            name: this.tokenNames[c - 1],
            pos: match.index,
            data: match[c]
        };
    }
    return null;
};

This is how you can use it:

var tokenizer = new Tokenizer();
tokenizer.addToken('number', '[0-9]+');
tokenizer.addToken('for', 'for\\b');
tokenizer.addToken('identifier', '[a-zA-Z]+');
tokenizer.addToken('dot', '\\.');
tokenizer.addToken('comma', ',');
tokenizer.addToken('semicolon', ';');
tokenizer.addToken('whitespaces', '[\\x09\\x0A\\x0D\\x20]+');

tokenizer.tokenize('.for 123 foobar for .,;');

var token;
while (token = tokenizer.getToken()) {
    console.info(token);
}
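
If everything is wired up correctly, the console should show the tokens one by one, roughly along these lines (the exact object formatting depends on your browser console, and the group order assumes the for-in loop in tokenize visits the tokens in the order they were registered, which is what browsers do in practice):

// {name: "dot",         pos: 0, data: "."}
// {name: "for",         pos: 1, data: "for"}
// {name: "whitespaces", pos: 4, data: " "}
// {name: "number",      pos: 5, data: "123"}
// {name: "whitespaces", pos: 8, data: " "}
// {name: "identifier",  pos: 9, data: "foobar"}
// ...and so on, ending with the dot, comma and semicolon tokens.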

That's all for today. In the second part we will improve our tokenizer to handle more advanced stuff and take our first step into the parsing process. Feel free to ask any questions in the comments.

UPDATE: part 1.5 is finally out and it's here.
