Creating your own template engine in JavaScript: part 1.5

Tuesday, November 19, 2013 Posted by Ruslan Matveev
It's been a while since I had time to write something here. Yeah yeah yeah, I'm just a lazy guy, but I will try to do it more often. Last time I promised you the second part of this never ending story, and that was about a year ago! I've learned many new tricks since then, so before going into the second part of "creating your own template engine in JavaScript", I've decided to rewrite my uber Tokenizer from the first part. What was wrong with it? Well, many things:

  • too many strings and identifiers that I had to repeat every time I asked it for a token
  • escaping regular expressions inside string literals was a real pain in the ass
  • ignored tokens (which are sometimes very nice to have) were completely ignored, and there was no way to read them at all
  • all the context switching stuff looked very nice a year ago, but now I would do it a bit differently

1. Use regular expression literals for token definitions


As I mentioned in the introduction, passing regular expressions wrapped in string literals for token definitions was not a very nice solution. I had to think about what had to be escaped and what not, and why this piece of crap didn't work as I expected, so in the end I've decided to simplify this part a little bit. JavaScript has a special data type for regular expressions, so why not use it instead of strings? Hmm, why instead? It would be a great feature to use regexp literals and string literals for different kinds of tokens. For example:

// define keyword using string literal
tokenizer.addToken('RETURN', 'return'); 
// define "." using string literal
tokenizer.addToken('DOT', '.'); 
// define token using regular expression
tokenizer.addToken('ID', /[_$a-zA-Z\xA0-\uFFFF][_$a-zA-Z0-9\xA0-\uFFFF]*/);

Looks much cleaner, right? Instead of escaping dots, brackets and all the other stuff like that in order to form a valid regular expression, we'll just put "pure" string tokens into strings and use regular expressions only when we really need them. But as you remember from the first part, all our tokens will eventually be combined into one huge regular expression. What does that mean for us? It means that we now have to escape the special regular expression characters in the string literals in order to produce a valid regexp. We know which characters have to be escaped, so we can use a RegExp to escape a RegExp ;) So now the addToken method will look like this:

// used to escape strings to use within regexp
var REGEXP_ESCAPE = /([.?*+^$[\]\\(){}|-])/g;

Tokenizer.prototype.addToken = function(name, expression) {
    // check if expression is RegExp literal
    if (expression instanceof RegExp) {
        // turn RegExp to string
        expression = expression.toString();
        // get rid of leading and trailing slashes
        expression = expression.slice(1, -1);
    } else {
        // escape special regular expression characters with "\"
        expression = expression.replace(REGEXP_ESCAPE, '\\$1');
    }
    ...
};
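
Just to double check what the escaping actually does to a plain string token:

// "." becomes "\." and "(" becomes "\(", so both can safely become
// part of the combined regular expression
console.info('.'.replace(REGEXP_ESCAPE, '\\$1')); // "\."
console.info('('.replace(REGEXP_ESCAPE, '\\$1')); // "\("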

Let's try to make something usable by the end of this part. Our tokenizer now looks like this:

// used to escape strings to use within regexp
var REGEXP_ESCAPE = /([.?*+^$[\]\\(){}|-])/g; 

function Tokenizer() {
    this.input = '';
    this.tokens = {};
    this.tokenExpr = null;
    this.tokenNames = [];
} 

Tokenizer.prototype.addToken = function(name, expression) {
    // check if expression is RegExp literal
    if (expression instanceof RegExp) {
        // turn RegExp to string
        expression = expression.toString();
        // get rid of leading and trailing slashes
        expression = expression.slice(1, -1);
    } else {
        // escape special regular expression characters with "\"
        expression = expression.replace(REGEXP_ESCAPE, '\\$1');
    }
    this.tokens[name] = expression;
}; 

Tokenizer.prototype.tokenize = function(input) {
    this.input = input;
    var tokenExpr = [];
    for (var tokenName in this.tokens) {
        this.tokenNames.push(tokenName);
        tokenExpr.push('('+this.tokens[tokenName]+')');
    }
    this.tokenExpr = new RegExp(tokenExpr.join('|'), 'g');
}; 

Tokenizer.prototype.getToken = function() {
    var match = this.tokenExpr.exec(this.input);
    if (!match) return null;
    for (var c = 1; c < match.length; c++) {
        if (!match[c]) continue;
        return {
            name: this.tokenNames[c - 1],
            pos: match.index,
            data: match[c]
        };
    }
};

This is how you can use it:

var tokenizer = new Tokenizer();
tokenizer.addToken('number', /[0-9]+/);
tokenizer.addToken('for', /for\b/);
tokenizer.addToken('identifier', /[a-zA-Z]+/);
tokenizer.addToken('dot', '.');
tokenizer.addToken('comma', ',');
tokenizer.addToken('semicolon', ';');
tokenizer.addToken('whitespaces', /[\x09\x0A\x0D\x20]+/);

tokenizer.tokenize('.for 123 foobar for .,;');

var token;
while (token = tokenizer.getToken()) {
    console.info(token);
}
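
For that particular input the loop should log a token stream roughly like this (give or take the exact console formatting):

{ name: 'dot',         pos: 0,  data: '.'      }
{ name: 'for',         pos: 1,  data: 'for'    }
{ name: 'whitespaces', pos: 4,  data: ' '      }
{ name: 'number',      pos: 5,  data: '123'    }
{ name: 'whitespaces', pos: 8,  data: ' '      }
{ name: 'identifier',  pos: 9,  data: 'foobar' }
{ name: 'whitespaces', pos: 15, data: ' '      }
{ name: 'for',         pos: 16, data: 'for'    }
{ name: 'whitespaces', pos: 19, data: ' '      }
{ name: 'dot',         pos: 20, data: '.'      }
{ name: 'comma',       pos: 21, data: ','      }
{ name: 'semicolon',   pos: 22, data: ';'      }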

2. Consume and test tokens


The next target for our optimizations is the getToken method. Right now, if you want to check whether the next token is, for example, an identifier, you have to obtain the token using the getToken method and then test the token name. It's not a very convenient system to use, so we'll have to optimize it as well. Instead of a single getToken method we'll need two methods: next and test. Both of them take an optional argument - the required token id. The only difference between them is that test returns true or false depending on whether the next token id is the one we passed as an argument, while next returns the token if it is the one we passed as an argument and moves forward, so we can match the next token, and the next, and the next. Well, I understand that it may sound like nonsense, but I know JavaScript better than English, so I'll try to explain it with the following code block:

var tokenizer = new Tokenizer();

tokenizer.addToken('FOR', /for\b/);
tokenizer.addToken('DOT', '.');
tokenizer.addToken('WHITESPACES', /[\x09\x0A\x0D\x20]+/);

tokenizer.tokenize('.for ');
// match DOT and move forward
console.info(tokenizer.next(tokenizer.DOT));
// match FOR and move forward
console.info(tokenizer.next(tokenizer.FOR));

Method "next" tries to match a token, if it's successful, it returns you matched token and when you call it next time it will retrieve you a next token, then next one and so on.

var tokenizer = new Tokenizer();

tokenizer.addToken('FOR', /for\b/);
tokenizer.addToken('DOT', '.');
tokenizer.addToken('WHITESPACES', /[\x09\x0A\x0D\x20]+/);

tokenizer.tokenize('.for ');
// check if the current token is DOT -> true
console.info(tokenizer.test(tokenizer.DOT));
// check if the current token is FOR -> false
console.info(tokenizer.test(tokenizer.FOR));

Method "test" tries to match a token and returns you true if it's successful and false in any other case. With this two simple methods you can already make some simple parser which will allow you to control which token sequences do you want to accept and which ones must be rejected:

if (tokenizer.test(tokenizer.DOT)) {
    console.info('hey I found "."');
    console.info('here it is:', tokenizer.next());
}
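
For example, a tiny "expect" helper built on top of "next" is already enough to accept exactly the sequence you want and reject everything else (the helper itself is just an illustration, it's not part of the tokenizer):

// demand a token of the given type, reject anything else
function expect(type, what) {
    var token = tokenizer.next(type);
    if (!token) throw 'EXPECTED ' + what;
    return token;
}

// accepts ".for" and nothing else
expect(tokenizer.DOT, '"."');
expect(tokenizer.FOR, '"for"');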

After a couple of "tiny" changes our tokenizer will look like this (ask me if you need a help to understand it):

// used to escape strings to use within regexp
var REGEXP_ESCAPE = /([.?*+^$[\]\\(){}|-])/g;

var T_EOF = -1;
var T_ERR = -2;

function Tokenizer() {

    // input string and its length
    var inputString, inputLength;
    // token buffer
    var tokenBuffer = [];

    var tokenExprs = [];
    var tokenRegExp = null;
    var tokenIds = [];

    var lastTokenId = 0;

    function readTokenToBuffer() {

        // init local variables
        var startPos, matchPos, matchStr, match, length;

        for (;;) if (tokenRegExp.lastIndex !== inputLength) {
            startPos = tokenRegExp.lastIndex;
            if (match = tokenRegExp.exec(inputString)) {
                matchStr = match[0], matchPos = match.index;

                // check if we have T_ERR token
                if (length = matchPos - startPos) {
                    tokenBuffer.push({
                        type: T_ERR,
                        pos: startPos,
                        value: inputString.substr(startPos, length)
                    });
                }

                length = match.length;

                // find the index of the capture group that matched; groups
                // are 1-based while tokenIds is 0-based, so once the
                // post-decrement stops, "length" is the right tokenIds index
                while (match[length--] === undefined);

                // obtain the token id
                match = tokenIds[length];

                // return matched token
                return tokenBuffer.push({
                    type: match,
                    pos: matchPos,
                    value: matchStr
                });

            }

            // return T_ERR token in case we couldn't match anything
            else return (
                tokenRegExp.lastIndex = inputLength,
                tokenBuffer.push({
                    type: T_ERR,
                    pos: startPos,
                    value: inputString.slice(startPos)
                })
            );

        }
        // return T_EOF if we reached end of file
        else return tokenBuffer.push({
            type: T_EOF,
            pos: inputLength
        });
    }


    function addToken(tokenId, expression) {
        // check if expression is RegExp literal
        if (expression instanceof RegExp) {
            // turn RegExp to string
            expression = expression.toString();
            // get rid of leading and trailing slashes
            expression = expression.slice(1, -1);
        } else {
            // escape special regular expression characters with "\"
            expression = expression.replace(REGEXP_ESCAPE, '\\$1');
        }

        tokenId = (
            this.hasOwnProperty(tokenId) ?
            this[tokenId] : this[tokenId] = ++lastTokenId
        );

        tokenExprs.push('(' + expression + ')');
        tokenIds.push(tokenId);

    }

    function tokenize(input) {
        inputString = input;
        inputLength = input.length;
        tokenBuffer = [];
        tokenRegExp = tokenExprs.join('|');
        tokenRegExp = new RegExp(tokenRegExp, 'g');
        tokenRegExp.lastIndex = 0;
    }

    function next(type) {

        if (!tokenBuffer.length)
            readTokenToBuffer();

        if (!arguments.length ||
            tokenBuffer[0].type === type) {
            return tokenBuffer.shift();
        }

    }

    function test(type) {
        if (!tokenBuffer.length)
            readTokenToBuffer();
        return tokenBuffer[0].type === type;
    }

    return {
        addToken: addToken,
        tokenize: tokenize,
        next: next,
        test: test
    };

}

This is how you use it:

var tokenizer = new Tokenizer();

tokenizer.addToken('FOR', /for\b/);
tokenizer.addToken('NUMBER', /[0-9]+/);
tokenizer.addToken('ID', /[a-zA-Z]+/);
tokenizer.addToken('DOT', '.');
tokenizer.addToken('COMMA', ',');
tokenizer.addToken('SEMICOLON', ';');
tokenizer.addToken('WHITESPACES', /[\x09\x0A\x0D\x20]+/);

tokenizer.tokenize('.for 123 foobar for .,;');

// match dot and move forward
console.info(tokenizer.next(tokenizer.DOT));
// match for and move forward
console.info(tokenizer.next(tokenizer.FOR));
// match WHITESPACES and move forward
console.info(tokenizer.next(tokenizer.WHITESPACES));

// try to match identifier
// this will return undefined because next token is number
console.info(tokenizer.next(tokenizer.ID));
// try to match number
// result token is number of course
console.info(tokenizer.next(tokenizer.NUMBER));

// test next token, it's WHITESPACES
console.info(tokenizer.test(tokenizer.WHITESPACES));
// since test doesn't move forward, the next token is still WHITESPACES
console.info(tokenizer.next(tokenizer.WHITESPACES));

// we can also use next with no arguments
// in this case it will return next token of any type
// will return ID = foobar
console.info(tokenizer.next());

// using these two simple methods we can do powerful things
// following block will expect WHITESPACES followed by FOR
if (tokenizer.next(tokenizer.WHITESPACES)) {
    console.info("found WHITESPACES");
    if (tokenizer.next(tokenizer.FOR)) {
        console.info("found FOR");
    } else {
        console.error('expected FOR');
    }
} else {
    console.error('expected WHITESPACES');
}

You can check this jsFiddle if you want to play around with it.


3. Ignore junk tokens


If you take any programming language (a template engine can be considered a programming language as well), it contains all kinds of tokens: identifiers, numbers, strings and... the spaces between them. Spaces, new lines, tab characters - we don't really need them when we're talking about tokenizing and parsing. Imagine that you have to recognize a JavaScript function definition. We can "try" to do it like this:

var tokenizer = new Tokenizer();

tokenizer.addToken('FUNCTION', /function\b/);
tokenizer.addToken('ID', /[a-zA-Z]+/);
tokenizer.addToken('LPAREN', '(');
tokenizer.addToken('RPAREN', ')');
tokenizer.addToken('LBRACE', '{');
tokenizer.addToken('RBRACE', '}');

tokenizer.tokenize('function foo() {');

if (tokenizer.next(tokenizer.FUNCTION)) {
    var functionName = tokenizer.next(tokenizer.ID);
    if (!functionName) throw 'FUNCTION NAME EXPECTED';
    // draw the rest of the fucking owl
}

But of course this will throw an exception, because there is no ID right after FUNCTION - there is a space character instead. How do we deal with that? You can easily fix it by consuming this space character right after the function keyword:

var tokenizer = new Tokenizer();

tokenizer.addToken('FUNCTION', /function\b/);
tokenizer.addToken('ID', /[a-zA-Z]+/);
tokenizer.addToken('LPAREN', '(');
tokenizer.addToken('RPAREN', ')');
tokenizer.addToken('LBRACE', '{');
tokenizer.addToken('RBRACE', '}');
tokenizer.addToken('WHITESPACES', /[\x09\x0A\x0D\x20]+/);

tokenizer.tokenize('function foo() {');

if (tokenizer.next(tokenizer.FUNCTION)) {
    // consume and throw away the WHITESPACES token
    tokenizer.next(tokenizer.WHITESPACES);
    var functionName = tokenizer.next(tokenizer.ID);
    if (!functionName) throw 'FUNCTION NAME EXPECTED';
    console.info('function name is =', functionName.value);
    // draw the rest of the fucking owl
}

and it will work! But why should you spend your own time consuming tokens that are useless for the final result? Well, you shouldn't, so we'll have to come up with some solution: ignored tokens. They won't be returned by our shiny tokenizer unless we intentionally ask the tokenizer to retrieve them:

var tokenizer = new Tokenizer();

tokenizer.addToken('FUNCTION', /function\b/);
tokenizer.addToken('ID', /[a-zA-Z]+/);
tokenizer.addToken('LPAREN', '(');
tokenizer.addToken('RPAREN', ')');
tokenizer.addToken('LBRACE', '{');
tokenizer.addToken('RBRACE', '}');
tokenizer.ignore('WHITESPACES', /[\x09\x0A\x0D\x20]+/);

tokenizer.tokenize('function foo() {');

if (tokenizer.next(tokenizer.FUNCTION)) {
    // don't need this anymore
    // console.info(tokenizer.next(tokenizer.WHITESPACES));
    var functionName = tokenizer.next(tokenizer.ID);
    if (!functionName) throw 'FUNCTION NAME EXPECTED';
    console.info('function name is =', functionName.value);
    // draw the rest of the fucking owl
}

As I've mentioned, when we really need to check for an ignored token, we can still do that, check it out:

var tokenizer = new Tokenizer();

tokenizer.addToken('FUNCTION', /function\b/);
tokenizer.addToken('ID', /[a-zA-Z]+/);
tokenizer.addToken('LPAREN', '(');
tokenizer.addToken('RPAREN', ')');
tokenizer.addToken('LBRACE', '{');
tokenizer.addToken('RBRACE', '}');
tokenizer.ignore('WHITESPACES', /[\x09\x0A\x0D\x20]+/);

tokenizer.tokenize('function foo() {');

if (tokenizer.next(tokenizer.FUNCTION)) {
    if (tokenizer.next(tokenizer.WHITESPACES))
        console.info('FOUND WHITESPACES!');
    var functionName = tokenizer.next(tokenizer.ID);
    if (!functionName) throw 'FUNCTION NAME EXPECTED';
    console.info('function name is =', functionName.value);
    // draw the rest of the fucking owl
}
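
I won't paste the full implementation of "ignore" here (the complete code lives in the repo from the last section and may do this differently), but the idea is simple: keep a small lookup table of ignored token ids and make "next" and "test" skip over them unless you ask for that exact id. A rough sketch, reusing tokenBuffer, addToken and readTokenToBuffer from the code above:

// these would sit inside the Tokenizer function, next to addToken and friends
var ignoredIds = {};

function ignore(tokenId, expression) {
    // an ignored token is defined like any other token...
    addToken.call(this, tokenId, expression);
    // ...but its id is also remembered in the lookup table
    ignoredIds[this[tokenId]] = true;
}

function skipIgnored(wanted) {
    for (;;) {
        if (!tokenBuffer.length) readTokenToBuffer();
        var type = tokenBuffer[0].type;
        // stop on the token the caller explicitly asked for,
        // or on any token that is not ignored (T_EOF and T_ERR included)
        if (type === wanted || !ignoredIds[type]) return;
        // otherwise silently drop it and look at the next one
        tokenBuffer.shift();
    }
}

function next(type) {
    skipIgnored(type);
    if (!arguments.length || tokenBuffer[0].type === type) {
        return tokenBuffer.shift();
    }
}

function test(type) {
    skipIgnored(type);
    return tokenBuffer[0].type === type;
}

// "ignore" also gets exposed in the returned object
// next to addToken, tokenize, next and test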

4. Get rid of redundant token names


All this stuff like "tokenizer.next(tokenizer.FUNCTION)" - ohh, it takes ages to write. Can we optimize it somehow? When you ask the tokenizer for a FUNCTION, what do you really need? Right, you literally need "function", and the same goes for "}", "(", "+", "-" and many, many other characters. Let's change the way we define tokens:

var tokenizer = new Tokenizer();

// don't need to define following tokens with names
tokenizer.match(/function\b/);
tokenizer.match('(');
tokenizer.match(')');
tokenizer.match('{');
tokenizer.match('}');
// ignore spaces
tokenizer.ignore(/[\x09\x0A\x0D\x20]+/);
// define named tokens
tokenizer.match('ID', /[a-zA-Z]+/);

tokenizer.tokenize('function foo() {');

// check if next token is "function"
if (tokenizer.next('function')) {
    var functionName = tokenizer.next(tokenizer.ID);
    if (!functionName) throw 'FUNCTION NAME EXPECTED';
    if (!tokenizer.next('(')) throw '( EXPECTED';
    if (!tokenizer.next(')')) throw ') EXPECTED';
    if (!tokenizer.next('{')) throw '{ EXPECTED';
    console.info('function name is =', functionName.value);
    // draw the rest of the fucking owl
}

Looks much cleaner now, right?
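
Again, the real implementation lives in the repo (and may take a different approach), but the idea fits in a few lines: "match" with a single argument registers an anonymous token, and "next" additionally accepts the literal text of a token, comparing it against the token's value. A rough sketch on top of the previous one:

// "match" either defines a named token or an anonymous one
function match(name, expression) {
    if (arguments.length === 1) {
        // only an expression was given: register it as an anonymous token,
        // using its own source text as the name
        expression = name;
        name = String(name);
    }
    addToken.call(this, name, expression);
}
// ("ignore" can get the same one-argument treatment)

// "next" now also accepts the literal text of a token, so next('function'),
// next('(') and next('{') all work; "test" gets the same treatment
function next(type) {
    // skipIgnored is the helper from the previous sketch
    skipIgnored(type);
    var token = tokenBuffer[0];
    if (!arguments.length ||
        token.type === type ||
        (typeof type === 'string' && token.value === type)) {
        return tokenBuffer.shift();
    }
}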

5. Cleanup


After implementing everything I've been talking about in the previous parts, the tokenizer code became too big for this article, so I've created a GitHub repo with it. In order to use it you'll have to load it with some AMD loader, such as RequireJS. The following code represents a simple expression parser made with the tokenizer described in this article. Check it out:
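
To give you the idea, here is a tiny sketch in the same spirit - a recursive descent calculator for arithmetic expressions built on the match/ignore/next API described above (the actual example shipped with the repo may differ):

var tokenizer = new Tokenizer();

// anonymous operator and bracket tokens
tokenizer.match('+');
tokenizer.match('-');
tokenizer.match('*');
tokenizer.match('/');
tokenizer.match('(');
tokenizer.match(')');
// named token for numbers
tokenizer.match('NUMBER', /[0-9]+/);
// whitespace is tokenized but never returned
tokenizer.ignore(/[\x09\x0A\x0D\x20]+/);

// expr   := term   (("+" | "-") term)*
// term   := factor (("*" | "/") factor)*
// factor := NUMBER | "(" expr ")"

function parseExpression() {
    var result = parseTerm();
    for (;;) {
        if (tokenizer.next('+')) result += parseTerm();
        else if (tokenizer.next('-')) result -= parseTerm();
        else return result;
    }
}

function parseTerm() {
    var result = parseFactor();
    for (;;) {
        if (tokenizer.next('*')) result *= parseFactor();
        else if (tokenizer.next('/')) result /= parseFactor();
        else return result;
    }
}

function parseFactor() {
    if (tokenizer.next('(')) {
        var result = parseExpression();
        if (!tokenizer.next(')')) throw ') EXPECTED';
        return result;
    }
    var number = tokenizer.next(tokenizer.NUMBER);
    if (!number) throw 'NUMBER OR ( EXPECTED';
    return parseInt(number.value, 10);
}

tokenizer.tokenize('2 + 3 * (4 - 1)');
// should print 11
console.info(parseExpression());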



In the next articles (and I hope it's going to be soon!), we'll begin to build a simple template parser based on the tokenizer that we've made. Please let me know if you have any questions.
