Runes to Tokens

In the last post we looked at the token definitions. In this post, we’re going to look at converting strings of text into those token nodes in the lexer package.

Runes vs Bytes

In Go there are two primitive types that can represent character data: the rune and the byte. For standard UK/US English text they are very similar and offer very little difference in functionality from one another.

They are fundamentally different, though. We’re going to use runes because a rune holds a full Unicode code point, whereas a byte can only hold a single ASCII character. We need to support Unicode in our parser because the JSON standard supports it, which enables things such as emojis to be used in names and values.
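
The difference is easy to see in a quick standalone sketch (not part of the lexer): converting a string to a []rune gives one element per character, while len on the string itself counts bytes.

package main

import "fmt"

func main() {
	s := "héllo 🚀"

	// len counts bytes: 'é' takes two bytes in UTF-8 and the emoji takes four
	fmt.Println(len(s)) // 11

	// a []rune has one element per character, which is what the lexer wants
	fmt.Println(len([]rune(s))) // 7
}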

The Lexer Structure#

In this section we’re going to create the lexer struct, which is responsible for turning runes into tokens.

Let’s start with the lexer struct.

package lexer

// Lexer - the lexer object
type Lexer struct {
	buffer       []rune // the full input as runes
	ch           rune   // the rune currently being examined
	position     int    // index of ch within the buffer
	nextPosition int    // index of the next rune to be read
	currentChar  int    // character position on the current line
	currentLine  int    // line number, used for error reporting
}

// New - create a new lexer
func New(obj string) Lexer {
	l := Lexer{
		buffer: []rune(obj),
	}

	// prime the lexer so that ch holds the first rune
	l.readChar()

	return l
}

The structure holds a slice of the runes that we’re trying to parse. It also stores references to where it’s currently got to in the input.

The New function takes a string and initialises the lexer structure, calling readChar once so that the first rune is ready to be examined.
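
The fields are unexported, so you can’t poke at them from outside the package, but it’s worth sketching the state you end up with:

l := lexer.New("{}")

// at this point:
//   l.buffer == []rune{'{', '}'}
//   l.ch     == '{'   (readChar has already loaded the first rune)
//   l.position == 0, l.nextPosition == 1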

The Tools

There are a couple of tools that we need to write for the lexer before we can lex the runes that are provided.

The tools that we need to write are isLetter and isDigit: isLetter will feed into reading the identifiers (true, false and null), and isDigit will feed into reading the numbers.

Open up a tools.go file:

package lexer

import "unicode"

func isLetter(check rune) bool {
	return unicode.IsLetter(check)
}

func isDigit(check rune) bool {
	return unicode.IsDigit(check) || check == '.' || check == '-' || check == '+' || check == 'e' || check == 'E'
}

In JSON, a number can contain more than just the digits 0-9: “.”, “-”, “+”, “e” and “E” are also valid, because numbers can be written in exponent notation (for example 1.5e-3).
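
A quick standalone sketch shows the check in action (the helper is duplicated here because the real one is unexported):

package main

import (
	"fmt"
	"unicode"
)

// the same check as the lexer's isDigit helper
func isDigit(check rune) bool {
	return unicode.IsDigit(check) || check == '.' || check == '-' ||
		check == '+' || check == 'e' || check == 'E'
}

func main() {
	for _, number := range []string{"42", "-17", "3.14", "1.5e-3", "2E+8"} {
		valid := true
		for _, r := range number {
			if !isDigit(r) {
				valid = false
			}
		}
		fmt.Printf("%-7s every rune accepted: %v\n", number, valid)
	}
}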

These tools will come in handy when writing the reader functions.

The Tokeniser File

The next thing that we’re going to write is the tokeniser. This file is responsible for looking at the current rune and deciding which type to lex to; for example, if we see “{” then we lex to type LPAREN.

The tokeniser file is basically one big switch/case statement. It looks like the following:

package lexer

import "github.com/jecrocker/goson/token"

// NextToken - Get the next token
func (l *Lexer) NextToken() token.Node {
	var node token.Node
	l.skipWhitespace()

	startChar := l.currentChar
	startLine := l.currentLine

	switch l.ch {
	case 0:
		node.Type = token.EOF
		node.Literal = ""
	case '{':
		node.Type = token.LPAREN
		node.Literal = "{"
	case '}':
		node.Type = token.RPAREN
		node.Literal = "}"
	case '[':
		node.Type = token.LBRACKET
		node.Literal = "["
	case ']':
		node.Type = token.RBRACKET
		node.Literal = "]"
	case ',':
		node.Type = token.COMMA
		node.Literal = ","
	case ':':
		node.Type = token.COLON
		node.Literal = ":"
	case '"':
		node.Type = token.STRING
		node.Literal = l.readString()
	case '-', '1', '2', '3', '4', '5', '6', '7', '8', '9', '0':
		node.Type = token.NUMBER
		node.Literal = l.readNumber()
	default:
		tok := l.readIdentifier()
		node.Type = token.LookupType(tok)
		node.Literal = tok
	}

	// every branch above leaves l.ch on the final rune of its token,
	// so we advance exactly once here
	l.readChar()

	node.Char = startChar
	node.Line = startLine

	return node
}

You’ll notice at the bottom we also attach the start character and start line to the token node. This information is important when we validate the tree in the parser, as it lets us tell the developer/user exactly where an error is.

The exception to the rules in this section is the default case: if we’re reading an identifier then we need to read the whole identifier before looking up its type.
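
With NextToken in place we can already walk a document token by token. Here’s a minimal sketch, assuming the token.Node from the last post exposes the Type, Literal, Line and Char fields used above:

package main

import (
	"fmt"

	"github.com/jecrocker/goson/lexer"
	"github.com/jecrocker/goson/token"
)

func main() {
	l := lexer.New(`{"name": "goson", "stars": 42}`)

	// keep pulling tokens until the lexer signals the end of the input
	for tok := l.NextToken(); tok.Type != token.EOF; tok = l.NextToken() {
		fmt.Printf("line %d, char %d: %v %q\n", tok.Line, tok.Char, tok.Type, tok.Literal)
	}
}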

The Reader File

The final part of lexing is the readers, which handle the tokens that are longer than a single rune. Each reader is careful to stop with l.ch on the last rune of its token, so that the single readChar at the bottom of NextToken lands us cleanly on whatever comes next. The readers look like the following:

package lexer

import (
	"fmt"
)

func (l *Lexer) readChar() {
	if l.nextPosition >= len(l.buffer) {
		// past the end of the input: 0 is our EOF marker
		l.ch = 0
	} else {
		l.ch = l.buffer[l.nextPosition]
	}

	l.position = l.nextPosition
	l.nextPosition++
	l.currentChar++
}

func (l *Lexer) readString() string {
	str := "\""
	for l.peekChar() != '"' && l.peekChar() != 0 {
		l.readChar()
		str = fmt.Sprintf("%s%c", str, l.ch)
		if l.ch == '\\' {
			// keep whatever follows a backslash, even an escaped quote
			l.readChar()
			str = fmt.Sprintf("%s%c", str, l.ch)
		}
	}
	l.readChar() // step onto the closing quote
	str = fmt.Sprintf("%s\"", str)
	return str
}

func (l *Lexer) readIdentifier() string {
	// start with the current rune, then consume while the next rune is a letter
	ident := fmt.Sprintf("%c", l.ch)
	for isLetter(l.peekChar()) {
		l.readChar()
		ident = fmt.Sprintf("%s%c", ident, l.ch)
	}

	return ident
}

func (l *Lexer) readNumber() string {
	// start with the current rune, then consume while the next rune
	// still looks like part of a number
	number := fmt.Sprintf("%c", l.ch)

	for isDigit(l.peekChar()) {
		l.readChar()
		number = fmt.Sprintf("%s%c", number, l.ch)
	}

	return number
}

func (l *Lexer) skipWhitespace() {
	for l.ch == ' ' || l.ch == '\t' || l.ch == '\n' || l.ch == '\r' {
		if l.ch == '\n' || (l.ch == '\r' && l.peekChar() != '\n') {
			// a new line starts: reset the character count,
			// counting a "\r\n" pair as a single line ending
			l.currentLine++
			l.currentChar = 0
		}
		l.readChar()
	}
}

// peekChar - look at the next rune without consuming it
func (l *Lexer) peekChar() rune {
	if l.nextPosition >= len(l.buffer) {
		return 0
	}
	return l.buffer[l.nextPosition]
}

All of these methods are involved in stepping through the string being lexed, and each plays a distinct role.

readChar advances the lexer onto the next rune in the buffer, while peekChar lets us look at the next rune without consuming it; the readers lean on peekChar so that they always stop on the last rune of their token.
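
To make the bookkeeping concrete, here’s how the internal state moves when lexing the input “10,” (a sketch, since the fields are unexported):

// after New():   ch = '1'   position = 0   nextPosition = 1
// peekChar()     returns '0' without moving anything
// readChar():    ch = '0'   position = 1   nextPosition = 2
// readChar():    ch = ','   position = 2   nextPosition = 3
// readChar():    ch = 0     (past the end: our EOF marker)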

readIdentifier, readString and readNumber are for reading the multi-rune token types or, in the case of identifiers, reading true, false and null.
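
That’s the whole lexer. As a final sanity check, here’s a minimal test sketch; it lives in the lexer package and only compares literals, so it assumes nothing about the token package beyond the Literal field:

package lexer

import "testing"

func TestNextToken(t *testing.T) {
	l := New(`{"id": 7}`)

	// the literals we expect back, in order
	want := []string{"{", `"id"`, ":", "7", "}"}

	for _, literal := range want {
		tok := l.NextToken()
		if tok.Literal != literal {
			t.Fatalf("expected literal %q, got %q", literal, tok.Literal)
		}
	}
}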