Runes to Tokens
In the last post we looked at the token definitions. In this post we're going to look at converting strings of text into those token nodes in the lexer package.
Runes vs Bytes#
In Go there are two primitive types that can represent character data: bytes and runes. For standard UK/US English text they are very similar and offer very little difference in functionality from one another.
Under the hood they are fundamentally different though. A byte is a single 8-bit value, so on its own it can only represent ASCII, whereas a rune is a full Unicode code point. We're going to use runes in our lexer because the JSON standard supports Unicode, which enables things such as emojis to be used in names and values.
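As a quick standalone illustration (this snippet isn't part of the parser, it's just here to show the difference), indexing a string in Go works in bytes, while converting it to a rune slice works in code points:

package main

import "fmt"

func main() {
	s := "a😀" // 'a' is one byte; the emoji is four bytes in UTF-8

	fmt.Println(len(s))         // 5 - len counts bytes
	fmt.Println(len([]rune(s))) // 2 - converting to []rune counts code points

	// Ranging over a string decodes runes, not bytes
	for i, r := range s {
		fmt.Printf("byte offset %d: %c\n", i, r) // offsets 0 and 1
	}
}

This is why the lexer converts its input to []rune once, up front: it can then step through the text one code point at a time without worrying about multi-byte characters.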
The Lexer Structure#
In this section we're going to create the lexer struct, which is responsible for turning runes into tokens.
Let’s start with the lexer struct.
package lexer

// Lexer - the lexer object
type Lexer struct {
	buffer       []rune // the full input, converted to runes up front
	ch           rune   // the rune currently being examined
	position     int    // index of ch within buffer
	nextPosition int    // index of the rune after ch
	currentChar  int    // character position within the current line
	currentLine  int    // line number, used for error reporting
}

// New - create a new lexer
func New(obj string) Lexer {
	l := Lexer{
		buffer: []rune(obj),
	}

	// Load the first rune so the lexer is ready to use
	l.readChar()

	return l
}
The struct holds a slice of the runes that we're trying to parse, along with bookkeeping for how far through them the lexer has got.
The New function takes a string, converts it to runes, and initialises the lexer with the first rune already loaded.
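As a quick sketch of how it will be used (the import path assumes the same goson module as the token package imported later in this post):

package main

import "github.com/jecrocker/goson/lexer"

func main() {
	// The input string is converted to []rune once, up front
	l := lexer.New(`{"name": "gopher"}`)
	_ = l // we'll start pulling tokens out of it below
}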
The Tools#
There are a couple of helper tools that we need to write before the lexer can lex the runes it's given.
These are isLetter and isDigit: isLetter will feed into reading the identifiers (true, false and null), and isDigit will feed into reading the numbers.
Open up a tools.go file:
package lexer
import "unicode"
func isLetter(check rune) bool {
return unicode.IsLetter(check)
}
func isDigit(check rune) bool {
return unicode.IsDigit(check) || check == '.' || check == '-' || check == '+' || check == 'e' || check == 'E'
}
In JSON a number can contain more than just the digits 0-9: ".", "-", "+", "e" and "E" are all valid too, because numbers can be written in exponent notation (for example, 1.5e-3).
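A small table-driven check makes the accepted character set explicit (a sketch; the test file and test name are my own):

package lexer

import "testing"

func TestIsDigit(t *testing.T) {
	// Every rune that can appear in a JSON number should be accepted
	for _, r := range "0123456789.-+eE" {
		if !isDigit(r) {
			t.Errorf("isDigit(%q) = false, want true", r)
		}
	}

	// Anything else should be rejected
	for _, r := range "abc{}:,\"" {
		if isDigit(r) {
			t.Errorf("isDigit(%q) = true, want false", r)
		}
	}
}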
These tools will come in handy when writing the reader functions.
The Tokeniser File#
The next thing that we're going to write is the tokeniser. This file is responsible for looking at the current rune and deciding which token type to emit; for example, when we see "{" we emit a token of type LPAREN.
The tokeniser file is basically one big switch/case statement. It looks like the following:
package lexer

import "github.com/jecrocker/goson/token"

// NextToken - Get the next token
func (l *Lexer) NextToken() token.Node {
	var node token.Node

	l.skipWhitespace()

	// Record where the token starts, for error reporting in the parser
	node.Char = l.currentChar
	node.Line = l.currentLine

	switch l.ch {
	case 0:
		node.Type = token.EOF
		node.Literal = ""
	case '{':
		node.Type = token.LPAREN
		node.Literal = "{"
	case '}':
		node.Type = token.RPAREN
		node.Literal = "}"
	case '[':
		node.Type = token.LBRACKET
		node.Literal = "["
	case ']':
		node.Type = token.RBRACKET
		node.Literal = "]"
	case ',':
		node.Type = token.COMMA
		node.Literal = ","
	case ':':
		node.Type = token.COLON
		node.Literal = ":"
	case '"':
		node.Type = token.STRING
		node.Literal = l.readString()
		// readString has already advanced past the closing quote, so
		// return here before the readChar below skips a character
		return node
	case '-', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9':
		node.Type = token.NUMBER
		node.Literal = l.readNumber()
		// readNumber stops on the first rune after the number
		return node
	default:
		tok := l.readIdentifier()
		if tok == "" {
			// Not a letter at all: consume the unknown rune so the
			// lexer always makes progress
			l.readChar()
		}
		node.Type = token.LookupType(tok)
		node.Literal = tok
		// readIdentifier stops on the first rune after the identifier
		return node
	}

	// Single-character tokens: step past the rune we just consumed
	l.readChar()

	return node
}
You'll notice we record the start character and start line on the token node before the switch. This is important information: when we validate the tree in the parser, it lets us tell the developer/user exactly where the errors are.
The multi-rune tokens (strings, numbers and identifiers) return early, because their read functions have already advanced the lexer past the end of the token; only the single-character tokens fall through to the final readChar. The default case is also where identifiers are handled: we have to read the whole identifier before we can look up its type.
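Once the readers below are written, we can drive the lexer in a loop until it reports EOF. A sketch of what that looks like (not part of the package itself, and it assumes the module path used above):

package main

import (
	"fmt"

	"github.com/jecrocker/goson/lexer"
	"github.com/jecrocker/goson/token"
)

func main() {
	l := lexer.New(`{"count": 3}`)

	// Keep pulling tokens until the lexer signals the end of input
	for tok := l.NextToken(); tok.Type != token.EOF; tok = l.NextToken() {
		fmt.Printf("line %d, char %d: %v %q\n", tok.Line, tok.Char, tok.Type, tok.Literal)
	}
}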
The Reader File#
The final part of lexing is the readers; these handle the tokens that are longer than a single character. The readers look like the following:
package lexer

import (
	"fmt"
)

// readChar - advance the lexer by one rune
func (l *Lexer) readChar() {
	if l.nextPosition >= len(l.buffer) {
		// Past the end of the buffer: 0 signals EOF
		l.ch = 0
	} else {
		l.ch = l.buffer[l.nextPosition]
	}

	l.position = l.nextPosition
	l.nextPosition++
	l.currentChar++
}

// readString - read a string literal, including its surrounding quotes
func (l *Lexer) readString() string {
	str := "\""
	l.readChar()

	for l.ch != '"' && l.ch != 0 {
		str = fmt.Sprintf("%s%c", str, l.ch)

		// A backslash escapes the next rune, so consume that too;
		// this stops an escaped quote from terminating the loop
		if l.ch == '\\' {
			l.readChar()
			str = fmt.Sprintf("%s%c", str, l.ch)
		}

		l.readChar()
	}

	// Step past the closing quote
	l.readChar()

	str = fmt.Sprintf("%s\"", str)
	return str
}

// readIdentifier - read a run of letters (true, false, null)
func (l *Lexer) readIdentifier() string {
	ident := ""
	for isLetter(l.ch) {
		ident = fmt.Sprintf("%s%c", ident, l.ch)
		l.readChar()
	}
	return ident
}

// readNumber - read a run of runes that are valid in a JSON number
func (l *Lexer) readNumber() string {
	number := ""
	for isDigit(l.ch) {
		number = fmt.Sprintf("%s%c", number, l.ch)
		l.readChar()
	}
	return number
}

// skipWhitespace - move past any whitespace, tracking line numbers as we go
func (l *Lexer) skipWhitespace() {
	for l.ch == ' ' || l.ch == '\t' || l.ch == '\n' || l.ch == '\r' {
		// Count "\r\n" as a single new line by only counting the '\n',
		// while still counting a lone '\r'
		if l.ch == '\n' || (l.ch == '\r' && l.peekChar() != '\n') {
			l.currentLine++
			l.currentChar = 0
		}
		l.readChar()
	}
}

// peekChar - look at the next rune without consuming it
func (l *Lexer) peekChar() rune {
	if l.nextPosition >= len(l.buffer) {
		return 0
	}
	return l.buffer[l.nextPosition]
}
All of these methods work together to walk the lexer through the input string.
readChar advances the lexer and reads the next rune in the buffer, while peekChar looks at the next rune without consuming it; skipWhitespace uses it so that a "\r\n" pair only counts as a single new line.
readString, readIdentifier and readNumber read the tokens that span multiple runes, which in the case of identifiers means reading true, false and null.
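To finish off, here's a short end-to-end test of the lexer. This is a sketch: the token.Type name and the exact token type names, such as token.TRUE, depend on the definitions from the previous post, so adjust them to match yours.

package lexer

import (
	"testing"

	"github.com/jecrocker/goson/token"
)

func TestNextToken(t *testing.T) {
	l := New(`{"valid": true}`)

	expected := []struct {
		typ     token.Type
		literal string
	}{
		{token.LPAREN, "{"},
		{token.STRING, `"valid"`},
		{token.COLON, ":"},
		{token.TRUE, "true"},
		{token.RPAREN, "}"},
		{token.EOF, ""},
	}

	for i, want := range expected {
		got := l.NextToken()
		if got.Type != want.typ || got.Literal != want.literal {
			t.Fatalf("token %d: got (%v, %q), want (%v, %q)",
				i, got.Type, got.Literal, want.typ, want.literal)
		}
	}
}

Note that readString keeps the surrounding quotes in the literal, which is why the expected string literal is `"valid"` rather than valid.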