A-Lexer not Alexa

The first part of parsing is lexing: we need to build something that iterates over the characters of a string and turns them into tokens that the rest of the program can understand.

We’re now going to get into parsing JSON. There are three kinds of token to think about when parsing JSON: single-character tokens, multi-character tokens and special tokens.

Single Character Tokens

Firstly we’re going to define the single character tokens and what we’re naming those in code

Token Name | Token
LPAREN     | {
RPAREN     | }
LBRACKET   | [
RBRACKET   | ]
COMMA      | ,
COLON      | :

These tokens make up the structure of any JSON that has been written or generated by a user. For example, the input {"a": [1, 2]} contains, in order, the structural tokens LPAREN, COLON, LBRACKET, COMMA, RBRACKET and RPAREN.
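As a rough sketch of how a lexer classifies these characters (the function name is hypothetical, and it returns the token name as a plain string because the token package only appears later in this article), a simple switch over the current character is enough:

package lexer

// classifyStructural is a hypothetical helper showing how a lexer might map a
// single structural character to its token name; anything else falls through
// to INVALID.
func classifyStructural(ch byte) string {
    switch ch {
    case '{':
        return "LPAREN"
    case '}':
        return "RPAREN"
    case '[':
        return "LBRACKET"
    case ']':
        return "RBRACKET"
    case ',':
        return "COMMA"
    case ':':
        return "COLON"
    default:
        return "INVALID"
    }
}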

Multi Character Tokens

The multi-character tokens consist, as the name might suggest, of more than one character. They serve the same structural purpose as the other tokens, but they also carry a value: lexing the input true should produce a single TBOOL token, for example.

Token Name | Token
TBOOL      | true
FBOOL      | false
NULL       | null

Special Tokens

Finally we get to special tokens. Special tokens are special because they start and end with a known character but their contents are not fixed. As an example, a string starts with a quote mark " and ends with the same mark, yet the contents in between can be anything.

The special tokens look like the following.

Token Name | Start      | End   | Valid Contents
STRING     | "          | "     | *
NUMBER     | - or digit | digit | ., +, -, e, E or digit
INVALID    |            |       |
EOF        |            |       |

The special token of note is the invalid token. This is the default token returned when nothing else matches, which is incredibly important because we need to catch invalid input. In JSON all identifiers are strings, so we don’t need to worry about parsing generic identifier tokens.
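To make the string case concrete, here is a minimal sketch of how a lexer might consume a STRING token; the function and package names are hypothetical (not code from this article), and escape sequences are skipped for brevity:

package lexer

// readString is a hypothetical helper: given the position of the opening '"',
// it scans forward to the closing '"' and returns the contents in between,
// plus the position just past the closing quote. Escape sequences and
// unterminated strings are not handled in this sketch.
func readString(input string, pos int) (literal string, next int) {
    start := pos + 1 // skip the opening quote
    i := start
    for i < len(input) && input[i] != '"' {
        i++
    }
    return input[start:i], i + 1
}

A NUMBER token can be read in much the same way, scanning while the current character is a digit or one of the characters listed in the table above.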

Characters to Tokens

In order to convert between characters and tokens we need some structure that lets us store this information in memory. The following code is an example of how to build that type.

package token

// Node - a single token produced by the lexer
type Node struct {
    Type    Type
    Literal string
}

// Type - the token type
type Type string

const (
    LPAREN   Type = "LPAREN"
    RPAREN   Type = "RPAREN"
    LBRACKET Type = "LBRACKET"
    RBRACKET Type = "RBRACKET"
    COMMA    Type = "COMMA"
    COLON    Type = "COLON"

    TBOOL Type = "TBOOL"
    FBOOL Type = "FBOOL"
    NULL  Type = "NULL"

    STRING  Type = "STRING"
    NUMBER  Type = "NUMBER"
    INVALID Type = "INVALID"
    EOF     Type = "EOF" // end of the input
)

What this gives us is a type, Node, with two fields: Type and Literal. Type is the type of the token and Literal is the literal string that was read from the input.
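As a quick usage sketch (the import path, and the choice to store string contents without the surrounding quotes, are assumptions rather than something from this article), a lexer working through {"a": true} would produce Node values like these:

package main

import (
    "fmt"

    "example.com/jsonparser/token" // assumption: module path to the token package above
)

func main() {
    // The tokens a lexer would emit for the input {"a": true}.
    toks := []token.Node{
        {Type: token.LPAREN, Literal: "{"},
        {Type: token.STRING, Literal: "a"}, // contents between the quotes
        {Type: token.COLON, Literal: ":"},
        {Type: token.TBOOL, Literal: "true"},
        {Type: token.RPAREN, Literal: "}"},
    }

    for _, t := range toks {
        fmt.Printf("%-8s %q\n", t.Type, t.Literal)
    }
}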

The only other thing we need to add is a keyword lookup from literal to token type. This is a continuation of the existing file.

// keywords maps the multi-character literals to their token types.
var keywords = map[string]Type{
    "true":  TBOOL,
    "false": FBOOL,
    "null":  NULL,
}

// LookupType returns the token type for a word, or INVALID if the word is not
// a known keyword.
func LookupType(word string) Type {
    if tok, ok := keywords[word]; ok {
        return tok
    }

    return INVALID
}

What this new section says is that we try to look up a multi-character word in the keywords map and, if it isn’t there, return INVALID.
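To show how this fits into a lexer, here is a minimal sketch of classifying a run of letters with LookupType; the lexer package, the readWord helper and the import path are assumptions, not part of this article's code:

package lexer

import "example.com/jsonparser/token" // assumption: module path to the token package

// readWord is a hypothetical helper that scans a run of lowercase letters
// starting at pos and classifies it with token.LookupType, so "true", "false"
// and "null" become TBOOL, FBOOL and NULL, and anything else becomes INVALID.
func readWord(input string, pos int) (token.Node, int) {
    i := pos
    for i < len(input) && input[i] >= 'a' && input[i] <= 'z' {
        i++
    }
    word := input[pos:i]
    return token.Node{Type: token.LookupType(word), Literal: word}, i
}

For example, readWord("null", 0) returns a Node of type NULL with the literal "null", while readWord("nil", 0) returns an INVALID token.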