SQLite — Tokenizer

What the Tokenizer Does

The tokenizer is a hand-coded deterministic finite automaton (DFA) in tokenize.c. It scans the SQL string one character at a time and classifies each lexeme into a token type constant (e.g. TK_SELECT, TK_ID, TK_INTEGER).

SQLite's tokenizer does not run as a separate lexing pass; it is tightly coupled to the parser, and sqlite3RunParser() drives both. Tokenization is incremental: each call to sqlite3GetToken() classifies exactly one token and returns its length, and that token is then fed straight to the LEMON parser engine.

Design choice: The tokenizer and parser run in a single pass. No separate tokenization phase builds a token list, so no memory is allocated to buffer tokens between the two stages.
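
The single-pass design can be illustrated with a small standalone sketch. It is not SQLite code: next_token(), run_single_pass(), the token_sink callback, and the T_* codes are all hypothetical names invented for this example, and the sink merely stands in for the parser.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical token codes for this sketch; not SQLite's TK_* values. */
enum { T_END = 0, T_SPACE, T_WORD, T_NUMBER, T_PUNCT };

/* Classify the token that starts at z and return its length in bytes. */
static size_t next_token(const char *z, int *type){
  size_t i = 1;
  assert( z!=NULL && type!=NULL );
  if( z[0]==0 ){ *type = T_END; return 0; }
  if( isspace((unsigned char)z[0]) ){
    while( isspace((unsigned char)z[i]) ) i++;
    *type = T_SPACE;
  }else if( isdigit((unsigned char)z[0]) ){
    while( isdigit((unsigned char)z[i]) ) i++;
    *type = T_NUMBER;
  }else if( isalpha((unsigned char)z[0]) || z[0]=='_' ){
    while( isalnum((unsigned char)z[i]) || z[i]=='_' ) i++;
    *type = T_WORD;
  }else{
    *type = T_PUNCT;             /* any other single character */
  }
  return i;
}

/* Consumer callback; stands in for the parser in this sketch. */
typedef void (*token_sink)(void *ctx, int type, const char *z, size_t n);

/* Pull one token at a time and hand it straight to the sink.
** No token array is ever built. Returns the number of tokens fed. */
static int run_single_pass(const char *zSql, token_sink sink, void *ctx){
  int type, nFed = 0;
  for(;;){
    size_t n = next_token(zSql, &type);
    if( type==T_END ) break;
    if( type!=T_SPACE ){ sink(ctx, type, zSql, n); nFed++; }
    zSql += n;                   /* advance past the token */
  }
  return nFed;
}

/* Example sink: counts word tokens in the int that ctx points to. */
static void count_words(void *ctx, int type, const char *z, size_t n){
  (void)z; (void)n;
  if( type==T_WORD ) (*(int*)ctx)++;
}
```

Passing the string SELECT name FROM users WHERE id = 1 to run_single_pass() with count_words as the sink feeds eight tokens and counts six words, and the only state held between tokens is the current position in the input.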

Key Functions

tokenize.c:600 — sqlite3RunParser: drives the tokenize+parse loop

int sqlite3RunParser(Parse *pParse, const char *zSql){
  /* Abridged: length limits, window-function keywords, and error
  ** reporting are omitted. */
  int nErr = 0;
  void *pEngine;               /* LEMON-generated LALR(1) parser */
  i64 n = 0;                   /* Length of the next token */
  int tokenType;               /* type of the next token */
  int lastTokenParsed = -1;    /* type of the previous token */

  pParse->rc = SQLITE_OK;
  pParse->zTail = zSql;

  /* Allocate the LEMON parser object; pParse is bound to it here */
  pEngine = sqlite3ParserAlloc(sqlite3Malloc, pParse);

  /* ── Main tokenization loop ── */
  while( 1 ){
    n = sqlite3GetToken((u8*)zSql, &tokenType); /* classify next token */

    if( tokenType==TK_SPACE ){ zSql += n; continue; } /* never parsed */
    if( zSql[0]==0 ){
      /* End of input: feed TK_SEMI (unless just sent), then 0 */
      if( lastTokenParsed==TK_SEMI ){ tokenType = 0; }
      else if( lastTokenParsed==0 ){ break; }
      else{ tokenType = TK_SEMI; }
      n = 0;
    }else if( tokenType==TK_ILLEGAL ){
      nErr++;
      break;
    }

    pParse->sLastToken.z = zSql;          /* token text and length */
    pParse->sLastToken.n = (u32)n;
    sqlite3Parser(pEngine, tokenType, pParse->sLastToken); /* feed LALR engine */

    lastTokenParsed = tokenType;
    zSql += n;                            /* advance pointer past token */
    if( pParse->rc!=SQLITE_OK ) break;    /* error, or statement complete */
  }
  pParse->zTail = zSql;                   /* start of any following statement */

  sqlite3ParserFree(pEngine, sqlite3_free);
  return nErr;
}

tokenize.c:273 — sqlite3GetToken: DFA that classifies one lexeme

i64 sqlite3GetToken(const unsigned char *z, int *tokenType){
  int i, c;
  switch( aiClass[*z] ){         /* aiClass is a 256-entry lookup table */

    case CC_SPACE:               /* whitespace */
      for(i=1; sqlite3Isspace(z[i]); i++){}
      *tokenType = TK_SPACE;
      return i;

    case CC_MINUS:               /* "-" or "-- comment" */
      if( z[1]=='-' ){
        for(i=2; z[i] && z[i]!='\n'; i++){}
        *tokenType = TK_SPACE;   /* comments become whitespace */
        return i;
      }
      *tokenType = TK_MINUS;
      return 1;

    case CC_ALPHA: {             /* keyword or identifier */
      for(i=1; (c=z[i])!=0 && (aiClass[c]==CC_ALPHA
                             || aiClass[c]==CC_DIGIT); i++){}
      /* sqlite3KeywordCode() maps the text to TK_SELECT etc. */
      *tokenType = sqlite3KeywordCode(z, i);
      return i;
    }

    case CC_DIGIT:               /* numeric literal */
      /* ... digit-scanning code elided ... */
      *tokenType = TK_INTEGER;   /* TK_FLOAT if a '.' or exponent follows */
      return i;

    /* ... many more character classes ... */
  }
}
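
The table-driven dispatch used above can be tried in isolation with the following sketch. It is a simplified stand-in, not SQLite's code: classTable, init_class_table(), classify(), and the C_* and K_* constants are hypothetical names, the class set is far smaller than SQLite's, and the letter ranges assume an ASCII character set.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical character classes (much smaller than SQLite's CC_* set). */
enum { C_OTHER = 0, C_SPACE, C_DIGIT, C_ALPHA, C_MINUS, C_NUL };

/* Hypothetical token kinds returned by classify(). */
enum { K_END = 0, K_SPACE, K_INTEGER, K_WORD, K_MINUS, K_OTHER };

static unsigned char classTable[256];    /* plays the role of aiClass[] */

/* Fill the 256-entry table once, before the first call to classify(). */
static void init_class_table(void){
  int c;
  for(c=0; c<256; c++) classTable[c] = C_OTHER;
  classTable[0] = C_NUL;
  classTable[' '] = classTable['\t'] = C_SPACE;
  classTable['\n'] = classTable['\r'] = C_SPACE;
  for(c='0'; c<='9'; c++) classTable[c] = C_DIGIT;
  for(c='a'; c<='z'; c++) classTable[c] = C_ALPHA;   /* assumes ASCII */
  for(c='A'; c<='Z'; c++) classTable[c] = C_ALPHA;
  classTable['_'] = C_ALPHA;
  classTable['-'] = C_MINUS;
}

/* One table lookup on the first byte selects the scanning loop.
** Returns the token length and stores its kind in *kind. */
static size_t classify(const unsigned char *z, int *kind){
  size_t i = 1;
  assert( z!=NULL && kind!=NULL );
  switch( classTable[z[0]] ){
    case C_NUL:
      *kind = K_END;
      return 0;
    case C_SPACE:
      while( classTable[z[i]]==C_SPACE ) i++;
      *kind = K_SPACE;
      return i;
    case C_DIGIT:
      while( classTable[z[i]]==C_DIGIT ) i++;
      *kind = K_INTEGER;
      return i;
    case C_ALPHA:
      while( classTable[z[i]]==C_ALPHA || classTable[z[i]]==C_DIGIT ) i++;
      *kind = K_WORD;
      return i;
    case C_MINUS:
      if( z[1]=='-' ){                   /* "--" comment runs to end of line */
        for(i=2; z[i] && z[i]!='\n'; i++){}
        *kind = K_SPACE;
        return i;
      }
      *kind = K_MINUS;
      return 1;
    default:
      *kind = K_OTHER;
      return 1;
  }
}
```

The point of the table is that the first byte is classified with a single array index instead of a chain of range comparisons, so the switch jumps directly to the loop that knows how to finish that kind of token.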

Token Types

Every token has an integer type constant defined in parse.h (auto-generated by LEMON from parse.y). The tokenizer maps character sequences to these constants via the aiClass[] lookup table and the sqlite3KeywordCode() function (generated in keywordhash.h).

Token constant	Matches
`TK_SELECT`	keyword `SELECT`
`TK_FROM`	keyword `FROM`
`TK_WHERE`	keyword `WHERE`
`TK_ID`	identifier (table/column name), either bare or quoted as `"name"`, `[name]`, or a backtick-quoted name
`TK_STRING`	single-quoted string literal `'text'`
`TK_INTEGER`	integer literal `42`
`TK_FLOAT`	floating point literal `3.14`
`TK_BLOB`	hex blob literal `X'ABCD'`
`TK_SEMI`	statement terminator `;`
`0`	end of input (there is no named constant; sqlite3RunParser() passes token code 0 to the parser)
`TK_SPACE`	whitespace / comments (dropped by sqlite3RunParser() before they reach the parser)

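Keywords are detected by sqlite3KeywordCode(), which uses a compact hash table generated by mkkeywordhash.c. The hash is computed from the first character, the last character, and the length of the word, and it selects a short chain of candidate keywords. Only those candidates are compared against the text, so the full keyword list is never searched linearly.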
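
A minimal sketch of hash-based keyword lookup follows. It is not the code that mkkeywordhash.c generates: the three-keyword table, the bucket count, the hash constants, and the names keyword_init() and keyword_code() are assumptions chosen for illustration. This version resolves collisions with short chains and confirms each hit with a case-insensitive comparison.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical token codes; T_ID means "not a keyword". */
enum { T_ID = 0, T_SELECT, T_FROM, T_WHERE };

typedef struct {
  const char *zName;   /* keyword text, upper case */
  int code;            /* token code to return */
  int next;            /* next keyword in the same bucket, or -1 */
} Keyword;

static Keyword aKeyword[] = {
  { "SELECT", T_SELECT, -1 },
  { "FROM",   T_FROM,   -1 },
  { "WHERE",  T_WHERE,  -1 },
};
#define N_KEYWORD ((int)(sizeof(aKeyword)/sizeof(aKeyword[0])))
#define N_BUCKET  17                     /* arbitrary size for this sketch */
static int aBucket[N_BUCKET];            /* head of each chain, or -1 */

/* Hash on first character, last character, and length (case-folded). */
static unsigned hash_word(const char *z, size_t n){
  unsigned first = (unsigned)toupper((unsigned char)z[0]);
  unsigned last  = (unsigned)toupper((unsigned char)z[n-1]);
  return ((first*4) ^ (last*3) ^ (unsigned)n) % N_BUCKET;
}

/* Build the bucket chains; call once before keyword_code(). */
static void keyword_init(void){
  int i;
  for(i=0; i<N_BUCKET; i++) aBucket[i] = -1;
  for(i=0; i<N_KEYWORD; i++){
    unsigned h = hash_word(aKeyword[i].zName, strlen(aKeyword[i].zName));
    aKeyword[i].next = aBucket[h];
    aBucket[h] = i;
  }
}

/* Return the keyword code for the n bytes at z, or T_ID if none matches. */
static int keyword_code(const char *z, size_t n){
  int i;
  assert( z!=NULL );
  if( n==0 ) return T_ID;
  for(i=aBucket[hash_word(z, n)]; i>=0; i=aKeyword[i].next){
    const char *zKW = aKeyword[i].zName;
    size_t j;
    if( strlen(zKW)!=n ) continue;       /* cheap length filter first */
    for(j=0; j<n && toupper((unsigned char)z[j])==zKW[j]; j++){}
    if( j==n ) return aKeyword[i].code;
  }
  return T_ID;
}
```

After keyword_init(), keyword_code("select", 6) yields T_SELECT regardless of case, while a non-keyword such as users falls through to T_ID after examining at most the few entries in one bucket.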

Source files: tool/mkkeywordhash.c, tokenize.c

Token Flow Example

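For the SQL SELECT name FROM users WHERE id = 1, the parser receives the following tokens. The tokenizer also emits a TK_SPACE token for each run of whitespace, but those never reach the parser: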

TK_SELECT  "SELECT"
TK_ID      "name"
TK_FROM    "FROM"
TK_ID      "users"
TK_WHERE   "WHERE"
TK_ID      "id"
TK_EQ      "="
TK_INTEGER "1"
TK_SEMI    (supplied by sqlite3RunParser() at end of input)
0          (end-of-input marker)

Each token is fed immediately to sqlite3Parser() (the LEMON engine), which reduces grammar rules as they complete.

Next Stage

Each token emitted here is consumed by the LEMON LALR(1) parser. The parser builds AST nodes (like Select, Expr) by firing grammar rule actions.

→ Parser (parse.y)