What the Tokenizer Does
The tokenizer is a hand-coded deterministic finite automaton (DFA) in tokenize.c. It scans the SQL string one character at a time and classifies each lexeme into a token type constant (e.g. TK_SELECT, TK_ID, TK_INTEGER).
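Rather than running as a separate lexing pass, SQLite's tokenizer is driven directly by the parser loop: sqlite3RunParser() owns both. Tokenization is incremental. Each call to sqlite3GetToken() recognizes exactly one token, returns its length in bytes, and stores its type through the tokenType pointer. Whitespace aside, that token is then fed immediately to the LEMON parser engine.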
Key Functions
int sqlite3RunParser(Parse *pParse, const char *zSql){
  int nErr = 0;
  void *pEngine;                /* LEMON-generated LALR(1) parser */
  i64 n = 0;                    /* Length of the next token */
  int tokenType;                /* Type of the next token */
  int lastTokenParsed = -1;     /* Type of the previous token */
  sqlite3 *db = pParse->db;

  pParse->rc = SQLITE_OK;
  pParse->zTail = zSql;

  /* Allocate (or reuse) the LEMON parser object */
  pEngine = sqlite3ParserAlloc(sqlite3Malloc, pParse);

  /* ── Main tokenization loop (simplified) ── */
  do {
    n = sqlite3GetToken((u8*)zSql, &tokenType);   /* classify next token */
    if( tokenType==TK_ILLEGAL ){ nErr++; break; }
    if( tokenType==TK_SPACE ){ zSql += n; continue; }  /* never sent to the parser */
    if( tokenType==TK_SEMI ){ pParse->zTail = &zSql[1]; }
    pParse->sLastToken.z = zSql;                  /* record token text ... */
    pParse->sLastToken.n = (u32)n;                /* ... and length */
    sqlite3Parser(pEngine, tokenType,             /* feed token to LALR engine */
                  pParse->sLastToken, pParse);
    lastTokenParsed = tokenType;
    zSql += n;                                    /* advance pointer past token */
  } while( tokenType!=TK_EOF && pParse->rc==SQLITE_OK );
  /* TK_EOF stands in for end of input here; the real loop in tokenize.c
  ** sends TK_SEMI followed by token code 0 when it reaches the NUL byte. */

  sqlite3ParserFree(pEngine, sqlite3_free);
  return nErr;
}
i64 sqlite3GetToken(const unsigned char *z, int *tokenType){
  int i, c;
  switch( aiClass[*z] ){          /* aiClass is a 256-entry lookup table */
    case CC_SPACE:                /* whitespace */
      for(i=1; sqlite3Isspace(z[i]); i++){}
      *tokenType = TK_SPACE;
      return i;
    case CC_MINUS:                /* "-" or "-- comment" */
      if( z[1]=='-' ){
        for(i=2; z[i] && z[i]!='\n'; i++){}
        *tokenType = TK_SPACE;    /* comments become whitespace */
        return i;
      }
      *tokenType = TK_MINUS;
      return 1;
    case CC_ALPHA: {              /* keyword or identifier */
      for(i=1; (c=z[i])!=0 && (aiClass[c]==CC_ALPHA
                            || aiClass[c]==CC_DIGIT); i++){}
      /* sqlite3KeywordCode() maps the text to TK_SELECT etc. */
      *tokenType = sqlite3KeywordCode(z, i);
      return i;
    }
    case CC_DIGIT:                /* numeric literal */
      ...
      *tokenType = TK_INTEGER;
      return i;
    /* ... many more character classes ... */
  }
}
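The table-driven dispatch above can be tried in isolation. The following is a minimal, self-contained sketch of the same idea and is not SQLite's code: the class names, the TOK_* constants, class_of[], init_classes() and toy_get_token() are all hypothetical, and the table is filled at run time for readability, assuming an ASCII character set.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical character classes and token types, for illustration only. */
enum { CC_OTHER, CC_SPACE, CC_ALPHA, CC_DIGIT, CC_NUL };
enum { TOK_ILLEGAL, TOK_SPACE, TOK_WORD, TOK_NUMBER, TOK_END };

static unsigned char class_of[256];   /* plays the role of aiClass[] */

/* Fill the table once; every byte not listed stays CC_OTHER (0). */
void init_classes(void){
  int c;
  for(c='a'; c<='z'; c++) class_of[c] = CC_ALPHA;
  for(c='A'; c<='Z'; c++) class_of[c] = CC_ALPHA;
  class_of['_'] = CC_ALPHA;
  for(c='0'; c<='9'; c++) class_of[c] = CC_DIGIT;
  class_of[' '] = class_of['\t'] = class_of['\n'] = class_of['\r'] = CC_SPACE;
  class_of[0] = CC_NUL;
}

/* Return the byte length of the token starting at z and store its
** type in *type. One call recognizes exactly one token. */
size_t toy_get_token(const unsigned char *z, int *type){
  size_t i = 1;
  assert( z!=NULL && type!=NULL );
  switch( class_of[z[0]] ){
    case CC_SPACE:                    /* run of whitespace */
      while( class_of[z[i]]==CC_SPACE ) i++;
      *type = TOK_SPACE;
      return i;
    case CC_ALPHA:                    /* letters, then letters or digits */
      while( class_of[z[i]]==CC_ALPHA || class_of[z[i]]==CC_DIGIT ) i++;
      *type = TOK_WORD;
      return i;
    case CC_DIGIT:                    /* run of digits */
      while( class_of[z[i]]==CC_DIGIT ) i++;
      *type = TOK_NUMBER;
      return i;
    case CC_NUL:                      /* end of the string */
      *type = TOK_END;
      return 0;
    default:                          /* anything this toy does not know */
      *type = TOK_ILLEGAL;
      return 1;
  }
}
```

The first byte selects the case through a single array lookup, and each case then consumes as many further bytes as belong to the same token. A caller runs init_classes() once and then advances its input pointer by the returned length after every call.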
Token Types
Every token has an integer type constant defined in parse.h (auto-generated by LEMON from parse.y). The tokenizer maps character sequences to these constants via the aiClass[] lookup table and the sqlite3KeywordCode() function (generated in keywordhash.h).
| Token constant | Matches |
|---|---|
| TK_SELECT | keyword SELECT |
| TK_FROM | keyword FROM |
| TK_WHERE | keyword WHERE |
| TK_ID | identifier (table or column name) |
| TK_STRING | single-quoted string literal 'text' |
| TK_INTEGER | integer literal 42 |
| TK_FLOAT | floating-point literal 3.14 |
| TK_BLOB | hex blob literal X'ABCD' |
| TK_SEMI | statement terminator ; |
| TK_EOF | end of input |
| TK_SPACE | whitespace / comments (discarded, not part of the grammar) |
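Keywords are detected by sqlite3KeywordCode(), which uses a compact hash table generated by mkkeywordhash.c. A lookup hashes the candidate word, then verifies it against the few keywords in that hash bucket, so there is no linear scan over the full keyword list. Anything that is not a keyword is reported as TK_ID.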
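The general technique (hash the word to a bucket, then confirm the candidate with a case-insensitive comparison) can be sketched as follows. This is an illustration under stated assumptions, not the generated keywordhash.h code: the three-keyword table, the bucket count, the hash formula, and the names init_keywords() and keyword_code() are all hypothetical, and the table is built at run time rather than precomputed.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical token codes for this sketch. */
enum { TOK_ID = 1, TOK_SELECT, TOK_FROM, TOK_WHERE };

typedef struct { const char *text; int code; } Keyword;

static const Keyword keywords[] = {
  { "SELECT", TOK_SELECT }, { "FROM", TOK_FROM }, { "WHERE", TOK_WHERE },
};
#define N_KEYWORDS ((int)(sizeof(keywords)/sizeof(keywords[0])))
#define N_BUCKETS  17

static int bucket_head[N_BUCKETS];     /* 1-based keyword index, 0 = empty */
static int next_in_bucket[N_KEYWORDS]; /* 1-based chain link, 0 = end */

/* Cheap case-insensitive hash of first byte, last byte, and length. */
static unsigned hash_word(const char *z, size_t n){
  unsigned first = (unsigned)toupper((unsigned char)z[0]);
  unsigned last  = (unsigned)toupper((unsigned char)z[n-1]);
  return ((first*4) ^ (last*3) ^ (unsigned)n) % N_BUCKETS;
}

/* Build the bucket chains. A generator tool would emit these arrays
** as constants instead of computing them at startup. */
void init_keywords(void){
  int k;
  memset(bucket_head, 0, sizeof(bucket_head));
  for(k=0; k<N_KEYWORDS; k++){
    unsigned h = hash_word(keywords[k].text, strlen(keywords[k].text));
    next_in_bucket[k] = bucket_head[h];
    bucket_head[h] = k+1;
  }
}

/* Return the keyword code for z[0..n-1], or TOK_ID if it is not a keyword. */
int keyword_code(const char *z, size_t n){
  int k;
  assert( z!=NULL );
  if( n==0 ) return TOK_ID;
  for(k=bucket_head[hash_word(z, n)]; k>0; k=next_in_bucket[k-1]){
    const char *kw = keywords[k-1].text;
    size_t i;
    if( strlen(kw)!=n ) continue;
    /* Verify the candidate: the hash alone cannot prove a match. */
    for(i=0; i<n && toupper((unsigned char)z[i])==kw[i]; i++){}
    if( i==n ) return keywords[k-1].code;
  }
  return TOK_ID;
}
```

Only the keywords that share a bucket with the input are ever compared, so the cost of a lookup stays small even as the keyword list grows.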
Token Flow Example
For the SQL SELECT name FROM users WHERE id = 1, the tokenizer emits the following tokens (the TK_SPACE tokens between words are omitted):
TK_SELECT "SELECT"
TK_ID "name"
TK_FROM "FROM"
TK_ID "users"
TK_WHERE "WHERE"
TK_ID "id"
TK_EQ "="
TK_INTEGER "1"
TK_EOF
Each token is fed immediately to sqlite3Parser() (the LEMON engine), which reduces grammar rules as they complete.
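To see this flow end to end, here is a small stand-alone sketch of a driver loop in the style of sqlite3RunParser(). Everything in it is an assumption made for illustration: the T_* constants, next_token(), and run_tokens() are hypothetical, the toy tokenizer knows only three keywords, and the "parser" is just an array that records the token types it receives.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical token types for this sketch. */
enum { T_END, T_SPACE, T_SELECT, T_FROM, T_WHERE,
       T_ID, T_EQ, T_INTEGER, T_ILLEGAL };

/* Case-insensitive match of z[0..n-1] against an upper-case keyword. */
static int is_kw(const char *z, size_t n, const char *kw){
  size_t i;
  if( strlen(kw)!=n ) return 0;
  for(i=0; i<n; i++){
    if( toupper((unsigned char)z[i])!=kw[i] ) return 0;
  }
  return 1;
}

/* Toy tokenizer: returns the token length, stores its type in *type. */
size_t next_token(const char *z, int *type){
  size_t i = 1;
  unsigned char c = (unsigned char)z[0];
  if( c==0 ){ *type = T_END; return 0; }
  if( isspace(c) ){
    while( isspace((unsigned char)z[i]) ) i++;
    *type = T_SPACE;
  }else if( isalpha(c) || c=='_' ){
    while( isalnum((unsigned char)z[i]) || z[i]=='_' ) i++;
    *type = is_kw(z, i, "SELECT") ? T_SELECT
          : is_kw(z, i, "FROM")   ? T_FROM
          : is_kw(z, i, "WHERE")  ? T_WHERE : T_ID;
  }else if( isdigit(c) ){
    while( isdigit((unsigned char)z[i]) ) i++;
    *type = T_INTEGER;
  }else if( c=='=' ){
    *type = T_EQ;
  }else{
    *type = T_ILLEGAL;
  }
  return i;
}

/* Driver loop: pull one token at a time and hand every non-space token
** to the stand-in parser (the out[] array). Returns the number of tokens
** delivered, or -1 on an illegal token or if out[] is too small. */
int run_tokens(const char *zSql, int *out, int maxOut){
  int nOut = 0, type;
  assert( zSql!=NULL && out!=NULL );
  do{
    size_t n = next_token(zSql, &type);
    if( type==T_ILLEGAL ) return -1;
    if( type!=T_SPACE ){
      if( nOut>=maxOut ) return -1;
      out[nOut++] = type;          /* a real driver calls the parser here */
    }
    zSql += n;                     /* advance past the token */
  }while( type!=T_END );
  return nOut;
}
```

Running run_tokens() on the example statement delivers nine tokens, ending with the end-of-input marker, in the same order as the listing above.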
Next Stage
Each token emitted here is consumed by the LEMON LALR(1) parser. The parser builds AST nodes (like Select, Expr) by firing grammar rule actions.