Stage 1 — Tokenizer
tokenize.c — Breaks SQL text into a stream of typed tokens for the parser
What the Tokenizer Does

The tokenizer is a hand-coded deterministic finite automaton (DFA) in tokenize.c. It scans the SQL string one character at a time and classifies each lexeme into a token type constant (e.g. TK_SELECT, TK_ID, TK_INTEGER).

Rather than running as a separate lexing pass, SQLite's tokenizer is tightly coupled to the parser: sqlite3RunParser() drives both. It tokenizes incrementally. Each call to sqlite3GetToken() classifies exactly one token, returning its length and writing its type through the tokenType out-parameter, and that token is then fed straight to the LEMON parser engine.

Design choice: The tokenizer and parser run in a single pass. There is no separate tokenization phase that builds a token list, so no memory is ever allocated to hold one.
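
The single-pass design can be sketched in a few lines of standalone C. This is a toy illustration, not SQLite code: the toy_* names, the TOY_* token types, and the callback standing in for the parser are all assumptions made for the example. It shows only the shape of the loop, where each token is classified, handed over, and forgotten.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical token types for this sketch; not SQLite's TK_* constants. */
enum { TOY_SPACE, TOY_WORD, TOY_NUMBER, TOY_PUNCT };

/* Classify the token that starts at z, store its type, return its length.
** The caller must not pass a pointer to the terminating NUL. */
size_t toy_get_token(const char *z, int *type){
  size_t i = 1;
  unsigned char c = (unsigned char)z[0];
  if( isspace(c) ){
    while( isspace((unsigned char)z[i]) ) i++;
    *type = TOY_SPACE;
  }else if( isalpha(c) || c=='_' ){
    while( isalnum((unsigned char)z[i]) || z[i]=='_' ) i++;
    *type = TOY_WORD;
  }else if( isdigit(c) ){
    while( isdigit((unsigned char)z[i]) ) i++;
    *type = TOY_NUMBER;
  }else{
    *type = TOY_PUNCT;               /* any other single byte */
  }
  return i;
}

/* Stand-in for the parser: it is handed one token at a time. */
typedef void (*toy_sink)(void *ctx, int type, const char *z, size_t n);

/* Pull one token, push it to the sink, advance. No token array is built. */
void toy_run(const char *sql, toy_sink sink, void *ctx){
  assert( sql!=NULL && sink!=NULL );
  while( *sql ){
    int type;
    size_t n = toy_get_token(sql, &type);
    if( type!=TOY_SPACE ) sink(ctx, type, sql, n);  /* whitespace is dropped */
    sql += n;
  }
}

/* Example sink: counts the tokens it receives. */
void toy_count(void *ctx, int type, const char *z, size_t n){
  (void)type; (void)z; (void)n;
  ++*(int*)ctx;
}
```

Calling toy_run() with the text SELECT name FROM users WHERE id = 1, the toy_count sink, and a pointer to an int initialized to 0 leaves that int at 8. The seven runs of whitespace are classified but never forwarded, and the only state the loop keeps is the current position in the string.
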
Key Functions
tokenize.c:600 — sqlite3RunParser: drives the tokenize+parse loop
int sqlite3RunParser(Parse *pParse, const char *zSql){
  int nErr = 0;
  void *pEngine;               /* LEMON-generated LALR(1) parser */
  i64 n = 0;                   /* Length of the next token */
  int tokenType;               /* type of the next token */
  int lastTokenParsed = -1;    /* type of the previous token */
  sqlite3 *db = pParse->db;

  pParse->rc = SQLITE_OK;
  pParse->zTail = zSql;

  /* Allocate (or reuse) the LEMON parser object */
  pEngine = sqlite3ParserAlloc(sqlite3Malloc, pParse);

  /* ── Main tokenization loop ── */
  do {
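    n = sqlite3GetToken((u8*)zSql, &tokenType); /* classify next token */

    if( tokenType==TK_SPACE )   { zSql += n; continue; } /* never parsed */
    if( tokenType==TK_ILLEGAL ) { nErr++; break; }
    if( tokenType==TK_SEMI )    { pParse->zTail = &zSql[1]; }

    pParse->sLastToken.z = zSql;           /* record token text and length */
    pParse->sLastToken.n = (u32)n;
    sqlite3Parser(pEngine, tokenType,      /* feed token to LALR engine */
                  pParse->sLastToken, pParse);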

    lastTokenParsed = tokenType;
    zSql += n;                            /* advance pointer past token */
  } while( tokenType!=TK_EOF && pParse->rc==SQLITE_OK );

  sqlite3ParserFree(pEngine, sqlite3_free);
  return nErr;
}
tokenize.c:273 — sqlite3GetToken: DFA that classifies one lexeme
i64 sqlite3GetToken(const unsigned char *z, int *tokenType){
  int i, c;
  switch( aiClass[*z] ){         /* aiClass is a 256-entry lookup table */

    case CC_SPACE:               /* whitespace */
      for(i=1; sqlite3Isspace(z[i]); i++){}
      *tokenType = TK_SPACE;
      return i;

    case CC_MINUS:               /* "-" or "-- comment" */
      if( z[1]=='-' ){
        for(i=2; z[i] && z[i]!='\n'; i++){}
        *tokenType = TK_SPACE;   /* comments become whitespace */
        return i;
      }
      *tokenType = TK_MINUS;
      return 1;

    case CC_ALPHA: {             /* keyword or identifier */
      for(i=1; (c=z[i])!=0 && (aiClass[c]==CC_ALPHA
                             || aiClass[c]==CC_DIGIT); i++){}
      /* sqlite3KeywordCode() maps the text to TK_SELECT etc. */
      *tokenType = sqlite3KeywordCode(z, i);
      return i;
    }

    case CC_DIGIT:               /* numeric literal; float handling elided */
      ...
      *tokenType = TK_INTEGER;
      return i;

    /* ... many more character classes ... */
  }
}
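
To make the table-driven dispatch concrete, here is a minimal standalone sketch of the same technique. The CL_* classes, class_of[], and lexeme_len() are hypothetical names invented for this example, and the table is filled at run time under the assumption of an ASCII character set, whereas the listing above indexes a prebuilt aiClass[] table. The point is that one array index on the first byte selects the scanning rule.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical character classes; far fewer than a real SQL tokenizer needs. */
enum { CL_OTHER, CL_SPACE, CL_ALPHA, CL_DIGIT, CL_MINUS };

static unsigned char class_of[256];  /* zero-initialized, so all CL_OTHER */

/* Fill the lookup table once. Assumes ASCII, where letters are contiguous. */
void init_classes(void){
  int c;
  for(c='a'; c<='z'; c++) class_of[c] = CL_ALPHA;
  for(c='A'; c<='Z'; c++) class_of[c] = CL_ALPHA;
  for(c='0'; c<='9'; c++) class_of[c] = CL_DIGIT;
  class_of['_'] = CL_ALPHA;
  class_of[' '] = class_of['\t'] = class_of['\n'] = class_of['\r'] = CL_SPACE;
  class_of['-'] = CL_MINUS;
}

/* Return the length of the lexeme starting at z and report its class.
** The NUL terminator has class CL_OTHER, so every scan loop stops there. */
size_t lexeme_len(const unsigned char *z, int *cls){
  size_t i = 1;
  assert( z[0]!=0 );                 /* caller handles end of input */
  *cls = class_of[z[0]];
  switch( *cls ){
    case CL_SPACE:
      while( class_of[z[i]]==CL_SPACE ) i++;
      break;
    case CL_ALPHA:                   /* letters, then letters or digits */
      while( class_of[z[i]]==CL_ALPHA || class_of[z[i]]==CL_DIGIT ) i++;
      break;
    case CL_DIGIT:
      while( class_of[z[i]]==CL_DIGIT ) i++;
      break;
    case CL_MINUS:
      if( z[1]=='-' ){               /* "--" comment runs to end of line */
        for(i=2; z[i] && z[i]!='\n'; i++){}
        *cls = CL_SPACE;             /* comments are treated as whitespace */
      }
      break;
    default:
      break;                         /* one byte of an unclassified kind */
  }
  return i;
}
```

After init_classes(), lexeme_len() on the text users WHERE reports length 5 with class CL_ALPHA, and on a line comment such as -- hi followed by a newline it reports length 5 with class CL_SPACE, leaving the newline for the next call.
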
Token Types

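Every token has an integer type constant defined in parse.h (auto-generated by LEMON from parse.y). The tokenizer reaches these constants in two steps: the aiClass[] lookup table maps the first byte of a lexeme to a character class, and the matching case in sqlite3GetToken() then picks the token type, calling sqlite3KeywordCode() (generated in keywordhash.h) for words that may be keywords.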

Token constant  Matches
TK_SELECT       keyword SELECT
TK_FROM         keyword FROM
TK_WHERE        keyword WHERE
TK_ID           unquoted identifier (table/column name)
TK_STRING       single-quoted string literal 'text'
TK_INTEGER      numeric integer literal 42
TK_FLOAT        floating point literal 3.14
TK_BLOB         hex blob literal X'ABCD'
TK_SEMI         statement terminator ;
TK_EOF          end of input
TK_SPACE        whitespace / comments (skipped by parser)

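Keywords are detected by sqlite3KeywordCode(), which uses a precomputed hash table generated by mkkeywordhash.c. A lookup hashes the candidate text and compares it only against the few keywords in the matching bucket, so there is no linear scan over the full keyword list.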
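
The following sketch shows one way such a lookup can work. It is not the generated code from keywordhash.h: the three-keyword table, the KW_* codes, the bucket count, and the hash formula (first byte, last byte, and length, case-folded) are assumptions chosen to keep the example small, and the table is built at run time rather than emitted by a generator.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

enum { KW_ID, KW_SELECT, KW_FROM, KW_WHERE };   /* hypothetical token codes */

struct kw { const char *text; int len; int code; };
static const struct kw keywords[] = {
  { "SELECT", 6, KW_SELECT },
  { "FROM",   4, KW_FROM   },
  { "WHERE",  5, KW_WHERE  },
};
#define NKW   (sizeof(keywords)/sizeof(keywords[0]))
#define NSLOT 16

static int slot[NSLOT];   /* 1-based index of first keyword in bucket, 0 = empty */
static int chain[NKW];    /* 1-based index of the next keyword in the same bucket */

/* Hash on first byte, last byte, and length, ignoring case. Requires n>=1. */
static unsigned kw_hash(const char *z, int n){
  unsigned a = (unsigned)toupper((unsigned char)z[0]);
  unsigned b = (unsigned)toupper((unsigned char)z[n-1]);
  return ((a*4) ^ (b*3) ^ (unsigned)n) % NSLOT;
}

/* Build the buckets once. A generator would emit these arrays as constants. */
void kw_init(void){
  size_t k;
  for(k=0; k<NSLOT; k++) slot[k] = 0;
  for(k=0; k<NKW; k++){
    unsigned h = kw_hash(keywords[k].text, keywords[k].len);
    chain[k] = slot[h];              /* push onto the front of the bucket */
    slot[h] = (int)k + 1;
  }
}

/* Return the keyword code for z[0..n-1], or KW_ID if it is not a keyword. */
int kw_code(const char *z, int n){
  int k, j;
  assert( z!=NULL );
  if( n<=0 ) return KW_ID;
  for(k=slot[kw_hash(z, n)]; k>0; k=chain[k-1]){
    const struct kw *p = &keywords[k-1];
    if( p->len!=n ) continue;        /* cheap reject before comparing text */
    for(j=0; j<n && toupper((unsigned char)z[j])==p->text[j]; j++){}
    if( j==n ) return p->code;
  }
  return KW_ID;
}
```

After kw_init(), kw_code("select", 6) returns KW_SELECT and kw_code("users", 5) returns KW_ID. A candidate is compared only against keywords that share its bucket, and the length check rejects most of those before any characters are compared.
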

Token Flow Example

For the SQL SELECT name FROM users WHERE id = 1, the tokenizer emits the following (the TK_SPACE tokens between words are omitted):

TK_SELECT  "SELECT"
TK_ID      "name"
TK_FROM    "FROM"
TK_ID      "users"
TK_WHERE   "WHERE"
TK_ID      "id"
TK_EQ      "="
TK_INTEGER "1"
TK_EOF

Each token is fed immediately to sqlite3Parser() (the LEMON engine) which reduces grammar rules as they complete.
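
The push-style interface implied here, where the caller hands the parser one token per call and the parser keeps its own state between calls, can be illustrated with a deliberately tiny recognizer. This is not an LALR(1) parser and not LEMON output: toy_parser, toy_push(), and the PT_* codes are hypothetical, and the "grammar" is a single fixed token sequence, SELECT id FROM id with an optional WHERE id = integer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical token codes; real values would come from a generated header. */
enum { PT_EOF, PT_SELECT, PT_FROM, PT_WHERE, PT_ID, PT_EQ, PT_INTEGER };

/* Parser state survives between calls, so tokens can arrive one by one. */
typedef struct { int state; int failed; } toy_parser;

void toy_parser_init(toy_parser *p){
  assert( p!=NULL );
  p->state = 0;
  p->failed = 0;
}

/* expected[s] is the only token type that moves state s to state s+1. */
static const int expected[8] = {
  PT_SELECT, PT_ID, PT_FROM, PT_ID, PT_WHERE, PT_ID, PT_EQ, PT_INTEGER
};

/* Push one token into the recognizer. Errors are sticky once set. */
void toy_push(toy_parser *p, int tok){
  assert( p!=NULL );
  if( p->failed ) return;
  if( tok==PT_EOF ){
    /* Input may end after "FROM id" (state 4) or after the WHERE clause. */
    if( p->state!=4 && p->state!=8 ) p->failed = 1;
  }else if( p->state<8 && expected[p->state]==tok ){
    p->state++;
  }else{
    p->failed = 1;
  }
}
```

Pushing the nine tokens of the flow above, in order, leaves failed at 0 and state at 8. Pushing PT_SELECT followed directly by PT_FROM sets failed to 1 on the second call, which mirrors how a syntax error surfaces at the first offending token rather than after the whole statement has been tokenized.
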

Next Stage

Each token emitted here is consumed by the LEMON LALR(1) parser. The parser builds AST nodes (like Select, Expr) by firing grammar rule actions.