B-Tree Fundamentals
SQLite stores every table and index as a separate B-tree within the database file. The file is divided into pages of a fixed size (default 4096 bytes). Each page is either an interior node (holds only keys + child pointers) or a leaf node (holds keys + data).
+ child ptrs
SQLite uses a variant called a B*-tree: all data lives in leaf pages; interior pages hold only separator keys and child page numbers. Tables are sorted by rowid (or the PRIMARY KEY for WITHOUT ROWID tables); indexes are sorted by the indexed column values.
Key Functions
int sqlite3BtreeCursor(
Btree *p, /* the btree */
Pgno iTable, /* root page of the table/index */
int wrFlag, /* 1 = write cursor, 0 = read-only */
struct KeyInfo *pKeyInfo, /* column affinity/collation (indexes only) */
BtCursor *pCur /* OUT: write new cursor here */
){
if( p->sharable ){
return btreeCursorWithLock(p, iTable, wrFlag, pKeyInfo, pCur);
}else{
return btreeCursor(p, iTable, wrFlag, pKeyInfo, pCur);
}
}
/* The cursor tracks:
- a page stack from root → leaf (the "path")
- the cell index within the current leaf page
- whether the cursor is valid (pointing at an existing entry) */
int sqlite3BtreeNext(BtCursor *pCur, int flags){
/* Moves within the current leaf page first.
If we reach the last cell, ascend to the parent,
then descend into the next subtree. */
MemPage *pPage = pCur->pPage;
int idx = pCur->ix;
idx++;
if( idx >= pPage->nCell ){
/* move to parent or return SQLITE_DONE */
return moveToParent(pCur);
}
pCur->ix = idx;
return SQLITE_OK;
}
BtCursor — Position Tracking
A BtCursor tracks the cursor's current position in the B-tree as a stack of (page, cell-index) pairs — the path from the root down to the current leaf entry.
/* btreeInt.h — BtCursor (simplified) */
struct BtCursor {
u8 eState; /* CURSOR_VALID, CURSOR_INVALID, CURSOR_SKIPNEXT */
u8 curFlags; /* read-only? etc. */
i16 iPage; /* index of current level in apPage[] */
u16 ix; /* current cell index on apPage[iPage] */
Pgno pgnoRoot; /* root page of this B-tree */
BtShared *pBt; /* shared B-tree state */
MemPage *pPage; /* current page (= apPage[iPage]) */
MemPage *apPage[BTCURSOR_MAX_DEPTH]; /* page stack root→leaf */
u16 aiIdx[BTCURSOR_MAX_DEPTH]; /* cell index at each level */
KeyInfo *pKeyInfo; /* collation for index cursors */
...
};
Page Layout
Each page follows a defined binary format. The first 100 bytes of page 1 are the database file header. Every page starts with a small page header followed by a cell pointer array (offsets into the page content area), then the cell content itself at the end of the page.
Page layout (4096 bytes): ┌─────────────────────────────────┐ 0 │ Page header (8 or 12 bytes) │ │ flags, freeblocks, ncells, │ │ content area offset, ... │ ├─────────────────────────────────┤ 8/12 │ Cell pointer array │ │ [offset₀, offset₁, offset₂…] │ 2 bytes each ├─────────────────────────────────┤ │ Unallocated space │ │ (grows down ↓) │ ├─────────────────────────────────┤ │ Cell content area │ ← cells stored right-to-left │ [cell₂ | cell₁ | cell₀] │ └─────────────────────────────────┘ 4096
For a table leaf page, each cell encodes: varint(payload_length) + varint(rowid) + payload_data. For an index leaf page, each cell encodes: varint(payload_length) + payload_data (key columns + rowid packed into the key).
Record Format
Row data (the "payload" in leaf cells) uses SQLite's packed record format. A record header tells the type and size of each column value; column values follow in order.
Record format: ┌────────────────┬──────────────────────────────┬──────────────────┐ │ header_size │ serial_type₀ serial_type₁ … │ value₀ value₁ … │ │ (varint) │ (varint each) │ (variable width) │ └────────────────┴──────────────────────────────┴──────────────────┘ Serial type encoding: 0 → NULL 1 → 1-byte int 2 → 2-byte int ... 7 → IEEE 754 double 8 → integer 0 (no bytes in value section) 9 → integer 1 (no bytes in value section) N≥12, N even → BLOB, length = (N-12)/2 bytes N≥13, N odd → TEXT, length = (N-13)/2 bytes
The VDBE OP_MakeRecord encodes registers into this format; OP_Column decodes it back, skipping directly to the requested column offset.
Insert and Page Splits
Inserting a new entry moves the cursor to the correct leaf position using a binary search, then calls insertCell(). When a leaf fills up, balance() redistributes cells across sibling pages (a B-tree balance/split operation). Interior nodes are updated to reflect the new separator keys.
int sqlite3BtreeInsert(
BtCursor *pCur, /* cursor pointing at insertion location */
const BtreePayload *pX, /* key + data to insert */
int flags, /* BTREE_SAVEPOSITION etc. */
int seekResult /* result of prior seek (0 if unknown) */
){
/* 1. Ensure write transaction is active */
/* 2. Move cursor to correct leaf position */
/* 3. insertCell() — put the encoded cell onto the leaf page */
/* 4. balance() — split/redistribute if page overflows */
...
}
Next Stage
The B-tree engine never reads or writes disk directly. It asks the Pager for a DbPage* (an in-memory copy of a page) via sqlite3PagerGet(), modifies it, and marks it dirty via sqlite3PagerWrite(). The Pager handles journaling and eventual disk flush.