TSDB Write Path
How samples flow from the fanout storage interface through the WAL and Head into durable on-disk blocks.
βΆ Write Path Diagram
scrape.Commit()
β
βΌ
βββββββββββββββββββββββββββββββββββββββββββββ
β fanoutStorage.Appender β storage/fanout.go:29
β fanoutAppender.Append() fans to ALL β
ββββββββββ¬βββββββββββββββββββββββββββββββββββ
β
ββββββββββ΄βββββββββββ
β local (primary) β remote (secondary, best-effort)
βΌ βΌ
ββββββββββββββββββββββββββββββββββββββββββββββββ
β tsdb.DB β tsdb/db.go:291
β Appender() β initAppender β headAppender β
ββββββββββββββββββββββ¬ββββββββββββββββββββββββββ
β
ββββββββββββββ΄βββββββββββββββββ
β β
βΌ βΌ
ββββββββββββββββ βββββββββββββββββββββββ
β WAL (wlog) β β Head β tsdb/head.go:71
β tsdb/wlog/ β β (in-memory) β
β wlog.go:182 β β β
β β β stripeSeries map β
β SERIES rec β β ββ memSeries #ID β tsdb/head.go:2508
β SAMPLE rec β β ββ headChunks β
β EXEMPLAR recβ β ββ mmappedChunksβ
β HISTOGRAM β ββββββββββββ¬βββββββββββ
ββββββββββββββββ β chunk full (120 samples)
βΌ
βββββββββββββββββββββββ
β Head Chunk Files β
β (m-mmap) β
β data/wal/chunks_head/
ββββββββββββ¬βββββββββββ
β compaction trigger
βΌ
βββββββββββββββββββββββ
β Block (on disk) β tsdb/block.go
β data/<ulid>/ β
β βββ chunks/ β
β βββ index β
β βββ tombstones β
β βββ meta.json β
βββββββββββββββββββββββ
βΆ fanoutStorage β Write Multiplexer write
type fanout struct {
logger *slog.Logger
primary Storage // local TSDB β error here aborts the write
secondaries []Storage // remote write β errors logged but ignored
}
// fanoutAppender.Commit() order:
// 1. primary.Commit() β must succeed
// 2. secondary[i].Commit() β best-effort, logged on failure
The fanout decouples local durability from remote delivery. A remote write failure never drops a local sample.
βΆ tsdb.DB β Top-level Database
type DB struct {
dir string
locker *tsdbutil.DirLocker
logger *slog.Logger
opts *Options
compactor Compactor
mtx sync.RWMutex
blocks []*Block // persisted, immutable blocks sorted by time
head *Head // mutable in-memory block
compactc chan struct{} // signal to trigger compaction
stopc chan struct{}
donec chan struct{}
autoCompact bool
...
}
// initAppender defers creation of the real headAppender until the first
// Append() call so that the mint/maxt are known.
type initAppender struct {
app storage.Appender // nil until first append
head *Head
...
}
func (a *initAppender) Append(ref storage.SeriesRef, lset labels.Labels,
t int64, v float64) (storage.SeriesRef, error) {
if a.app != nil {
return a.app.Append(ref, lset, t, v)
}
// First append: create real headAppender
a.app = a.head.appender()
return a.app.Append(ref, lset, t, v)
}
βΆ WAL β Write-Ahead Log write
The WAL provides durability. Every sample is persisted to disk before it enters the in-memory Head. On crash recovery Prometheus replays the WAL to rebuild the Head.
type WL struct {
dir string
logger *slog.Logger
segmentSize int // default 128 MiB
mtx sync.RWMutex
segment *Segment // active segment file
donePages int
page [pageSize]byte // 32 KiB page buffer
actorc chan func()
stopc chan chan struct{}
compress CompressionType // snappy or zstd
...
}
Record Types
| Record | Written when | Content |
|---|---|---|
| SERIES | First time a label set is seen | series ref + labels.Labels |
| SAMPLES | Every Append() | []RefSample{ref, t, v} |
| EXEMPLARS | AppendExemplar() | []RefExemplar{ref, exemplar} |
| HISTOGRAMS | AppendHistogram() | []RefHistogramSample{ref, t, h} |
| METADATA | metadata updates | []RefMetadata{ref, type, unit, help} |
| TOMBSTONES | Delete() intervals | []Stone{ref, {mint,maxt}} |
| MMAPMARKERS | chunk m-mapped | refs of flushed chunks |
Segment Layout
data/wal/ βββ 00000001 β completed segment (128 MiB) βββ 00000002 β completed segment βββ 00000003 β active segment (being written) data/wal/chunks_head/ βββ 000001 β m-mapped head chunks
wlog.WriteCheckpoint()) that compresses older segments.
βΆ Head β In-Memory Block
type Head struct {
chunkRange atomic.Int64 // maximum time range for a chunk (default 2h)
numSeries atomic.Uint64
minTime, maxTime atomic.Int64
wal, wbl *wlog.WL // WAL and Write-Behind Log (OOO samples)
exemplars ExemplarStorage
// Hash-stripe sharded map: 512 stripes to reduce lock contention.
series *stripeSeries
// Pools to recycle slices without GC pressure.
floatsPool zeropool.Pool[[]record.RefSample]
histogramsPool zeropool.Pool[[]record.RefHistogramSample]
...
}
The stripeSeries structure is a 512-way hash-sharded map that maps HeadSeriesRef β *memSeries. This dramatically reduces lock contention under high write concurrency.
βΆ memSeries β Per-Series Storage
type memSeries struct {
// Immutable after construction β no lock needed.
ref chunks.HeadSeriesRef
shardHash uint64
sync.Mutex // guards everything below
lset labels.Labels
// mmappedChunks: completed chunks flushed to disk (memory-mapped).
// Pointer arithmetic tracks firstChunkID to handle compaction shifts.
mmappedChunks []*mmappedChunk
firstChunkID chunks.HeadChunkID
// headChunks: linked list of in-memory chunks still being written.
// headChunks β headChunks.prev β ... (most recent first)
headChunks *memChunk
ooo *memSeriesOOOFields // out-of-order sample state
...
}
Chunk Lifecycle for a memSeries
- First sample β allocate
memChunkwith XOR encoder; attach asheadChunks - Samples appended to active
headChunkvia XOR encoding (~120 samples max) - Chunk full or time range exceeded β flush to
chunks_head/m-map file viachunkDiskMapper - Flushed chunk pointer stored in
mmappedChunks; memory freed - Compaction moves
mmappedChunksinto a Block;firstChunkIDadvances
XOR Chunk Encoding
Prometheus uses the Gorilla XOR encoding for float samples, adapted from Facebook's paper:
| Component | Encoding | Typical size |
|---|---|---|
| timestamp delta | delta-of-delta, variable bits | 1β3 bytes |
| float value | XOR of previous, leading/trailing zeros compressed | 0β9 bytes |
| per sample average | combined | ~1.37 bytes |
Implementation: tsdb/chunkenc/xor.go
βΆ headAppender β Transactional Append
The headAppender collects samples in memory for a single scrape batch, writes WAL records, then appends to memSeries β all under a single lock lifecycle.
func (a *headAppender) Append(ref storage.SeriesRef,
lset labels.Labels, t int64, v float64) (storage.SeriesRef, error) {
// 1. Look up existing series by ref or by label hash.
s := a.head.series.getByID(chunks.HeadSeriesRef(ref))
if s == nil {
// 2. New series: register it, get a new ref.
var created bool
s, created, err = a.head.getOrCreate(lset.Hash(), lset)
if created {
// 3. WAL SERIES record scheduled.
a.series = append(a.series, record.RefSeries{
Ref: s.ref,
Labels: lset,
})
}
}
// 4. Accumulate sample for batch WAL write.
a.samples = append(a.samples, record.RefSample{
Ref: s.ref, T: t, V: v,
})
a.sampleSeries = append(a.sampleSeries, s)
return storage.SeriesRef(s.ref), nil
}
func (a *headAppender) Commit() error {
// 5. Write WAL SERIES + SAMPLES records atomically.
// 6. For each sample: s.append(t, v, ...) β XOR encode into headChunk.
// 7. If chunk full: enqueue for m-map flush.
...
}
βΆ Compaction β Head β Block
When the Head accumulates more than chunkRange * 3/2 (default 3h) of data, a compaction is triggered. The oldest portion of the head is written to a new immutable on-disk Block.
// Block directory layout after compaction:
data/
βββ 01HJXXXXXXXXXXXXX/ β ULID (time-sortable unique ID)
β βββ chunks/
β β βββ 000001 β raw XOR-compressed chunk data
β β βββ 000002
β βββ index β inverted index: label β posting list
β βββ tombstones β delete intervals
β βββ meta.json β {ulid, minTime, maxTime, stats, compaction}
βββ 01HJYYYYY.../
Block Merge (Level Compaction)
| Level | Time range | Triggered by |
|---|---|---|
| 0 (head flush) | β€ 2h | Head min time advancing |
| 1 | β€ 2h Γ 3 = 6h | 3 overlapping L0 blocks |
| 2 | β€ 18h | 3 overlapping L1 blocks |
| N | β€ 2h Γ 3^N | cascading merge |
maxTime < now - retentionDuration are marked for deletion and removed from DB.blocks.
βΆ Out-of-Order (OOO) Samples
Since Prometheus 2.39, OOO samples (arriving with timestamps older than the current Head maxTime) are buffered in a separate Write-Behind Log (WBL) and the memSeriesOOOFields structure, then merged at compaction time.
type memSeriesOOOFields struct {
oooMmappedChunks []*mmappedChunk // flushed OOO chunks
oooHeadChunk *oooHeadChunk // current in-memory OOO chunk
firstOOOChunkID chunks.HeadChunkID
}
// OOO write path:
// headAppender.Append() β detects t < s.maxTime
// β s.appendOOO(t, v)
// β written to wbl (Write-Behind Log, separate WAL)
// β OOO compaction merges into regular blocks
--storage.tsdb.allow-overlapping-compaction and out_of_order_time_window in the TSDB config. OOO data older than the window is silently dropped.