Prometheus Data Flow — TSDB Write Path

▶🗺 Write Path Diagram

  scrape.Commit()
        │
        ▼
  ┌───────────────────────────────────────────┐
  │         fanoutStorage.Appender            │  storage/fanout.go:29
  │   fanoutAppender.Append() fans to ALL     │
  └────────┬──────────────────────────────────┘
           │
  ┌────────┴──────────┐
  │ local (primary)   │   remote (secondary, best-effort)
  ▼                   ▼
  ┌──────────────────────────────────────────────┐
  │                tsdb.DB                       │  tsdb/db.go:291
  │  Appender() → initAppender → headAppender    │
  └────────────────────┬─────────────────────────┘
                       │
          ┌────────────┴────────────────┐
          │                             │
          ▼                             ▼
  ┌──────────────┐            ┌─────────────────────┐
  │  WAL (wlog)  │            │   Head              │  tsdb/head.go:71
  │ tsdb/wlog/   │            │   (in-memory)       │
  │ wlog.go:182  │            │                     │
  │              │            │  stripeSeries map   │
  │  SERIES rec  │            │  └─ memSeries #ID   │  tsdb/head.go:2508
  │  SAMPLE rec  │            │     └─ headChunks   │
  │  EXEMPLAR rec│            │     └─ mmappedChunks│
  │  HISTOGRAM   │            └──────────┬──────────┘
  └──────────────┘                       │  chunk full (120 samples)
                                         ▼
                              ┌─────────────────────┐
                              │  Head Chunk Files   │
                              │  (mmapped)          │
                              │  data/chunks_head/  │
                              └──────────┬──────────┘
                                         │  compaction trigger
                                         ▼
                              ┌─────────────────────┐
                              │  Block (on disk)    │  tsdb/block.go
                              │  data/<ulid>/       │
                              │  ├── chunks/        │
                              │  ├── index          │
                              │  ├── tombstones     │
                              │  └── meta.json      │
                              └─────────────────────┘

▶📡 fanoutStorage — Write Multiplexer

storage/fanout.go — fanout struct L29

type fanout struct {
    logger      *slog.Logger
    primary     Storage   // local TSDB; an error here aborts the write
    secondaries []Storage // remote write; errors are logged but ignored
}

// fanoutAppender.Commit() order:
//  1. primary.Commit()   ← must succeed
//  2. secondary[i].Commit()  ← best-effort, logged on failure

The fanout decouples local durability from remote delivery. A remote write failure never drops a local sample.
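The commit policy described above can be sketched with stand-in types. This is a minimal illustration of "primary must succeed, secondaries are best-effort", not the code in storage/fanout.go; the appender interface, stubAppender, fanoutCommit, and errRemoteDown are hypothetical names introduced for the example.

```go
package main

import (
	"errors"
	"fmt"
)

// errRemoteDown is a hypothetical error used to simulate a failing secondary.
var errRemoteDown = errors.New("remote endpoint unreachable")

// appender is a hypothetical, trimmed-down stand-in for storage.Appender.
type appender interface {
	Commit() error
	Rollback() error
}

// stubAppender records what happened to it so the policy can be observed.
type stubAppender struct {
	commitErr  error
	committed  bool
	rolledBack bool
}

func (s *stubAppender) Commit() error {
	if s.commitErr != nil {
		return s.commitErr
	}
	s.committed = true
	return nil
}

func (s *stubAppender) Rollback() error {
	s.rolledBack = true
	return nil
}

// fanoutCommit commits the primary first. If that fails, every secondary is
// rolled back and the error is returned. Secondary failures are only collected
// (for logging); they never fail the write.
func fanoutCommit(primary appender, secondaries []appender) ([]error, error) {
	if err := primary.Commit(); err != nil {
		for _, s := range secondaries {
			_ = s.Rollback()
		}
		return nil, err
	}
	var secondaryErrs []error
	for _, s := range secondaries {
		if err := s.Commit(); err != nil {
			secondaryErrs = append(secondaryErrs, err)
		}
	}
	return secondaryErrs, nil
}

func main() {
	local := &stubAppender{}
	remote := &stubAppender{commitErr: errRemoteDown}

	logged, err := fanoutCommit(local, []appender{remote})
	fmt.Println("write error:", err)
	fmt.Println("local committed:", local.committed)
	fmt.Println("secondary errors to log:", len(logged))
}
```

Returning the secondary errors separately keeps the caller free to log them without turning them into a write failure.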

▶🗄 tsdb.DB — Top-level Database

tsdb/db.go — DB struct (key fields) L291

type DB struct {
    dir    string
    locker *tsdbutil.DirLocker

    logger    *slog.Logger
    opts      *Options
    compactor Compactor

    mtx    sync.RWMutex
    blocks []*Block      // persisted, immutable blocks sorted by time

    head *Head           // mutable in-memory block

    compactc chan struct{} // signal to trigger compaction
    stopc    chan struct{}
    donec    chan struct{}

    autoCompact bool
    ...
}

tsdb/head_append.go — initAppender (entry point) L50

// initAppender defers creation of the real headAppender until the first
// Append() call so that the mint/maxt are known.
type initAppender struct {
    app  storage.Appender    // nil until first append
    head *Head
    ...
}

func (a *initAppender) Append(ref storage.SeriesRef, lset labels.Labels,
    t int64, v float64) (storage.SeriesRef, error) {
    if a.app != nil {
        return a.app.Append(ref, lset, t, v)
    }
    // First append: create real headAppender
    a.app = a.head.appender()
    return a.app.Append(ref, lset, t, v)
}

▶📝 WAL — Write-Ahead Log

The WAL provides durability. Every sample is written to the WAL before it is applied to the in-memory Head, so after a crash Prometheus can replay the WAL to rebuild the Head.
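The log-first, apply-second ordering and the replay step can be sketched as follows. This is a toy model under stated assumptions: the log is an in-memory buffer with fixed-size records, whereas the real WAL writes typed, paged records into segment files. The sample, wal, head, appendSample, and replay names are hypothetical.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"io"
)

// sample is a hypothetical fixed-size record: series ref, timestamp, value.
type sample struct {
	Ref uint64
	T   int64
	V   float64
}

// wal is a toy log backed by a buffer instead of segment files.
type wal struct{ buf bytes.Buffer }

// log appends one record to the log.
func (w *wal) log(s sample) error {
	return binary.Write(&w.buf, binary.LittleEndian, s)
}

// head is the in-memory state that has to be rebuilt after a crash.
type head struct{ last map[uint64]sample }

func newHead() *head { return &head{last: map[uint64]sample{}} }

func (h *head) apply(s sample) { h.last[s.Ref] = s }

// appendSample logs first and only then touches the in-memory state.
func appendSample(w *wal, h *head, s sample) error {
	if err := w.log(s); err != nil {
		return err
	}
	h.apply(s)
	return nil
}

// replay rebuilds a fresh head from the logged records.
func replay(logged []byte) (*head, error) {
	h := newHead()
	r := bytes.NewReader(logged)
	for {
		var s sample
		err := binary.Read(r, binary.LittleEndian, &s)
		if err == io.EOF {
			return h, nil // clean end of log
		}
		if err != nil {
			return nil, err // torn or corrupt record
		}
		h.apply(s)
	}
}

func main() {
	w, h := &wal{}, newHead()
	_ = appendSample(w, h, sample{Ref: 1, T: 1000, V: 0.5})
	_ = appendSample(w, h, sample{Ref: 1, T: 2000, V: 0.7})

	// Simulate a crash: the head is lost, the log survives.
	recovered, err := replay(w.buf.Bytes())
	if err != nil {
		panic(err)
	}
	fmt.Println("recovered:", recovered.last[1], "original:", h.last[1])
}
```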

tsdb/wlog/wlog.go — WL struct L182

type WL struct {
    dir            string
    logger         *slog.Logger
    segmentSize    int    // default 128 MiB
    mtx            sync.RWMutex
    segment        *Segment  // active segment file
    donePages      int
    page           [pageSize]byte  // 32 KiB page buffer
    actorc         chan func()
    stopc          chan chan struct{}
    compress       CompressionType // snappy or zstd
    ...
}

Record Types

Record	Written when	Content
SERIES	First time a label set is seen	series ref + labels.Labels
SAMPLES	Commit() after Append()	[]RefSample{ref, t, v}
EXEMPLARS	Commit() after AppendExemplar()	[]RefExemplar{ref, exemplar}
HISTOGRAMS	Commit() after AppendHistogram()	[]RefHistogramSample{ref, t, h}
METADATA	metadata updates	[]RefMetadata{ref, type, unit, help}
TOMBSTONES	Delete() intervals	[]Stone{ref, {mint,maxt}}
MMAPMARKERS	chunk m-mapped	refs of flushed chunks

Segment Layout

data/wal/
├── 00000001    ← completed segment (128 MiB)
├── 00000002    ← completed segment
└── 00000003    ← active segment (being written)

data/chunks_head/
└── 000001      ← m-mapped head chunks

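WAL segments are deleted once the data they cover has been compacted out of the Head into a Block. Before old segments are removed, a checkpoint (wlog.Checkpoint()) rewrites them into a compact form that keeps only the records still needed to rebuild the Head, such as series that are still live and samples newer than the truncation point.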

▶🧠 Head — In-Memory Block

tsdb/head.go — Head struct (key fields) L71

type Head struct {
    chunkRange atomic.Int64   // maximum time range for a chunk (default 2h)
    numSeries  atomic.Uint64
    minTime, maxTime atomic.Int64

    wal, wbl *wlog.WL  // WAL and Write-Behind Log (OOO samples)

    exemplars ExemplarStorage

    // Striped map: series are sharded over Options.StripeSize stripes
    // (default 1 << 14 = 16384), each with its own lock, to reduce contention.
    series *stripeSeries

    // Pools to recycle slices without GC pressure.
    floatsPool          zeropool.Pool[[]record.RefSample]
    histogramsPool      zeropool.Pool[[]record.RefHistogramSample]
    ...
}

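The stripeSeries structure is a hash-sharded map from HeadSeriesRef to *memSeries, split by default into DefaultStripeSize (1 << 14 = 16384) stripes that each have their own lock. Appends to series in different stripes do not block each other, which keeps lock contention low under high write concurrency.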
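A minimal sketch of the striping idea, assuming a power-of-two stripe count and sharding directly on the series ref. The series and stripedSeries types are hypothetical simplifications; the stripe count of 8 is arbitrary and chosen only to keep the example small.

```go
package main

import (
	"fmt"
	"sync"
)

// series is a hypothetical stand-in for memSeries.
type series struct {
	ref  uint64
	name string
}

// stripedSeries shards a ref -> series map over a fixed number of stripes,
// each guarded by its own lock.
type stripedSeries struct {
	size   uint64
	locks  []sync.RWMutex
	shards []map[uint64]*series
}

func newStripedSeries(size int) *stripedSeries {
	s := &stripedSeries{
		size:   uint64(size),
		locks:  make([]sync.RWMutex, size),
		shards: make([]map[uint64]*series, size),
	}
	for i := range s.shards {
		s.shards[i] = map[uint64]*series{}
	}
	return s
}

// stripe picks the shard for a ref; size must be a power of two for the mask.
func (s *stripedSeries) stripe(ref uint64) uint64 { return ref & (s.size - 1) }

// set stores a series while holding only the lock of its own stripe.
func (s *stripedSeries) set(m *series) {
	i := s.stripe(m.ref)
	s.locks[i].Lock()
	s.shards[i][m.ref] = m
	s.locks[i].Unlock()
}

// getByID looks a series up under the read lock of its stripe.
func (s *stripedSeries) getByID(ref uint64) *series {
	i := s.stripe(ref)
	s.locks[i].RLock()
	defer s.locks[i].RUnlock()
	return s.shards[i][ref]
}

func main() {
	ss := newStripedSeries(8)
	var wg sync.WaitGroup
	for ref := uint64(1); ref <= 100; ref++ {
		wg.Add(1)
		go func(ref uint64) {
			defer wg.Done()
			ss.set(&series{ref: ref, name: fmt.Sprintf("series_%d", ref)})
		}(ref)
	}
	wg.Wait()
	fmt.Println(ss.getByID(42).name) // prints "series_42"
}
```

Writers only contend when their refs map to the same stripe, so the number of stripes bounds how many writers can proceed in parallel.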

▶📈 memSeries — Per-Series Storage

tsdb/head.go — memSeries struct L2508

type memSeries struct {
    // Immutable after construction — no lock needed.
    ref       chunks.HeadSeriesRef
    shardHash uint64

    sync.Mutex  // guards everything below

    lset labels.Labels

    // mmappedChunks: completed chunks flushed to disk (memory-mapped).
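    // firstChunkID is the HeadChunkID of mmappedChunks[0]; a chunk's ID is
    // firstChunkID plus its index, so IDs stay stable when truncation drops
    // older chunks from the front of the slice.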
    mmappedChunks []*mmappedChunk
    firstChunkID  chunks.HeadChunkID

    // headChunks: linked list of in-memory chunks still being written.
    // headChunks → headChunks.prev → ... (most recent first)
    headChunks *memChunk

    ooo *memSeriesOOOFields  // out-of-order sample state
    ...
}

Chunk Lifecycle for a memSeries

1. First sample → allocate a memChunk with an XOR encoder; attach it as headChunks.
2. Samples are appended to the active head chunk via XOR encoding (~120 samples max).
3. Chunk full or time range exceeded → flush to a chunks_head/ m-map file via chunkDiskMapper.
4. The flushed chunk's reference is stored in mmappedChunks; its memory is freed.
5. Compaction moves mmappedChunks into a Block; firstChunkID advances.
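The cut rule in the lifecycle above can be sketched with a simplified series that keeps raw samples instead of XOR-encoded bytes. The memSeriesSketch type and its fields are hypothetical, and the 120-sample target is treated as a hard cap here, which is a simplifying assumption.

```go
package main

import "fmt"

const samplesPerChunk = 120 // target from the text; treated as a hard cap here

type sample struct {
	t int64
	v float64
}

// chunk holds raw samples; the real head chunk holds XOR-encoded bytes.
type chunk struct {
	minT, maxT int64
	samples    []sample
}

// memSeriesSketch is a hypothetical, simplified memSeries.
type memSeriesSketch struct {
	chunkRange int64    // maximum time span of one chunk, in ms
	head       *chunk   // chunk currently being written
	flushed    []*chunk // stands in for mmappedChunks
}

// append adds a sample, cutting a new head chunk when the current one is full
// or the sample falls outside its time range. It reports whether a cut happened.
func (s *memSeriesSketch) append(t int64, v float64) bool {
	cut := false
	if s.head != nil && (len(s.head.samples) >= samplesPerChunk || t >= s.head.minT+s.chunkRange) {
		s.flushed = append(s.flushed, s.head) // stands in for the m-map flush
		s.head = nil
		cut = true
	}
	if s.head == nil {
		s.head = &chunk{minT: t}
	}
	s.head.samples = append(s.head.samples, sample{t: t, v: v})
	s.head.maxT = t
	return cut
}

func main() {
	s := &memSeriesSketch{chunkRange: 2 * 60 * 60 * 1000} // 2h in ms
	for i := 0; i < 250; i++ {
		s.append(int64(i)*15_000, float64(i)) // one sample every 15s
	}
	// 250 samples -> two full chunks of 120 plus 10 samples in the head chunk.
	fmt.Println("flushed chunks:", len(s.flushed), "samples in head chunk:", len(s.head.samples))
}
```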

XOR Chunk Encoding

Prometheus uses the Gorilla XOR encoding for float samples, adapted from Facebook's paper:

Component	Encoding	Typical size
timestamp delta	delta-of-delta, variable bits	1–3 bytes
float value	XOR of previous, leading/trailing zeros compressed	0–9 bytes
per sample average	combined	~1.37 bytes

Implementation: tsdb/chunkenc/xor.go
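To make the two ideas in the table concrete, the sketch below computes the quantities the encoder works with, without doing any bit packing. It is not the code in tsdb/chunkenc/xor.go; deltaOfDeltas and xorWindow are hypothetical helpers.

```go
package main

import (
	"fmt"
	"math"
	"math/bits"
)

// deltaOfDeltas returns, for each timestamp from the third one on, the change
// in the interval between consecutive timestamps.
func deltaOfDeltas(ts []int64) []int64 {
	var out []int64
	for i := 2; i < len(ts); i++ {
		out = append(out, (ts[i]-ts[i-1])-(ts[i-1]-ts[i-2]))
	}
	return out
}

// xorWindow XORs the bit patterns of two consecutive values and reports how
// many leading and trailing zero bits surround the meaningful bits.
func xorWindow(prev, cur float64) (xor uint64, leading, trailing int) {
	xor = math.Float64bits(prev) ^ math.Float64bits(cur)
	if xor == 0 {
		return 0, 64, 64 // identical value: nothing but a marker bit is needed
	}
	return xor, bits.LeadingZeros64(xor), bits.TrailingZeros64(xor)
}

func main() {
	// Regular 15s scrapes with one sample arriving 1ms late.
	fmt.Println(deltaOfDeltas([]int64{0, 15000, 30000, 45001, 60000})) // [0 1 -2]

	xor, lead, trail := xorWindow(12.0, 24.0)
	fmt.Printf("xor=%#x leading=%d trailing=%d significant=%d\n",
		xor, lead, trail, 64-lead-trail)
}
```

With regular scrape intervals the delta-of-delta values cluster around zero, and slowly changing values leave only a few significant XOR bits, which is why both components usually fit into a handful of bits.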

▶✏ headAppender — Transactional Append

The headAppender buffers the samples of a single scrape in memory. Nothing is visible to readers until Commit(), which first writes the WAL records and only then appends the samples to their memSeries.

tsdb/head_append.go — headAppender.Append() L434

func (a *headAppender) Append(ref storage.SeriesRef,
    lset labels.Labels, t int64, v float64) (storage.SeriesRef, error) {

    // 1. Look up existing series by ref or by label hash.
    s := a.head.series.getByID(chunks.HeadSeriesRef(ref))
    if s == nil {
        // 2. New series: register it, get a new ref.
        var (
            created bool
            err     error
        )
        s, created, err = a.head.getOrCreate(lset.Hash(), lset)
        if err != nil {
            return 0, err
        }
        if created {
            // 3. WAL SERIES record scheduled.
            a.series = append(a.series, record.RefSeries{
                Ref:    s.ref,
                Labels: lset,
            })
        }
    }

    // 4. Accumulate sample for batch WAL write.
    a.samples = append(a.samples, record.RefSample{
        Ref: s.ref, T: t, V: v,
    })
    a.sampleSeries = append(a.sampleSeries, s)
    return storage.SeriesRef(s.ref), nil
}

func (a *headAppender) Commit() error {
    // 5. Write WAL SERIES + SAMPLES records atomically.
    // 6. For each sample: s.append(t, v, ...) → XOR encode into headChunk.
    // 7. If chunk full: enqueue for m-map flush.
    ...
}
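The buffer-then-commit shape of the appender can be sketched in isolation. The store and batchAppender types below are hypothetical stand-ins for the Head with its WAL and for headAppender; the log is reduced to a counter so the ordering stays visible.

```go
package main

import "fmt"

type sample struct {
	ref uint64
	t   int64
	v   float64
}

// store is a hypothetical stand-in for the Head plus its WAL.
type store struct {
	walRecords int               // number of batches logged
	latest     map[uint64]sample // newest sample per series
}

// batchAppender buffers samples until Commit or Rollback is called.
type batchAppender struct {
	st      *store
	pending []sample
}

// Append only buffers; the store is not touched yet.
func (a *batchAppender) Append(ref uint64, t int64, v float64) {
	a.pending = append(a.pending, sample{ref: ref, t: t, v: v})
}

// Commit logs the whole batch as one record, then applies it in memory.
func (a *batchAppender) Commit() {
	if len(a.pending) == 0 {
		return
	}
	a.st.walRecords++ // log first
	for _, s := range a.pending {
		a.st.latest[s.ref] = s // then update the series
	}
	a.pending = a.pending[:0]
}

// Rollback discards the batch; nothing was logged or applied.
func (a *batchAppender) Rollback() { a.pending = a.pending[:0] }

func main() {
	st := &store{latest: map[uint64]sample{}}
	app := &batchAppender{st: st}

	app.Append(1, 1000, 0.5)
	app.Append(2, 1000, 42)
	fmt.Println("visible before commit:", len(st.latest))
	app.Commit()
	fmt.Println("visible after commit:", len(st.latest), "WAL records:", st.walRecords)
}
```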

▶🔧 Compaction — Head → Block

When the Head accumulates more than chunkRange * 3/2 (default 3h) of data, a compaction is triggered. The oldest portion of the head is written to a new immutable on-disk Block.
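The trigger condition is a one-line comparison. The sketch below assumes timestamps in milliseconds; headCompactable is a hypothetical helper that mirrors the rule stated above rather than a function from the Prometheus code base.

```go
package main

import (
	"fmt"
	"time"
)

// headCompactable reports whether the head spans more than 1.5x chunkRange.
// All arguments are in milliseconds.
func headCompactable(minT, maxT, chunkRange int64) bool {
	return maxT-minT > chunkRange/2*3
}

func main() {
	chunkRange := (2 * time.Hour).Milliseconds()
	start := int64(0)
	for _, span := range []time.Duration{2 * time.Hour, 3 * time.Hour, 3*time.Hour + time.Minute} {
		fmt.Println(span, headCompactable(start, start+span.Milliseconds(), chunkRange))
	}
	// 2h0m0s false
	// 3h0m0s false
	// 3h1m0s true
}
```

Waiting for 1.5x the chunk range rather than 1x leaves the most recent data in the Head, so late in-order samples for the newest window can still be appended after a block has been cut.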

tsdb/compact.go — LeveledCompactor

// Block directory layout after compaction:
data/
├── 01HJXXXXXXXXXXXXX/       ← ULID (time-sortable unique ID)
│   ├── chunks/
│   │   ├── 000001           ← raw XOR-compressed chunk data
│   │   └── 000002
│   ├── index                ← inverted index: label → posting list
│   ├── tombstones           ← delete intervals
│   └── meta.json            ← {ulid, minTime, maxTime, stats, compaction}
└── 01HJYYYYY.../

Block Merge (Level Compaction)

Level	Time range	Triggered by
0 (head flush)	≤ 2h	Head min time advancing
1	≤ 2h × 3 = 6h	adjacent L0 blocks within one aligned 6h window
2	≤ 18h	adjacent L1 blocks within one aligned 18h window
N	≤ 2h × 3^N	cascading merge
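The range column follows a geometric progression, which a few lines of code reproduce. blockRanges is a hypothetical helper written for this example; the base of 2h and the factor of 3 are the values from the table.

```go
package main

import (
	"fmt"
	"time"
)

// blockRanges returns the maximum time range per level: base * factor^level.
func blockRanges(base time.Duration, levels, factor int) []time.Duration {
	ranges := make([]time.Duration, 0, levels)
	cur := base
	for i := 0; i < levels; i++ {
		ranges = append(ranges, cur)
		cur *= time.Duration(factor)
	}
	return ranges
}

func main() {
	for level, r := range blockRanges(2*time.Hour, 4, 3) {
		fmt.Printf("level %d: <= %v\n", level, r)
	}
	// level 0: <= 2h0m0s
	// level 1: <= 6h0m0s
	// level 2: <= 18h0m0s
	// level 3: <= 54h0m0s
}
```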

Retention is enforced after compaction. Blocks whose maxTime lies retentionDuration or more behind the maxTime of the newest block are marked for deletion and removed from DB.blocks.

▶⏪ Out-of-Order (OOO) Samples

Since Prometheus 2.39, OOO samples (samples whose timestamp is older than the newest sample already stored for the same series) are buffered in the memSeriesOOOFields structure and logged to a separate Write-Behind Log (WBL), then merged into regular blocks at compaction time.

tsdb/head.go — OOO fields

type memSeriesOOOFields struct {
    oooMmappedChunks []*mmappedChunk  // flushed OOO chunks
    oooHeadChunk     *oooHeadChunk    // current in-memory OOO chunk
    firstOOOChunkID  chunks.HeadChunkID
}

// OOO write path:
//  headAppender.Append() → detects t < s.maxTime
//  → s.appendOOO(t, v)
//  → written to wbl (Write-Behind Log, separate WAL)
//  → OOO compaction merges into regular blocks

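OOO ingestion is enabled by setting out_of_order_time_window in the TSDB section of the configuration file. The --storage.tsdb.allow-overlapping-compaction flag controls whether the resulting overlapping blocks are compacted together. Samples older than the window are not ingested: the append is rejected with an error instead of being stored.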
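The routing decision for an incoming sample can be sketched as a three-way classification. This is a simplified model under stated assumptions: the window is measured back from the newest timestamp in the head, duplicate-timestamp handling is left out, and the verdict type and classify function are hypothetical.

```go
package main

import "fmt"

type verdict string

const (
	inOrder    verdict = "in-order"
	outOfOrder verdict = "out-of-order"
	tooOld     verdict = "too-old"
)

// classify decides where a sample with timestamp t would go. seriesMaxT is the
// newest timestamp already stored for the series, headMaxT the newest timestamp
// in the whole head, and window the allowed out-of-order window (0 disables
// OOO ingestion). All values are in milliseconds.
func classify(t, seriesMaxT, headMaxT, window int64) verdict {
	switch {
	case t >= seriesMaxT:
		return inOrder // regular head chunk
	case window > 0 && t >= headMaxT-window:
		return outOfOrder // OOO chunk plus WBL
	default:
		return tooOld // outside the window: not ingested
	}
}

func main() {
	const (
		maxT   = int64(10_000_000)
		window = int64(30 * 60 * 1000) // 30 minutes
	)
	for _, t := range []int64{10_015_000, 9_000_000, 5_000_000} {
		fmt.Println(t, classify(t, maxT, maxT, window))
	}
	// 10015000 in-order
	// 9000000 out-of-order
	// 5000000 too-old
}
```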