Apache Spark Data Flow

The mental model in one paragraph

A Spark application has one driver process and many executor processes. The driver runs your code and holds a SparkContext (or a SparkSession for SQL/DataFrames). When you transform data, Spark does not compute anything; it records a lazy lineage graph of RDDs. When you call an action, the driver hands the lineage to the DAGScheduler, which cuts it into stages at shuffle boundaries and submits each stage as a set of tasks. The TaskScheduler sends tasks to executors over a SchedulerBackend. Each executor runs tasks, reads and writes blocks through its BlockManager, exchanges data through the shuffle system, and reports results back to the driver.

The two planes: control and data

             CONTROL PLANE                                  DATA PLANE
        ┌───────────────────────────┐               ┌───────────────────────────┐
        │           DRIVER           │               │         EXECUTOR(S)        │
        │                            │   tasks       │                            │
        │  SparkContext / Session    │ ───────────►  │  CoarseGrainedExecutor     │
        │    DAGScheduler            │               │    Executor + TaskRunner   │
        │    TaskSchedulerImpl       │ ◄───────────  │    BlockManager            │
        │    SchedulerBackend        │   results     │    ShuffleManager          │
        │    MapOutputTrackerMaster  │   heartbeats  │    MemoryManager           │
        └─────────────┬──────────────┘               └─────────────┬──────────────┘
                      │ asks for executors                         │ shuffle files
                      ▼                                            ▼
        ┌───────────────────────────┐               ┌───────────────────────────┐
        │   CLUSTER MANAGER          │               │   LOCAL DISK / REMOTE      │
        │  Standalone / YARN / K8s   │               │   BLOCK TRANSFER SERVICE   │
        └───────────────────────────┘               └───────────────────────────┘

The driver is the control plane: it decides what to run and where. Executors are the data plane: they hold cached partitions, run task code, and move shuffle data between each other. A cluster manager only places processes; it does not participate in scheduling individual tasks.

One job, end to end

Click a step to highlight the block that explains it. Every step is also detailed on a dedicated component page.

1. The driver builds its runtime first

Constructing a SparkContext is the moment a Spark application comes alive. It clones the SparkConf, creates a SparkEnv (the container of shared services: serializer, BlockManager, MapOutputTracker, shuffle and memory managers), then builds the scheduler trio: a SchedulerBackend, a TaskSchedulerImpl, and a DAGScheduler.

The factory createTaskScheduler matches on the master URL (local[*], spark://, YARN, Kubernetes) and picks the right backend, which is how the same code runs on a laptop or a thousand-node cluster.

SparkContext.scala L599-L603 — scheduler creation

Full detail: Driver & SparkContext.

2. Transformations only describe the computation

An RDD is an immutable, partitioned collection defined by five properties: its partitions, a compute function, a list of dependencies, an optional partitioner, and optional preferred locations. A transformation like map simply wraps the parent in a new MapPartitionsRDD — it builds a node in a graph rather than moving data.

The type of dependency decides everything downstream: a NarrowDependency (map, filter) can be pipelined; a ShuffleDependency forces a data exchange and therefore a stage boundary.

RDD.scala L69-L83 — the five properties

Full detail: RDD & Lineage.

3. Actions are the only thing that launches work

Transformations are lazy; actions are eager. collect, count, reduce, and save all funnel through SparkContext.runJob, documented in the source as "the main entry point for all actions in Spark." It cleans the user closure, logs the call site, and delegates to DAGScheduler.runJob.

SparkContext.scala L2471-L2499 — runJob

Full detail: RDD & Lineage.

4. The DAGScheduler turns a graph into stages

The DAGScheduler is the high-level, stage-oriented scheduler. It breaks the RDD graph at shuffle boundaries: narrow dependencies are pipelined into one stage, while each shuffle dependency creates a barrier between a ShuffleMapStage (which writes map output) and the stage that reads it. Every job ends in exactly one ResultStage.

DAGScheduler.scala L58-L72 — stage boundaries

Full detail: DAG & Task Scheduling.

5. Stages become TaskSets placed by locality

When a stage's parents are ready, submitMissingTasks builds one task per missing partition and submits them as a TaskSet. The TaskSchedulerImpl hands each TaskSet to a TaskSetManager, then fills executor "offers" by walking locality levels from PROCESS_LOCAL to ANY, preferring executors that already hold the data.

TaskSchedulerImpl.scala L512-L520 — resourceOffers

Full detail: DAG & Task Scheduling.

6. Executors deserialize and run each task

The driver sends a LaunchTask message; the executor's backend decodes the TaskDescription and calls Executor.launchTask, which submits a TaskRunner to a thread pool. The runner deserializes the task, fetches dependencies, and calls task.run. Inside, the task calls RDD.iterator, which returns cached data, reads a checkpoint, or calls compute to materialize the partition.

RDD.scala L334-L340 — iterator

Full detail: Executor, Storage & Memory.

7. The shuffle moves data between stages

A ShuffleMapTask sorts its records by target partition and writes a single data file plus an index file, then reports a MapStatus to the driver's MapOutputTrackerMaster. A downstream reduce task asks the tracker where its blocks live and fetches contiguous byte ranges from each map output through the block transfer service.

SortShuffleManager.scala L30-L45 — sort-based shuffle

Full detail: Shuffle.

8. Results flow back and the job completes

A ResultTask runs the action's function over its partition and sends the value to the driver (inline for small results, via the block manager for large ones). The DAGScheduler records each completion; when the ResultStage has all partitions, the job's JobWaiter wakes up and runJob returns.

ResultTask.scala L55-L72 — ResultTask

Full detail: DAG & Task Scheduling.

The core objects you will keep meeting

Object	Role in the data flow	Defined in
`SparkContext`	Driver-side entry point; creates the runtime and is the gateway to every action.	SparkContext.scala
`SparkEnv`	Container of shared runtime services, reachable on any node via `SparkEnv.get`.	SparkEnv.scala
`RDD[T]`	The lazy, partitioned collection; the unit of lineage.	RDD.scala
`Dependency`	An edge between RDDs; narrow vs. shuffle decides stage boundaries.	Dependency.scala
`Stage`	A set of parallel tasks over one RDD, bounded by shuffles.	Stage.scala
`Task`	The serializable unit of work shipped to an executor.	Task.scala
`BlockManager`	Per-node store for cached partitions, broadcast, and shuffle blocks.	BlockManager.scala
`MapOutputTracker`	Driver registry of where each shuffle map output lives.	MapOutputTracker.scala
`SparkSession`	Entry point for SQL and DataFrames; wraps a `SparkContext`.	SparkSession.scala

Component pages

Driver & SparkContext: boot sequence, SparkConf, SparkEnv, the scheduler trio.
RDD & Lineage: the five properties, transformations vs. actions, dependencies, partitioning, caching.
DAG & Task Scheduling: stages, tasks, TaskSetManager, locality, retries, fault recovery.
Executor, Storage & Memory: TaskRunner, BlockManager, unified memory.
Shuffle: sort-based shuffle, writers/readers, the map output tracker.
Spark SQL & Catalyst: logical plans, the optimizer, physical planning, Tungsten codegen.
Structured Streaming: the micro-batch loop, offsets, checkpoints, exactly-once.
Cluster & Deploy: spark-submit, deploy modes, Standalone master/worker.

External references

RDD paper (NSDI 2012) — the original abstraction.
Resilient Distributed Datasets (full)
Spark SQL paper (SIGMOD 2015) — Catalyst and DataFrames.
Structured Streaming paper (SIGMOD 2018)
Spark cluster mode overview
RDD programming guide
Spark SQL guide
Memory tuning guide

From your program to distributed computation

The mental model in one paragraph

The two planes: control and data

One job, end to end

The core objects you will keep meeting

Component pages

External references