3. Main runtime architecture
Problem solved
Spark solves large-scale data-parallel computation with a unified programming model. Users write declarative or functional transformations; Spark handles partitioning, scheduling, fault tolerance (lineage recomputation), shuffle, and data locality. It integrates with Hadoop-compatible filesystems, Hive metastore, and cloud object stores.
Runtime components
┌──────────────── Driver JVM ────────────────┐
│ SparkSession / SparkContext │
│ DAGScheduler, TaskSchedulerImpl │
│ BlockManagerMaster, MapOutputTrackerMaster │
│ SparkUI, LiveListenerBus │
│ [Connect] SparkConnectService (gRPC) │
└───────────────┬───────────────────────────────┘
│ RpcEnv (Netty)
┌─────────────────────┼─────────────────────┐
▼ ▼ ▼
┌────────────────┐ ┌────────────────┐ ┌────────────────┐
│ Executor JVM │ │ Executor JVM │ │ Executor JVM │
│ BlockManager │ │ BlockManager │ │ BlockManager │
│ ShuffleManager │ │ ShuffleManager │ │ ShuffleManager │
│ Task threads │ │ Task threads │ │ Task threads │
└────────────────┘ └────────────────┘ └────────────────┘
Core abstractions
| Abstraction | Location | Meaning |
|---|---|---|
RDD[T] | core/.../rdd/RDD.scala | Immutable partitioned collection with lineage |
Dependency | core/.../Dependency.scala | Narrow (pipeline) vs ShuffleDependency (stage break) |
Stage | core/.../scheduler/Stage.scala | ShuffleMapStage or ResultStage — set of tasks on same RDD op chain |
Task | core/.../scheduler/Task.scala | ShuffleMapTask or ResultTask — unit sent to one core |
LogicalPlan | sql/catalyst/.../logical/ | Unresolved/resolved relational algebra tree |
SparkPlan | sql/core/.../execution/SparkPlan.scala | Physical operator tree (*Exec suffix) |
QueryExecution | sql/core/.../execution/QueryExecution.scala | Lazy pipeline: analyze → optimize → plan → execute |
SparkEnv | core/.../SparkEnv.scala | Per-JVM runtime bag: serializer, RpcEnv, BlockManager, etc. |
Data flows
- Input: Hadoop
FileSystem, DataSource V2 connectors, JDBC, Kafka (via connector), in-memory collections. - Shuffle: Map tasks write partitioned files (sort-based by default via
SortShuffleManager); reduce tasks fetch viaBlockStoreClient. - Cache:
BlockManagerstores serialized/deserialized blocks in memory or spills to disk (StorageLevel). - Output: Actions collect to driver, write to sinks (Parquet, ORC, Hive tables), or stream via Structured Streaming sinks.
Control flows
- Job submission: Action →
SparkContext.runJob→DAGScheduler.submitJob→ event loop →TaskScheduler.submitTasks. - Resource offers:
CoarseGrainedSchedulerBackendreceives executor registrations, sendsLaunchTaskRPC messages. - Failure recovery: Task failures retried by
TaskSchedulerImpl; shuffle fetch failures trigger stage resubmit inDAGScheduler. - SQL planning:
SessionState.executePlancreatesQueryExecution; each lazy val advances one compiler phase.
External systems
| System | Integration point |
|---|---|
| HDFS / S3 / GCS / ABFS | Hadoop FileSystem, hadoop-cloud/, DataSource readers |
| Hive Metastore | sql/hive/HiveExternalCatalog.scala |
| YARN | resource-managers/yarn/, --master yarn |
| Kubernetes | resource-managers/kubernetes/core/, --master k8s://... |
| Kafka | connector/kafka-0-10-sql/ |
| JDBC databases | sql/core/.../jdbc/ |
| History server / event log | core/.../deploy/history/ |
| Metrics (JMX, Prometheus sink) | core/.../metrics/ |
Startup sequence (driver, client mode)
spark-submit MyApp
→ launcher.Main builds java command
→ SparkSubmit.doSubmit → runMain (user main)
→ SparkSession.builder.getOrCreate()
→ SparkContext(conf)
→ SparkEnv.createDriverEnv
→ createTaskScheduler (Local / Standalone / YARN / K8s backend)
→ new DAGScheduler
→ taskScheduler.start → backend.start
→ blockManager.initialize
→ SparkUI.bind
→ application runs until sc.stop() or JVM exit
Startup sequence (executor)
CoarseGrainedExecutorBackend.main
→ RpcEnv + fetch SparkAppConfig from driver
→ SparkEnv.createExecutorEnv
→ RegisterExecutor RPC to driver
→ new Executor(...) — thread pool ready
→ await LaunchTask messages
System purpose — mental model for engineers
When debugging Spark, ask four questions:
- What graph? RDD lineage or Catalyst
df.queryExecution.executedPlan. - What stages? Spark UI /
DAGSchedulerstage list — shuffle boundaries. - Where is data? Block manager cache, shuffle files, or source partitions.
- Who schedules? Driver
TaskScheduler+ cluster manager for containers.
Spark is not a classic request/response server. User-facing "requests" are actions or streaming micro-batches that enqueue jobs on an internal event-driven scheduler.