Driver & SparkContext — Spark Data Flow

Boot sequence at a glance

new SparkConf()                      // 1. read spark.* settings
      │
      ▼
new SparkContext(conf)               // 2. the driver comes alive
      │
      ├─► createSparkEnv(...)         //    SparkEnv: serializer, blockManager,
      │                              //    mapOutputTracker, broadcastManager
      │
      ├─► initializeShuffleManager()  //    deferred so plugins can customize
      ├─► initializeMemoryManager()
      │
      ├─► createTaskScheduler(master) // 3. backend + TaskSchedulerImpl
      ├─► new DAGScheduler(this)
      │
      └─► taskScheduler.start()       // 4. backend registers, app gets an id

Everything below the constructor happens once, eagerly, inside the SparkContext initialization block. After it returns, the application is ready to build RDDs and run jobs.

SparkConf — the frozen configuration

SparkConf is a simple key/value store. Calling new SparkConf() loads every spark.* property from the JVM system properties. Once it is handed to a SparkContext, the context clones it, so later mutations to your original object have no effect — configuration is effectively immutable for the life of the application.

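This is why core settings such as spark.executor.memory or spark.serializer must be set before the context is created. Runtime SQL options such as spark.sql.shuffle.partitions are a separate mechanism and can still be changed on a live session.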

SparkConf.scala L287-L296 — class and default constructor
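
The clone-on-construction behaviour described above can be illustrated without a Spark dependency. The following is a minimal sketch, not Spark's code: MiniConf and MiniContext are hypothetical stand-ins for SparkConf and SparkContext, and the only thing they model is the defensive copy.

```scala
import scala.collection.mutable

// Hypothetical stand-in for SparkConf: a mutable key/value store that can be copied.
final class MiniConf private (private val settings: mutable.Map[String, String]) {
  def this() = this(mutable.Map.empty[String, String])
  def set(key: String, value: String): MiniConf = { settings(key) = value; this }
  def get(key: String): String = settings(key)
  def copy(): MiniConf = new MiniConf(settings.clone())
}

// Hypothetical stand-in for SparkContext: it snapshots the conf when constructed.
final class MiniContext(userConf: MiniConf) {
  private val conf: MiniConf = userConf.copy()
  def confValue(key: String): String = conf.get(key)
}

object ConfSnapshotDemo {
  def main(args: Array[String]): Unit = {
    val conf = new MiniConf().set("spark.executor.memory", "1g")
    val ctx = new MiniContext(conf)

    // Mutating the caller's object after construction does not reach the context.
    conf.set("spark.executor.memory", "4g")

    println(ctx.confValue("spark.executor.memory")) // prints 1g
    println(conf.get("spark.executor.memory"))      // prints 4g
  }
}
```

The snapshot is taken once, in the constructor, which is why a late set call is silently ignored rather than rejected.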

SparkContext — the entry point

The class doc calls it "the main entry point for Spark functionality" — the connection to a cluster used to create RDDs, accumulators, and broadcast variables. Only one active SparkContext may exist per JVM.

Its constructor declares four runtime pillars as private fields — _env, _schedulerBackend, _taskScheduler, _dagScheduler — and fills them in during initialization. Holding these references is what makes the context the hub of the whole system.

SparkContext.scala L86 — class definition

SparkContext.scala L220-L229 — the runtime fields

SparkEnv — the shared service container

SparkEnv "holds all the runtime environment objects for a running Spark instance." A single instance exists per process (driver or executor) and is reachable from anywhere via the global SparkEnv.get. This is how task code running deep inside an RDD can reach the local BlockManager without threading references through every call.

It bundles the serializer, the closure serializer, the BlockManager, the MapOutputTracker (a master on the driver, a worker on executors), the broadcast manager, and lazily-initialized shuffle and memory managers.

SparkEnv.scala L64-L76 — the held components

SparkEnv.scala L331-L360 — createDriverEnv
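
The global-accessor idea can be sketched in a few lines of plain Scala. This is an illustrative model under stated assumptions, not the SparkEnv source: MiniEnv, MiniBlockManager, and EnvLookupDemo are hypothetical names, and the environment here bundles a single service instead of the full set listed above.

```scala
// Hypothetical stand-ins for one bundled service and the environment that holds it.
final class MiniBlockManager(val executorId: String)
final class MiniEnv(val executorId: String, val blockManager: MiniBlockManager)

object MiniEnv {
  // One instance per process, published through a volatile field so that
  // task threads started later observe the value that was set.
  @volatile private var env: MiniEnv = null

  def set(e: MiniEnv): Unit = { env = e }
  def get: MiniEnv = env
}

object EnvLookupDemo {
  // "Task code" that receives no environment argument, yet still finds
  // the block manager of whichever process it happens to run in.
  def whereAmI(): String = MiniEnv.get.blockManager.executorId

  def main(args: Array[String]): Unit = {
    MiniEnv.set(new MiniEnv("driver", new MiniBlockManager("driver")))
    println(whereAmI()) // prints driver
  }
}
```

The trade-off is the usual one for process-wide state: call sites stay short, but the dependency on the environment is implicit.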

Why shuffle and memory managers initialize late

Notice that SparkEnv creates the shuffle and memory managers after the rest of the environment, through initializeShuffleManager() and initializeMemoryManager(...). This deferral lets user JARs and plugins register custom implementations (for example, a custom ShuffleManager) before they are instantiated.

SparkContext.scala L596-L597 — deferred init calls

SparkEnv.scala L293-L308 — initialize methods
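
The value of deferring initialization is easier to see in a small model. The sketch below is an assumption-laden illustration, not Spark's mechanism: LateEnv, MiniShuffleManager, SortShuffle, and register are hypothetical, the short names "sort" and "custom" are placeholders, and a registry of factories is swapped in for whatever loading strategy the real code uses.

```scala
// Hypothetical shuffle-manager interface with one built-in implementation.
trait MiniShuffleManager { def name: String }
final class SortShuffle extends MiniShuffleManager { val name = "sort" }

final class LateEnv(conf: Map[String, String]) {
  // Factories that plugins may extend before initialization; "sort" is built in.
  private var factories: Map[String, () => MiniShuffleManager] =
    Map("sort" -> (() => new SortShuffle))
  private var manager: Option[MiniShuffleManager] = None

  def register(key: String, factory: () => MiniShuffleManager): Unit =
    factories += (key -> factory)

  // Called late, after plugins have had a chance to register.
  def initializeShuffleManager(): Unit = {
    require(manager.isEmpty, "shuffle manager already initialized")
    val key = conf.getOrElse("spark.shuffle.manager", "sort")
    manager = Some(factories(key)())
  }

  def shuffleManager: MiniShuffleManager =
    manager.getOrElse(sys.error("shuffle manager not initialized yet"))
}

object LateInitDemo {
  def main(args: Array[String]): Unit = {
    val env = new LateEnv(Map("spark.shuffle.manager" -> "custom"))
    // A "plugin" registers its implementation after the env exists...
    env.register("custom", () => new MiniShuffleManager { val name = "custom" })
    // ...and only then is the manager instantiated.
    env.initializeShuffleManager()
    println(env.shuffleManager.name) // prints custom
  }
}
```

Had the manager been created in the LateEnv constructor, the lookup of "custom" would have failed because the plugin had not registered yet.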

createTaskScheduler — one code path, many cluster managers

This factory pattern-matches on the master URL and returns a (SchedulerBackend, TaskScheduler) pair. local[*] yields a LocalSchedulerBackend; spark:// yields a StandaloneSchedulerBackend; YARN and Kubernetes are loaded through the ExternalClusterManager service-provider interface. The rest of Spark is written against the SchedulerBackend trait, so the same scheduling logic runs unchanged everywhere.

SparkContext.scala L3295-L3399 — createTaskScheduler
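
The dispatch on the master URL can be sketched as a single pattern match. This is a simplified, dependency-free model and not the createTaskScheduler source: the Backend hierarchy and resolve are hypothetical, only three URL shapes are handled, and everything else is lumped into an External case standing in for the service-provider lookup.

```scala
object MasterUrlDemo {
  // Hypothetical result types; the real factory returns scheduler objects.
  sealed trait Backend
  case class Local(threads: Int) extends Backend
  case class Standalone(masterUrls: List[String]) extends Backend
  case class External(url: String) extends Backend

  // Regex extractors: a pattern match succeeds only if the whole string matches.
  private val LocalN = """local\[([0-9]+|\*)\]""".r
  private val SparkUrl = """spark://(.*)""".r

  def resolve(
      master: String,
      cores: Int = Runtime.getRuntime.availableProcessors()): Backend =
    master match {
      case "local"         => Local(1)
      case LocalN(n)       => Local(if (n == "*") cores else n.toInt)
      case SparkUrl(hosts) => Standalone(hosts.split(",").toList.map("spark://" + _))
      case other           => External(other) // e.g. yarn or k8s://...
    }

  def main(args: Array[String]): Unit = {
    println(resolve("local[4]"))                // prints Local(4)
    println(resolve("spark://h1:7077,h2:7077")) // prints Standalone(List(spark://h1:7077, spark://h2:7077))
    println(resolve("yarn"))                    // prints External(yarn)
  }
}
```

Because callers only ever see the Backend type, adding a new cluster manager means adding one case here and nothing elsewhere, which mirrors the point made above about the SchedulerBackend trait.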

runJob — the bridge from your code to the cluster

Every action ends here. runJob takes a target RDD, a function to run on each partition, the set of partition indices, and a result handler. It cleans the closure (so it can be serialized and shipped), records the call site for the UI, and delegates to dagScheduler.runJob. After the job finishes it triggers any pending checkpointing.

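// Simplified from SparkContext.runJob; parameter types and some arguments are elided.
def runJob[T, U](rdd, func, partitions, resultHandler): Unit = {
  val callSite = getCallSite()             // record where the action was invoked, for the UI
  val cleanedFunc = clean(func)            // make the closure shippable
  dagScheduler.runJob(rdd, cleanedFunc,    // hand off to the DAG scheduler
                      partitions, callSite, resultHandler, ...)
  rdd.doCheckpoint()                       // materialize checkpoints if any
}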

SparkContext.scala L2481-L2499 — runJob
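
The contract of runJob (a per-partition function, a subset of partition indices, and a callback that receives each result with its index) can be tried out on an in-memory model. This is a hedged sketch, not the Spark API: the "RDD" is a hypothetical Vector of partitions, the function takes only an Iterator instead of a task context plus iterator, and everything runs sequentially in one thread.

```scala
object RunJobDemo {
  // Hypothetical in-memory "RDD": one Seq per partition.
  def runJob[T, U](
      partitions: Vector[Seq[T]],
      func: Iterator[T] => U,
      selected: Seq[Int],
      resultHandler: (Int, U) => Unit): Unit =
    selected.foreach { index =>
      // Run the function on one partition and report (partition index, result).
      resultHandler(index, func(partitions(index).iterator))
    }

  def main(args: Array[String]): Unit = {
    val data = Vector(Seq(1, 2, 3), Seq(4, 5), Seq(6))
    val sums = new Array[Int](data.length)

    // Sum partitions 0 and 2 only; partition 1 is never touched.
    runJob[Int, Int](data, _.sum, Seq(0, 2), (i, s) => sums(i) = s)

    println(sums.mkString(",")) // prints 6,0,6
  }
}
```

Returning Unit and pushing results through a handler is what lets higher-level actions decide for themselves how to combine per-partition output.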

How the pieces relate

The driver is a single JVM, but it holds several long-lived collaborators. The SparkContext owns them; SparkEnv makes the shared services globally reachable; the scheduler trio (DAGScheduler, TaskSchedulerImpl, SchedulerBackend) coordinates execution.

| Object | Lives on | Responsibility |
| --- | --- | --- |
| `SparkContext` | Driver only | User API surface; owns the scheduler stack and tracks jobs. |
| `SparkEnv` | Driver and every executor | Shared services (serializer, block manager, trackers) via `SparkEnv.get`. |
| `DAGScheduler` | Driver only | Plans stages from RDD lineage; covered on the scheduling page. |
| `TaskSchedulerImpl` | Driver only | Assigns tasks to executor offers. |
| `SchedulerBackend` | Driver only | Talks to the cluster manager and to executors. |

Key takeaways

Configuration is frozen at context creation; set spark.* properties first.
SparkEnv.get is the back-channel that lets remote task code reach local services.
One factory (createTaskScheduler) hides all cluster-manager differences.
SparkContext.runJob is the single funnel for every action.
