Boot sequence at a glance
new SparkConf() // 1. read spark.* settings
│
▼
new SparkContext(conf) // 2. the driver comes alive
│
├─► createSparkEnv(...) // SparkEnv: serializer, blockManager,
│ // mapOutputTracker, broadcastManager
│
├─► initializeShuffleManager() // deferred so plugins can customize
├─► initializeMemoryManager()
│
├─► createTaskScheduler(master) // 3. backend + TaskSchedulerImpl
├─► new DAGScheduler(this)
│
└─► taskScheduler.start() // 4. backend registers, app gets an id
Every step below new SparkContext(conf) in the diagram happens once, eagerly, inside the SparkContext initialization block. After the constructor returns, the application is ready to build RDDs and run jobs.
SparkConf — the frozen configuration
SparkConf is a simple key/value store. Calling new SparkConf() loads every spark.* property from the JVM system properties. Once it is handed to a SparkContext, the context clones it, so later mutations to your original object have no effect — configuration is effectively immutable for the life of the application.
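This is why settings such as spark.executor.memory or spark.serializer must be set before the context is created. (Runtime SQL options such as spark.sql.shuffle.partitions are a separate case: they live in the session's SQL configuration and can still be changed afterwards.)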
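The clone-on-handoff behaviour can be illustrated with a small model that does not depend on Spark. ToyConf and ToyContext are hypothetical stand-ins invented for this sketch, not Spark classes; only the copy-at-construction pattern is taken from the description above.

```scala
import scala.collection.mutable

object ConfCloneSketch {
  // Hypothetical stand-in for SparkConf: a mutable key/value store.
  final class ToyConf(initial: Map[String, String] = Map.empty) {
    private val settings: mutable.Map[String, String] =
      (mutable.Map.empty[String, String] ++= initial)

    def set(key: String, value: String): ToyConf = { settings(key) = value; this }
    def getOption(key: String): Option[String] = settings.get(key)

    // Copy the entries so the new object shares no state with this one.
    def copy(): ToyConf = new ToyConf(settings.toMap)
  }

  // Hypothetical stand-in for SparkContext: it snapshots the conf it is given.
  final class ToyContext(userConf: ToyConf) {
    private val conf: ToyConf = userConf.copy()
    def getConf: ToyConf = conf.copy() // callers only ever see a copy
  }

  def demo(): (Option[String], Option[String]) = {
    val conf = new ToyConf().set("spark.executor.memory", "4g")
    val ctx = new ToyContext(conf)

    // Mutations after the handoff touch only the caller's object.
    conf.set("spark.executor.memory", "8g")
    conf.set("spark.late.key", "ignored")

    (ctx.getConf.getOption("spark.executor.memory"),
     ctx.getConf.getOption("spark.late.key"))
  }
}
```

In this model, demo() returns the value that was present when the context was built and nothing for the key added later, which is the behaviour the text describes.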
SparkContext — the entry point
The class doc calls it "the main entry point for Spark functionality" — the connection to a cluster used to create RDDs, accumulators, and broadcast variables. Only one active SparkContext may exist per JVM.
Its constructor declares four runtime pillars as private fields — _env, _schedulerBackend, _taskScheduler, _dagScheduler — and fills them in during initialization. Holding these references is what makes the context the hub of the whole system.
SparkEnv — the shared service container
SparkEnv "holds all the runtime environment objects for a running Spark instance." A single instance exists per process (driver or executor) and is reachable from anywhere via the global SparkEnv.get. This is how task code running deep inside an RDD can reach the local BlockManager without threading references through every call.
It bundles the serializer, the closure serializer, the BlockManager, the MapOutputTracker (a master on the driver, a worker on executors), the broadcast manager, and the shuffle and memory managers, which are initialized later than the rest.
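The global-accessor idea can be sketched in a few lines. ToyEnv below is a hypothetical, heavily reduced stand-in (one field instead of a real BlockManager); the point is only that code several calls deep reaches the per-process instance without receiving it as a parameter.

```scala
object EnvHolderSketch {
  // Hypothetical stand-in for SparkEnv; `blocks` plays the role of a block manager.
  final case class ToyEnv(executorId: String, blocks: Map[String, String])

  object ToyEnv {
    // One instance per process, published once during startup.
    @volatile private var env: ToyEnv = null
    def set(e: ToyEnv): Unit = { env = e }
    def get: ToyEnv = env
  }

  // "Task code": no environment argument is threaded through these calls.
  def readBlock(blockId: String): Option[String] = ToyEnv.get.blocks.get(blockId)
  def runTask(blockId: String): String = readBlock(blockId).getOrElse("<missing>")

  def demo(): String = {
    ToyEnv.set(ToyEnv("driver", Map("rdd_0_0" -> "cached bytes")))
    runTask("rdd_0_0")
  }
}
```

The trade-off is the usual one for process-wide state: call sites stay short, but anything that reads ToyEnv.get depends on startup having already called set.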
Why shuffle and memory managers initialize late
Notice that SparkEnv creates the shuffle and memory managers after the rest of the environment, through initializeShuffleManager() and initializeMemoryManager(...). This deferral lets user JARs and plugins register custom implementations (for example, a custom ShuffleManager) before they are instantiated.
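A toy model makes the ordering argument concrete. Everything here is an assumption made for illustration: ToyEnv, addUserJar, and the short name "custom" are hypothetical, and the `loadable` map merely stands in for "classes the class loader can currently see". Only the key spark.shuffle.manager and the method name initializeShuffleManager come from Spark.

```scala
import scala.collection.mutable
import scala.util.Try

object DeferredInitSketch {
  trait ToyShuffleManager { def name: String }
  final class SortShuffle extends ToyShuffleManager { val name = "sort" }
  final class CustomShuffle extends ToyShuffleManager { val name = "custom" }

  final class ToyEnv(conf: Map[String, String]) {
    // Stand-in for the class loader: implementations that can be instantiated now.
    private val loadable = mutable.Map[String, () => ToyShuffleManager](
      "sort" -> (() => new SortShuffle))
    private var shuffle: Option[ToyShuffleManager] = None

    // Models user JARs or plugins becoming visible after the env exists.
    def addUserJar(name: String, factory: () => ToyShuffleManager): Unit = {
      loadable(name) = factory
    }

    // Resolves the configured implementation at call time, not at construction.
    def initializeShuffleManager(): Unit = {
      require(shuffle.isEmpty, "shuffle manager already initialized")
      val wanted = conf.getOrElse("spark.shuffle.manager", "sort")
      val factory = loadable.getOrElse(wanted, throw new ClassNotFoundException(wanted))
      shuffle = Some(factory())
    }

    def shuffleManager: ToyShuffleManager =
      shuffle.getOrElse(throw new IllegalStateException("not initialized yet"))
  }

  def demo(): (Boolean, String) = {
    val conf = Map("spark.shuffle.manager" -> "custom")

    // Initializing before the user JAR is visible cannot find the class.
    val eagerFails = Try(new ToyEnv(conf).initializeShuffleManager()).isFailure

    // Deferring the call until after registration succeeds.
    val env = new ToyEnv(conf)
    env.addUserJar("custom", () => new CustomShuffle)
    env.initializeShuffleManager()
    (eagerFails, env.shuffleManager.name)
  }
}
```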
createTaskScheduler — one code path, many cluster managers
This factory pattern-matches on the master URL and returns a (SchedulerBackend, TaskScheduler) pair. local[*] yields a LocalSchedulerBackend; spark:// yields a StandaloneSchedulerBackend; YARN and Kubernetes are loaded through the ExternalClusterManager service-provider interface. The rest of Spark is written against the SchedulerBackend trait, so the same scheduling logic runs unchanged everywhere.
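The dispatch on the master URL can be sketched as a single pattern match. This is a simplified model, not the Spark source: the Backend hierarchy and chooseBackend are hypothetical names, the regular expressions cover only the three URL shapes mentioned above, and the External case merely marks where a service-provider lookup would happen.

```scala
object MasterUrlSketch {
  // Hypothetical result types standing in for the real scheduler backends.
  sealed trait Backend
  final case class Local(threads: Int) extends Backend
  final case class Standalone(masters: Seq[String]) extends Backend
  final case class External(masterUrl: String) extends Backend

  private val LocalN = """local\[([0-9]+|\*)\]""".r
  private val SparkUrl = """spark://(.*)""".r

  def chooseBackend(
      master: String,
      cores: Int = Runtime.getRuntime.availableProcessors()): Backend =
    master match {
      case "local" => Local(1)
      // local[N] uses N threads; local[*] uses one per available core.
      case LocalN(n) => Local(if (n == "*") cores else n.toInt)
      // A standalone URL may list several masters separated by commas.
      case SparkUrl(hosts) => Standalone(hosts.split(",").toList.map("spark://" + _))
      // Anything else is left to an externally provided cluster manager.
      case other => External(other)
    }
}
```

For example, chooseBackend("local[4]") yields Local(4) and chooseBackend("yarn") falls through to External("yarn"). Callers only ever see the Backend type, which mirrors how the rest of the system is written against one trait.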
runJob — the bridge from your code to the cluster
Every action ends here. runJob takes a target RDD, a function to run on each partition, the set of partition indices, and a result handler. It cleans the closure (so it can be serialized and shipped), records the call site for the UI, and delegates to dagScheduler.runJob. After the job finishes it triggers any pending checkpointing.
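// Simplified from SparkContext.runJob; logging and other bookkeeping omitted.
def runJob[T, U: ClassTag](
    rdd: RDD[T],
    func: (TaskContext, Iterator[T]) => U,
    partitions: Seq[Int],
    resultHandler: (Int, U) => Unit): Unit = {
  val callSite = getCallSite()            // record the call site for the UI
  val cleanedFunc = clean(func)           // make the closure shippable
  dagScheduler.runJob(rdd, cleanedFunc,   // hand off to the DAG scheduler
    partitions, callSite, resultHandler, ...)
  rdd.doCheckpoint()                      // materialize checkpoints if any
}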
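To see how an action reduces to this shape, here is a local model in which an "RDD" is just a vector of partitions. It is a sketch under stated assumptions: there is no cluster, no closure cleaning, and no TaskContext argument, and the runJob and count defined here are hypothetical toy functions that borrow only the parameter layout described above.

```scala
object RunJobSketch {
  // Toy runJob: apply `func` to each requested partition and report
  // (partitionIndex, result) to the handler, one partition at a time.
  def runJob[T, U](
      rdd: Vector[Vector[T]],
      func: Iterator[T] => U,
      partitions: Seq[Int],
      resultHandler: (Int, U) => Unit): Unit =
    partitions.foreach(i => resultHandler(i, func(rdd(i).iterator)))

  // An action built on top: count elements per partition, then add them up.
  def count[T](rdd: Vector[Vector[T]]): Long = {
    var total = 0L
    runJob[T, Long](rdd, it => it.size.toLong, rdd.indices, (_, n) => total += n)
    total
  }
}
```

Passing a subset of indices as `partitions` runs the function on those partitions only, which is how an action that needs just part of the data can avoid touching the rest.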
How the pieces relate
The driver is a single JVM, but it holds several long-lived collaborators. The SparkContext owns them; SparkEnv makes the shared services globally reachable; the scheduler trio (DAGScheduler, TaskSchedulerImpl, SchedulerBackend) coordinates execution.
| Object | Lives on | Responsibility |
|---|---|---|
| SparkContext | Driver only | User API surface; owns the scheduler stack and tracks jobs. |
| SparkEnv | Driver and every executor | Shared services (serializer, block manager, trackers) via SparkEnv.get. |
| DAGScheduler | Driver only | Plans stages from RDD lineage; covered on the scheduling page. |
| TaskSchedulerImpl | Driver only | Assigns tasks to executor offers. |
| SchedulerBackend | Driver only | Talks to the cluster manager and to executors. |
Key takeaways
- Configuration is frozen at context creation; set spark.* properties first.
- SparkEnv.get is the back-channel that lets remote task code reach local services.
- One factory (createTaskScheduler) hides all cluster-manager differences.
- SparkContext.runJob is the single funnel for every action.