1. Executive mental model
Apache Spark is a
distributed batch and stream analytics engine.
It exposes high-level APIs (Scala, Java, Python, R) that compile
user programs into a graph of parallel tasks executed on a
cluster of JVM processes. The codebase at
github.com/apache/spark (version
5.0.0-SNAPSHOT in master) is a large polyglot
monorepo centered on Scala 2.13 and Java 17+.
Think of Spark in three stacked layers:
Driver vs executor: One JVM (the
driver) holds SparkContext, builds the
execution plan, and schedules work. Worker JVMs
(executors) run tasks, cache blocks, and write shuffle
data. In local[*] mode, driver and executors share
one process.
Lazy evaluation: Transformations build a
logical graph (RDD lineage or Catalyst LogicalPlan)
without running cluster work. Actions (count,
collect, writing sinks) trigger job submission.
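The transformation/action split can be illustrated with a toy model. The sketch below is plain Python, not Spark code: `LazyDataset` and its methods are hypothetical names chosen to mirror the RDD vocabulary, and the "lineage" is just a list of recorded functions. It only aims to show that transformations record work while actions trigger it.

```python
from typing import Callable, Iterable, List, Optional


class LazyDataset:
    """Toy stand-in for an RDD: records transformations, runs them only on an action."""

    def __init__(self, source: Iterable[int], lineage: Optional[List[Callable]] = None):
        self._source = list(source)
        self._lineage = list(lineage or [])  # recorded steps, not yet executed

    # Transformations: return a new dataset with a longer lineage; no work is done.
    def map(self, fn):
        step = lambda rows: (fn(r) for r in rows)
        return LazyDataset(self._source, self._lineage + [step])

    def filter(self, pred):
        step = lambda rows: (r for r in rows if pred(r))
        return LazyDataset(self._source, self._lineage + [step])

    # Actions: replay the recorded lineage and hand a result back to the caller.
    def collect(self):
        rows = iter(self._source)
        for step in self._lineage:
            rows = step(rows)
        return list(rows)

    def count(self):
        return len(self.collect())


calls = []  # records every element the mapped function actually touches


def double(x):
    calls.append(x)
    return x * 2


ds = LazyDataset(range(5)).map(double).filter(lambda x: x > 2)
print(calls)         # [] because nothing has run yet
print(ds.collect())  # [4, 6, 8]
print(calls)         # [0, 1, 2, 3, 4]
```

In this toy, every action replays the whole lineage from the source, which is also why caching matters in the real engine: without it, repeated actions repeat the upstream work.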
Two query paths coexist:
- RDD API: lineage graph → stages at shuffle boundaries → tasks.
- SQL/DataFrame API: SQL text or DataFrame ops → Catalyst parse/analyze/optimize → physical SparkPlan → RDD → same scheduler.
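To make "stages at shuffle boundaries" concrete, here is a minimal sketch in plain Python. It is an assumption-laden simplification, not the DAGScheduler's algorithm: the lineage is modeled as a flat list (the real scheduler walks a dependency graph, where an operation can have several parents), and `Op` and `split_into_stages` are hypothetical names. Operation names such as `reduceByKey` are used only as labels.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Op:
    name: str
    shuffle: bool = False  # True when the op needs data redistributed across tasks


def split_into_stages(lineage: List[Op]) -> List[List[str]]:
    """Group a linear lineage into stages, starting a new stage at each shuffle op."""
    stages: List[List[str]] = []
    for op in lineage:
        # A shuffle op cannot be pipelined with what came before it,
        # so it opens a new stage; narrow ops join the current stage.
        if not stages or op.shuffle:
            stages.append([])
        stages[-1].append(op.name)
    return stages


word_count = [
    Op("textFile"),
    Op("flatMap"),
    Op("map"),
    Op("reduceByKey", shuffle=True),
    Op("map"),
]
print(split_into_stages(word_count))
# [['textFile', 'flatMap', 'map'], ['reduceByKey', 'map']]
```

Each inner list corresponds to work that can be pipelined inside one task per partition; the boundary between lists is where shuffle data would be written and read.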
Spark Connect (introduced in 3.4, heavily expanded since)
splits client and server: the client builds protobuf plans; the
server runs the classic Catalyst +
QueryExecution stack and returns Arrow batches over
gRPC.
2. Repository map
What kind of project
Apache Spark is an open-source
distributed computing framework, not a web app
or database server. It is packaged as Maven/SBT modules,
assembled into a tarball with bin/ launch scripts,
and deployed on YARN, Kubernetes, or Spark Standalone.
Languages, runtimes, build tools
| Technology | Role | Evidence |
|---|---|---|
| Scala 2.13 | Primary implementation language for core, SQL, streaming | pom.xml artifact spark-parent_2.13 |
| Java 17+ | Launcher, network, some core/tests | README.md, launcher/ |
| Python 3.11+ | PySpark via Py4J | python/pyspark/, pyproject.toml |
| R (deprecated) | SparkR bindings | R/, README note |
| Apache Maven | Official build & release | ./build/mvn, root pom.xml |
| SBT | Developer/CI fast iteration | project/SparkBuild.scala, ./build/sbt |
| Protobuf/gRPC | Spark Connect wire protocol | sql/connect/ |
Top-level directories
| Directory | Role | Central vs peripheral |
|---|---|---|
| core/ | RDD, scheduling, deploy, RPC, storage, shuffle (the engine kernel) | Core |
| sql/ | Catalyst compiler, execution engine, Hive, Connect, pipelines | Core |
| common/ | Shared libs: network, unsafe memory, kvstore, utils | Core |
| launcher/ | Minimal JVM to construct java command lines | Core (bootstrap) |
| resource-managers/ | YARN and Kubernetes integration | Core (when deployed) |
| python/ | PySpark client library and tests | Core (API surface) |
| streaming/ | Legacy DStream API (pre-Structured Streaming) | Peripheral (legacy) |
| sql/.../streaming/ | Structured Streaming (micro-batch, state store) | Core |
| mllib/, graphx/ | ML and graph algorithms on RDDs/DataFrames | Peripheral (libraries) |
| connector/ | Kafka, Avro, Protobuf, Kinesis connectors | Peripheral (integrations) |
| assembly/ | Fat JAR / distribution assembly | Build infra |
| bin/, sbin/ | CLI entry scripts, cluster daemons | Entry points |
| conf/ | Template configs (spark-defaults.conf.template) | Configuration |
| examples/ | Sample apps (Pi, etc.) | Peripheral |
| docs/ | User-facing documentation site source | Docs |
| dev/ | Test runners, release scripts, lint | Tooling |
| .github/workflows/ | CI (63 workflow files) | Infra |
| repl/ | Scala REPL integration | Peripheral |
| udf/ | External UDF worker over gRPC | Peripheral (extension) |
| ui-test/ | Jest tests for Spark UI static assets | Tests |
Entry points
| Entry | Path | Main class |
|---|---|---|
| spark-submit | bin/spark-submit → bin/spark-class | org.apache.spark.deploy.SparkSubmit |
| Launcher bootstrap | bin/spark-class | org.apache.spark.launcher.Main |
| Interactive Scala | bin/spark-shell | org.apache.spark.repl.Main |
| Interactive Python | bin/pyspark | Py4J gateway → JVM shell |
| SQL CLI | bin/spark-sql | SQL shell main via SparkSubmit |
| Standalone master | sbin/start-master.sh | org.apache.spark.deploy.master.Master |
| Standalone worker | sbin/start-worker.sh | org.apache.spark.deploy.worker.Worker |
| Executor process | Launched by cluster manager | org.apache.spark.executor.CoarseGrainedExecutorBackend |
| Spark Connect server | Started with Spark app / dedicated command | org.apache.spark.sql.connect.service.SparkConnectService |
| In-process API | User code | SparkSession.builder().getOrCreate() |
Configuration vs generated vs tests
- Configuration: conf/*.template, SparkConf, SQLConf, typed keys in core/.../internal/config/package.scala
- Generated: Protobuf classes from sql/connect, ANTLR parsers in Catalyst, build-info in build/spark-build-info
- Tests: <module>/src/test/scala/** (ScalaTest), src/test/java/** (JUnit 5), python/pyspark/**/tests/ (unittest)
- Scripts: dev/run-tests.py, dev/make-distribution.sh, build/mvn, build/sbt
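As a rough illustration of how the configuration sources listed above combine, the sketch below parses text in the style of conf/spark-defaults.conf.template and layers overrides on top of it. It is plain Python, not Spark's loader: `parse_defaults` and `effective_conf` are hypothetical helpers, the parser handles only whitespace-separated `key value` lines and `#` comments, and the precedence order (values set in code over spark-submit flags over the defaults file) is assumed from Spark's configuration documentation rather than taken from this repository walkthrough.

```python
from typing import Dict


def parse_defaults(text: str) -> Dict[str, str]:
    """Parse 'key <whitespace> value' lines, skipping blanks and # comments."""
    conf: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)  # split on the first run of whitespace
        conf[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    return conf


def effective_conf(defaults_text: str,
                   submit_flags: Dict[str, str],
                   in_code: Dict[str, str]) -> Dict[str, str]:
    """Merge sources; later updates win (assumed order: file < flags < code)."""
    merged = parse_defaults(defaults_text)
    merged.update(submit_flags)  # e.g. values passed via --conf
    merged.update(in_code)       # e.g. values set on SparkConf in the application
    return merged


DEFAULTS = """
# illustrative defaults file
spark.master            local[2]
spark.executor.memory   2g
"""

print(effective_conf(DEFAULTS,
                     submit_flags={"spark.executor.memory": "4g"},
                     in_code={"spark.master": "local[4]"}))
# {'spark.master': 'local[4]', 'spark.executor.memory': '4g'}
```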
core/ is central: Every API
path eventually calls SparkContext.runJob, uses
SparkEnv singletons on each JVM, and depends on
DAGScheduler + BlockManager. SQL is a
compiler layered on top; without core, nothing runs on a
cluster.