5. Core modules (module-by-module)
core/ — distributed runtime kernel
Responsibility: RDD API, scheduling, shuffle, block storage, RPC, deployment, UI.
Public API: org.apache.spark.SparkContext, SparkConf, org.apache.spark.rdd.RDD. org.apache.spark.sql.SparkSession is not part of core; it is defined in the SQL modules and wraps a SparkContext.
Key internals:
- DAGScheduler, TaskSchedulerImpl, CoarseGrainedSchedulerBackend
- BlockManager, MemoryStore, DiskStore
- SortShuffleManager, MapOutputTracker
- RpcEnv, NettyBlockTransferService
- deploy/: SparkSubmit, Standalone Master/Worker
Depends on: common/network-*, common/unsafe, Hadoop client libs.
Depended on by: All other Spark modules.
Invariants: One active SparkContext per JVM; shuffle IDs globally unique per app; stage cleanup when jobs complete.
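To make the DAGScheduler's role concrete, here is a minimal sketch of how a lineage graph can be cut into stages at shuffle boundaries. It is a toy model, not Spark's code: `ToyRDD` and `build_stages` are hypothetical names, and the stage numbering is arbitrary and does not follow Spark's own ID assignment.

```python
from dataclasses import dataclass, field
from itertools import count


# Hypothetical toy type. Spark's real RDDs carry NarrowDependency and
# ShuffleDependency objects with far more state than shown here.
@dataclass
class ToyRDD:
    name: str
    narrow_parents: list = field(default_factory=list)
    shuffle_parents: list = field(default_factory=list)


def build_stages(final_rdd):
    """Group RDDs into stages: narrow edges stay inside a stage,
    every shuffle edge leads to a separate parent stage."""
    stage_ids = count()
    stages = {}         # stage id -> names of the RDDs pipelined in that stage
    stage_parents = {}  # stage id -> ids of the stages it must wait for
    stage_of_root = {}  # id(root RDD) -> stage id, so shared parents are reused

    def visit(root):
        if id(root) in stage_of_root:
            return stage_of_root[id(root)]
        sid = next(stage_ids)
        stage_of_root[id(root)] = sid
        stages[sid] = []
        stage_parents[sid] = set()
        pending, seen = [root], set()
        while pending:
            rdd = pending.pop()
            if id(rdd) in seen:
                continue
            seen.add(id(rdd))
            stages[sid].append(rdd.name)
            # Narrow parents are pipelined into the same stage.
            pending.extend(rdd.narrow_parents)
            # A shuffle parent becomes the root of another stage.
            for parent in rdd.shuffle_parents:
                stage_parents[sid].add(visit(parent))
        return sid

    return visit(final_rdd), stages, stage_parents


# A word-count-shaped lineage: three narrow steps, then one shuffle.
source = ToyRDD("textFile")
words = ToyRDD("flatMap", narrow_parents=[source])
pairs = ToyRDD("map", narrow_parents=[words])
counts = ToyRDD("reduceByKey", shuffle_parents=[pairs])

final_stage, stages, parents = build_stages(counts)
```

In this example the three narrow transformations land in one stage and `reduceByKey` lands in a second stage that depends on the first, which is the shape a shuffle produces in a two-stage job.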
sql/catalyst/ — query compiler
Responsibility: SQL parsing, analysis, optimization, expression trees, rule engine.
Public API: Mostly private[sql] / developer — ParserInterface, Analyzer, Optimizer, LogicalPlan, Rule[T].
Key classes: RuleExecutor (fixed-point batches), CatalystSqlParser, TreeNode (transformations via patterns).
Pattern: Compiler passes — batches of Rule[LogicalPlan] applied until fixpoint.
Non-obvious: Catalyst is largely cluster-agnostic; no RDD references in logical plans.
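The "batches of rules applied until fixpoint" pattern can be illustrated with a small sketch. Everything below is a hypothetical stand-in: `Lit`, `Attr`, `Add`, and `run_batch` are invented for the example and are much simpler than Catalyst's TreeNode and RuleExecutor. Raising on non-convergence is also a simplification of this sketch.

```python
from dataclasses import dataclass
from typing import Callable, Union


# Hypothetical mini expression tree (immutable, compared by value).
@dataclass(frozen=True)
class Lit:
    value: int


@dataclass(frozen=True)
class Attr:
    name: str


@dataclass(frozen=True)
class Add:
    left: "Expr"
    right: "Expr"


Expr = Union[Lit, Attr, Add]
Rule = Callable[[Expr], Expr]


def transform_up(expr, rule):
    """Rewrite children first, then apply `rule` to the rebuilt node."""
    if isinstance(expr, Add):
        expr = Add(transform_up(expr.left, rule), transform_up(expr.right, rule))
    return rule(expr)


def constant_fold(expr):
    # Add(Lit(a), Lit(b)) -> Lit(a + b)
    if isinstance(expr, Add) and isinstance(expr.left, Lit) and isinstance(expr.right, Lit):
        return Lit(expr.left.value + expr.right.value)
    return expr


def remove_add_zero(expr):
    # x + 0 -> x and 0 + x -> x
    if isinstance(expr, Add):
        if expr.right == Lit(0):
            return expr.left
        if expr.left == Lit(0):
            return expr.right
    return expr


def run_batch(plan, rules, max_iterations=100):
    """Apply every rule in order, repeating until the plan stops changing."""
    for _ in range(max_iterations):
        new_plan = plan
        for rule in rules:
            new_plan = transform_up(new_plan, rule)
        if new_plan == plan:  # fixed point reached
            return plan
        plan = new_plan
    raise RuntimeError("batch did not converge within max_iterations")


# (x + (1 + -1)) + 0 simplifies to x: folding exposes the zero that the
# second rule then removes.
plan = Add(Add(Attr("x"), Add(Lit(1), Lit(-1))), Lit(0))
optimized = run_batch(plan, [constant_fold, remove_add_zero])
```

The point of iterating to a fixed point is that one rule can create an opportunity for another, as constant folding does here for zero removal.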
sql/core/ — SQL runtime
Responsibility: Physical planning, execution operators, session state, classic API, Structured Streaming runtime.
Public API: org.apache.spark.sql.classic.SparkSession, Dataset, DataFrameReader/Writer.
Key classes: QueryExecution, SparkPlan, SparkPlanner, SQLExecution, FileSourceScanExec, AdaptiveSparkPlanExec.
Depends on: catalyst, core.
sql/hive/ — Hive metastore integration
Responsibility: HiveExternalCatalog, Hive SerDe tables, HiveTableScanExec.
Activation: SparkSession.builder.enableHiveSupport() → HiveSessionStateBuilder.
Peripheral when: Using only DataSource V2 / in-memory catalog.
sql/connect/ — decoupled client/server
| Submodule | Role |
|---|---|
| connect/common | Client SparkSession, SparkConnectClient, proto plan builders |
| connect/server | SparkConnectService gRPC, SparkConnectPlanner, session manager |
| connect/client/jvm | JVM client packaging |
| connect/shims | Version compatibility shims |
Critical invariant: Server reuses classic QueryExecution — Connect is transport + plan deserialization, not a separate engine.
launcher/ — bootstrap only
Responsibility: Construct JVM commands without loading full Spark.
API: SparkLauncher (Java), Main, SparkSubmitCommandBuilder.
Why separate: Keeps shell scripts fast; classpath computed before heavy Spark classes load.
common/ — shared infrastructure
| Module | Purpose |
|---|---|
| network-common | Netty RPC, transport config |
| network-shuffle | External shuffle service protocol |
| network-yarn | YARN shuffle integration |
| unsafe | Off-heap memory access for Tungsten |
| kvstore | RocksDB/InMemory KV for UI history |
| sketch | Approximate algorithms (Bloom, quantiles) |
resource-managers/
YARN: Client.scala, ApplicationMaster.scala, YarnSchedulerBackend.scala.
Kubernetes: org.apache.spark.deploy.k8s — pod spec builders, driver/executor pod lifecycle.
Standalone: Lives in core/deploy/master and core/deploy/worker — not under resource-managers.
python/pyspark/
Responsibility: Python API mirroring JVM; Py4J bridge; Python worker for UDFs/Pandas UDFs.
Key files: java_gateway.py, worker.py, sql/, connect/.
Type: Glue + API — heavy lifting on JVM.
streaming/ (legacy DStreams)
Old micro-batch API over RDDs. Structured Streaming in sql/core/.../execution/streaming/ is the maintained path.
mllib/, graphx/
Algorithm libraries built on RDDs and DataFrames. They matter to ML and graph users but are not part of the scheduling core.
connector/
Kafka, Avro, Protobuf, Kinesis — optional Maven modules, DataSource V1/V2 implementations.
6. Data model and persistence
Domain entities (in-memory)
| Entity | Representation |
|---|---|
| Row (SQL) | InternalRow (Catalyst), Row (user-facing) |
| Schema | StructType, StructField |
| Table metadata | CatalogTable, V2 Table via TableCatalog |
| Block | BlockId (e.g. rdd_123_4, shuffle_0_0_0) |
| Shuffle metadata | MapStatus, index files + data files on disk |
| Streaming offset | OffsetSeqMetadata, JSON in checkpoint dir |
| State store | Versioned key-value per operator + partition (HDFSBackedStateStore) |
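The block names in the table encode their coordinates: an RDD block is named by RDD ID and partition index, and a shuffle block by shuffle ID, map ID, and reduce ID. The sketch below parses only those two forms; Spark has other block kinds that it does not cover, and `RddBlock`, `ShuffleBlock`, and `parse_block_id` are hypothetical names for this example.

```python
import re
from typing import NamedTuple, Union


class RddBlock(NamedTuple):
    rdd_id: int
    split_index: int


class ShuffleBlock(NamedTuple):
    shuffle_id: int
    map_id: int
    reduce_id: int


_RDD = re.compile(r"rdd_(\d+)_(\d+)")
_SHUFFLE = re.compile(r"shuffle_(\d+)_(\d+)_(\d+)")


def parse_block_id(name: str) -> Union[RddBlock, ShuffleBlock]:
    """Decode the two block-name forms shown in the table above."""
    match = _RDD.fullmatch(name)
    if match:
        return RddBlock(*map(int, match.groups()))
    match = _SHUFFLE.fullmatch(name)
    if match:
        return ShuffleBlock(*map(int, match.groups()))
    # Any other block kind is outside the scope of this sketch.
    raise ValueError(f"unrecognized block id: {name!r}")


rdd_block = parse_block_id("rdd_123_4")
shuffle_block = parse_block_id("shuffle_0_0_0")
```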
No ORM / app database
Spark itself does not persist application domain data in an internal database. Metadata goes to:
- Hive Metastore (external DB) via HiveExternalCatalog
- In-memory catalog (SessionCatalog) for temp views
- Event log: JSON files for Spark UI replay (spark.eventLog.dir)
- Checkpoint dirs: streaming state and offsets on Hadoop FS
- KVStore: UI elements in memory/RocksDB (common/kvstore)
Serialization
- Default: JavaSerializer; Kryo is opt-in via spark.serializer, optionally with a registrator for common types
- Closure: closureSerializer (Java serialization) for task closures and their captured variables
- SQL: Encoder / expression codegen for row serialization
- Connect: Protobuf plans + Apache Arrow for result batches
- Shuffle: Pluggable Serializer on shuffle records
Config formats
- SparkConf: string key/value pairs, with typed accessors in the config._ objects
- SQLConf: session-level SQL settings (spark.sql.*)
- conf/spark-defaults.conf: deployment defaults
- conf/spark-env.sh: env vars (SPARK_MASTER_HOST, etc.)
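These sources are layered. Spark's configuration documentation gives the precedence as: properties set directly on SparkConf first, then spark-submit flags, then spark-defaults.conf. The sketch below models that merge with plain dictionaries. `parse_defaults` and `resolve` are hypothetical helpers, and the parser handles only the simple whitespace-separated form of the defaults file, with made-up example values.

```python
def parse_defaults(text):
    """Parse a simplified spark-defaults.conf: `key <whitespace> value`
    per line, with `#` comment lines and blank lines skipped."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)  # split on the first run of whitespace
        props[parts[0]] = parts[1] if len(parts) == 2 else ""
    return props


def resolve(defaults_file, submit_flags, set_in_code):
    """Merge the three layers, lowest precedence first, so that later
    layers overwrite earlier ones."""
    merged = dict(defaults_file)
    merged.update(submit_flags)
    merged.update(set_in_code)
    return merged


# Illustrative values only.
defaults_text = """
# cluster-wide defaults
spark.master                  spark://master:7077
spark.executor.memory         4g
spark.sql.shuffle.partitions  200
"""

effective = resolve(
    parse_defaults(defaults_text),
    submit_flags={"spark.executor.memory": "8g"},
    set_in_code={"spark.app.name": "etl", "spark.sql.shuffle.partitions": "64"},
)
```

Here the submit flag overrides the file's executor memory, the value set in code overrides the file's shuffle partition count, and the master URL falls through from the defaults file.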
Data lifecycle example (batch write)
7. External interfaces (summary)
| Interface | Protocol / format | Entry |
|---|---|---|
| CLI | shell scripts | bin/* |
| Scala/Java API | in-process | SparkSession, SparkContext |
| Python API | Py4J TCP | pyspark |
| Spark Connect | gRPC + Protobuf + Arrow | SparkConnectClient |
| JDBC/ODBC | Thrift JDBC | sql/hive-thriftserver |
| REST submission | JSON | core/.../deploy/rest/ |
| Filesystem | Hadoop FileSystem API | DataSources, checkpoints |
| Plugin SPI | ServiceLoader | ExternalClusterManager, DataSource V2, catalog plugins |
8. Dependencies and architecture patterns
Dependency direction
Patterns in use
- Lineage-based DAG scheduler: not a general-purpose workflow engine
- Rule-based query compiler (Catalyst): similar to functional compiler passes
- Volcano-style iterator model: SparkPlan.execute() produces RDD pipelines
- Whole-stage codegen: fuses operators into generated Java for hot paths
- Adaptive Query Execution (AQE): re-optimizes mid-query based on runtime stats
- Plugin / SPI: cluster managers, catalogs, data sources via Java ServiceLoader
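As an illustration of the Volcano-style iterator model from the list above, the sketch below composes pull-based operators with Python generators. It shows only the pull-based composition: the operator classes are hypothetical, and Spark's physical operators work on partitions of an RDD rather than a single in-memory list.

```python
class Scan:
    """Leaf operator: yields rows from an in-memory list and counts reads."""

    def __init__(self, rows):
        self.rows = rows
        self.rows_read = 0

    def execute(self):
        for row in self.rows:
            self.rows_read += 1
            yield row


class Filter:
    def __init__(self, child, predicate):
        self.child = child
        self.predicate = predicate

    def execute(self):
        for row in self.child.execute():
            if self.predicate(row):
                yield row


class Project:
    def __init__(self, child, columns):
        self.child = child
        self.columns = columns

    def execute(self):
        for row in self.child.execute():
            yield {column: row[column] for column in self.columns}


class Limit:
    def __init__(self, child, n):
        self.child = child
        self.n = n

    def execute(self):
        if self.n <= 0:
            return
        for emitted, row in enumerate(self.child.execute(), start=1):
            yield row
            if emitted >= self.n:
                return  # stop pulling from the child once n rows are out


rows = [
    {"name": "ann", "age": 31},
    {"name": "bob", "age": 15},
    {"name": "cy", "age": 42},
    {"name": "di", "age": 27},
    {"name": "ed", "age": 9},
]

scan = Scan(rows)
plan = Limit(Project(Filter(scan, lambda r: r["age"] >= 18), ["name"]), 2)
result = list(plan.execute())
```

Because each operator pulls from its child on demand, the limit stops the scan after three input rows here instead of reading all five. Whole-stage codegen aims to keep this pipelined behavior while removing the per-row call overhead between operators.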
Coupling vs separation
| Clean separation | Tight coupling |
|---|---|
| Catalyst logical vs physical plans | SparkEnv global singleton |
| SchedulerBackend trait vs implementations | SQL ↔ Core via RDD bridge (necessary) |
| DataSource V2 connector API | Hive module patches SessionState builder |
| Launcher vs core classpath | PySpark requires matching JVM/Python versions |