Spark — Core Modules

5. Core modules (module-by-module)

`core/` — distributed runtime kernel

Responsibility: RDD API, scheduling, shuffle, block storage, RPC, deployment, UI.

Public API: org.apache.spark.SparkContext, org.apache.spark.SparkConf, org.apache.spark.rdd.RDD. org.apache.spark.sql.SparkSession is defined in the SQL module, not here; it wraps a SparkContext.

Key internals:

DAGScheduler, TaskSchedulerImpl, CoarseGrainedSchedulerBackend
BlockManager, MemoryStore, DiskStore
SortShuffleManager, MapOutputTracker
RpcEnv, NettyBlockTransferService
deploy/ — SparkSubmit, Standalone Master/Worker

Depends on: common/network-*, common/unsafe, Hadoop client libs.

Depended on by: All other Spark modules.

Invariants: One active SparkContext per JVM; shuffle IDs globally unique per app; stage cleanup when jobs complete.
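
To make the DAGScheduler's role more concrete, here is a minimal Python sketch of how a lineage graph might be cut into stages at shuffle boundaries. The `Node` class and `split_into_stages` function are hypothetical names for illustration only; the real scheduler also handles shared parents, caching, retries, and much more.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for an RDD in a lineage graph."""
    name: str
    # Each parent is a (Node, is_shuffle) pair; is_shuffle marks a wide dependency.
    parents: list = field(default_factory=list)


def split_into_stages(final):
    """Return stages (lists of node names), parent stages first.

    Narrow dependencies stay inside the current stage; a shuffle
    dependency closes the stage and starts a new one for the parent.
    """
    stages = []

    def visit(root):
        stage, stack = [], [root]
        while stack:
            node = stack.pop()
            stage.append(node.name)
            for parent, is_shuffle in node.parents:
                if is_shuffle:
                    visit(parent)  # the parent stage is emitted before this one
                else:
                    stack.append(parent)
        stages.append(stage)

    visit(final)
    return stages


# textFile -> map -> (shuffle) -> reduceByKey -> filter
source = Node("textFile")
mapped = Node("map", [(source, False)])
reduced = Node("reduceByKey", [(mapped, True)])
result = Node("filter", [(reduced, False)])

stages = split_into_stages(result)
```

In this toy lineage the shuffle before `reduceByKey` yields two stages, and the map-side stage is listed first because the reduce-side stage cannot start until its shuffle input exists.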

`sql/catalyst/` — query compiler

Responsibility: SQL parsing, analysis, optimization, expression trees, rule engine.

Public API: Mostly private[sql] / developer — ParserInterface, Analyzer, Optimizer, LogicalPlan, Rule[T].

Key classes: RuleExecutor (fixed-point batches), CatalystSqlParser, TreeNode (transformations via patterns).

Pattern: Compiler passes. Each batch of Rule[LogicalPlan] runs either once or repeatedly until the plan stops changing (a fixed point) or an iteration limit is reached.
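
The fixed-point batch idea can be sketched in a few lines of plain Python. This is an illustration of the pattern, not Catalyst's code: expressions are nested tuples, and `transform_up`, `fold_constants`, `simplify_identity`, and `run_batch` are hypothetical names chosen for this example.

```python
# Expressions are tuples: ("lit", n), ("col", name), ("add", l, r), ("mul", l, r).

def transform_up(expr, rule):
    """Apply `rule` to every node, children before parents."""
    if expr[0] in ("add", "mul"):
        expr = (expr[0], transform_up(expr[1], rule), transform_up(expr[2], rule))
    return rule(expr)


def fold_constants(e):
    """Replace an operation on two literals with its literal result."""
    if e[0] in ("add", "mul") and e[1][0] == "lit" and e[2][0] == "lit":
        a, b = e[1][1], e[2][1]
        return ("lit", a + b if e[0] == "add" else a * b)
    return e


def simplify_identity(e):
    """Drop x + 0 and x * 1."""
    if e[0] == "add" and ("lit", 0) in (e[1], e[2]):
        return e[2] if e[1] == ("lit", 0) else e[1]
    if e[0] == "mul" and ("lit", 1) in (e[1], e[2]):
        return e[2] if e[1] == ("lit", 1) else e[1]
    return e


def run_batch(plan, rules, max_iterations=100):
    """Apply all rules repeatedly until the plan stops changing."""
    for _ in range(max_iterations):
        new_plan = plan
        for rule in rules:
            new_plan = transform_up(new_plan, rule)
        if new_plan == plan:  # fixed point reached
            return plan
        plan = new_plan
    return plan


# x + (0 * 5): folding yields x + 0, which the identity rule reduces to x.
plan = ("add", ("col", "x"), ("mul", ("lit", 0), ("lit", 5)))
optimized = run_batch(plan, [fold_constants, simplify_identity])
```

The iteration cap mirrors the usual safeguard against rules that keep rewriting each other's output forever.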

Non-obvious: Catalyst is largely cluster-agnostic; no RDD references in logical plans.

`sql/core/` — SQL runtime

Responsibility: Physical planning, execution operators, session state, classic API, Structured Streaming runtime.

Public API: org.apache.spark.sql.classic.SparkSession, Dataset, DataFrameReader/Writer.

Key classes: QueryExecution, SparkPlan, SparkPlanner, SQLExecution, FileSourceScanExec, AdaptiveSparkPlanExec.

Depends on: catalyst, core.

`sql/hive/` — Hive metastore integration

Responsibility: HiveExternalCatalog, Hive SerDe tables, HiveTableScanExec.

Activation: SparkSession.builder.enableHiveSupport() → HiveSessionStateBuilder.

Peripheral when: Using only DataSource V2 / in-memory catalog.

`sql/connect/` — decoupled client/server

| Submodule | Role |
| --- | --- |
| `connect/common` | Client `SparkSession`, `SparkConnectClient`, proto plan builders |
| `connect/server` | `SparkConnectService` gRPC, `SparkConnectPlanner`, session manager |
| `connect/client/jvm` | JVM client packaging |
| `connect/shims` | Version compatibility shims |

Critical invariant: Server reuses classic QueryExecution — Connect is transport + plan deserialization, not a separate engine.

`launcher/` — bootstrap only

Responsibility: Construct JVM commands without loading full Spark.

API: SparkLauncher (Java), Main, SparkSubmitCommandBuilder.

Why separate: Keeps shell scripts fast; classpath computed before heavy Spark classes load.
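
As a rough illustration of what "constructing a JVM command" means, the sketch below assembles an argument list in Python without importing anything Spark-related. `build_submit_command` is a hypothetical helper, and the `jars/*` classpath layout is an assumption about a typical distribution; the real launcher also resolves configuration, environment variables, and JVM options.

```python
import os


def build_submit_command(spark_home, app_args, java_home=None):
    """Return an argv list that would start SparkSubmit in a new JVM.

    Only string manipulation happens here; no Spark classes are loaded.
    """
    java = os.path.join(java_home, "bin", "java") if java_home else "java"
    classpath = os.path.join(spark_home, "jars", "*")  # assumed layout
    return [java, "-cp", classpath, "org.apache.spark.deploy.SparkSubmit", *app_args]


# Illustrative paths and arguments only.
cmd = build_submit_command("/opt/spark", ["--master", "local[2]", "app.jar"])
```

The resulting list could be handed to a process launcher such as `subprocess.Popen`, which is the sense in which the launcher stays cheap: the heavy classes load only in the child JVM.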

`common/` — shared infrastructure

| Module | Purpose |
| --- | --- |
| `network-common` | Netty RPC, transport config |
| `network-shuffle` | External shuffle service protocol |
| `network-yarn` | YARN shuffle integration |
| `unsafe` | Off-heap memory access for Tungsten |
| `kvstore` | RocksDB/InMemory KV for UI history |
| `sketch` | Approximate algorithms (Bloom, quantiles) |

`resource-managers/`

YARN: Client.scala, ApplicationMaster.scala, YarnSchedulerBackend.scala.

Kubernetes: org.apache.spark.deploy.k8s — pod spec builders, driver/executor pod lifecycle.

Standalone: Lives in core/deploy/master and core/deploy/worker — not under resource-managers.

`python/pyspark/`

Responsibility: Python API mirroring JVM; Py4J bridge; Python worker for UDFs/Pandas UDFs.

Key files: java_gateway.py, worker.py, sql/, connect/.

Type: Glue + API — heavy lifting on JVM.

`streaming/` (legacy DStreams)

Old micro-batch API over RDDs. Structured Streaming in sql/core/.../execution/streaming/ is the maintained path.

`mllib/`, `graphx/`

Algorithm libraries built on RDDs/DataFrames. Important for ML and graph users, but not part of the scheduling core.

`connector/`

Kafka, Avro, Protobuf, Kinesis — optional Maven modules, DataSource V1/V2 implementations.

6. Data model and persistence

Domain entities (in-memory)

| Entity | Representation |
| --- | --- |
| Row (SQL) | `InternalRow` (Catalyst), `Row` (user-facing) |
| Schema | `StructType`, `StructField` |
| Table metadata | `CatalogTable`, V2 `Table` via `TableCatalog` |
| Block | `BlockId` (e.g. `rdd_123_4`, `shuffle_0_0_0`) |
| Shuffle metadata | `MapStatus`, index files + data files on disk |
| Streaming offset | `OffsetSeqMetadata`, JSON in checkpoint dir |
| State store | Versioned key-value per operator + partition (`HDFSBackedStateStore`) |
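
The two block name examples in the table encode their identifiers positionally. A small Python sketch, assuming the layouts `rdd_<rddId>_<splitIndex>` and `shuffle_<shuffleId>_<mapId>_<reduceId>`, shows how such names could be decoded. The Python classes below are illustrative stand-ins, not Spark's own types, and other block kinds are deliberately left out.

```python
import re
from typing import NamedTuple


class RDDBlockId(NamedTuple):
    rdd_id: int
    split_index: int


class ShuffleBlockId(NamedTuple):
    shuffle_id: int
    map_id: int
    reduce_id: int


_RDD = re.compile(r"rdd_(\d+)_(\d+)")
_SHUFFLE = re.compile(r"shuffle_(\d+)_(\d+)_(\d+)")


def parse_block_id(name):
    """Decode a block name into a typed identifier, or raise ValueError."""
    match = _RDD.fullmatch(name)
    if match:
        return RDDBlockId(*map(int, match.groups()))
    match = _SHUFFLE.fullmatch(name)
    if match:
        return ShuffleBlockId(*map(int, match.groups()))
    raise ValueError(f"unrecognized block id: {name!r}")


cached_partition = parse_block_id("rdd_123_4")
shuffle_block = parse_block_id("shuffle_0_0_0")
```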

No ORM / app database

Spark itself does not persist application domain data in an internal database. Metadata goes to:

Hive Metastore (external DB) via HiveExternalCatalog
In-memory catalog (SessionCatalog) for temp views
Event log — JSON files for Spark UI replay (spark.eventLog.dir)
Checkpoint dirs — streaming state and offsets on Hadoop FS
KVStore — UI elements in memory/RocksDB (common/kvstore)
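
Because the event log is line-delimited JSON, it can be inspected with ordinary tools. The sketch below counts events by type, assuming each record carries an `Event` field naming the listener event. The sample records are abbreviated for illustration, and `count_events` is a hypothetical helper.

```python
import json
from collections import Counter


def count_events(lines):
    """Count event-log records by their "Event" type, skipping blank lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        counts[json.loads(line)["Event"]] += 1
    return counts


# Abbreviated sample records; real entries carry many more fields.
sample_log = [
    '{"Event": "SparkListenerApplicationStart", "App Name": "demo"}',
    '{"Event": "SparkListenerJobStart", "Job ID": 0}',
    '{"Event": "SparkListenerJobEnd", "Job ID": 0}',
    "",
    '{"Event": "SparkListenerJobStart", "Job ID": 1}',
]

event_counts = count_events(sample_log)
```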

Serialization

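Default: Java serialization (JavaSerializer); Kryo is opt-in by setting spark.serializer to org.apache.spark.serializer.KryoSerializer, optionally with a spark.kryo.registrator for application types
Closure: closureSerializer (Java serialization) for task closures and their captured variables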
SQL: Encoder / expression codegen for row serialization
Connect: Protobuf plans + Apache Arrow for result batches
Shuffle: Pluggable Serializer on shuffle records

Config formats

SparkConf — string key/value, typed accessors in config._ objects
SQLConf — session-level SQL settings (spark.sql.*)
conf/spark-defaults.conf — deployment defaults
conf/spark-env.sh — env vars (SPARK_MASTER_HOST, etc.)
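
To illustrate the string key/value model with typed accessors, here is a minimal Python sketch that reads a `spark-defaults.conf`-style text and exposes a boolean getter. It handles only whitespace-separated pairs and `#` comments, which is a simplification of the real file format, and `parse_defaults` and `get_boolean` are hypothetical helpers.

```python
def parse_defaults(text):
    """Parse 'key value' lines into a dict of strings, ignoring comments."""
    conf = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)  # split on the first run of whitespace
        conf[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    return conf


def get_boolean(conf, key, default):
    """Typed accessor: values are stored as strings and converted on read."""
    value = conf.get(key)
    if value is None:
        return default
    return value.strip().lower() == "true"


sample = """
# Deployment defaults
spark.master            spark://master:7077
spark.serializer        org.apache.spark.serializer.KryoSerializer
spark.eventLog.enabled  true
"""

conf = parse_defaults(sample)
```

Keeping values as strings and converting at the access site is what lets the same setting come from a file, a command-line flag, or code without separate parsing paths.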

Data lifecycle example (batch write)

Parquet files on S3 → FileSourceScanExec reads splits → InternalRow in executor memory (usually UnsafeRow, on-heap or off-heap) → shuffle (optional) → sort/aggregate Exec → per-task commit message returned to the driver (WriterCommitMessage in DataSource V2) → committed files on S3 (task commit followed by job commit via OutputCommitter)
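
The final commit step follows a two-phase pattern: tasks write to a temporary location, and the job publishes the results only after every task has succeeded. The Python sketch below imitates that pattern on a local filesystem using the `_temporary` and `_SUCCESS` naming conventions. It is a simplified illustration, not Hadoop's OutputCommitter algorithm, and `write_task`, `commit_job`, and `run_job` are hypothetical helpers.

```python
import os
import shutil
import tempfile


def write_task(job_dir, task_id, rows):
    """Task side: write output under a temporary attempt directory."""
    attempt_dir = os.path.join(job_dir, "_temporary", f"task_{task_id}")
    os.makedirs(attempt_dir)
    path = os.path.join(attempt_dir, f"part-{task_id:05d}.txt")
    with open(path, "w") as f:
        f.writelines(f"{row}\n" for row in rows)
    return path  # plays the role of the task's commit message


def commit_job(job_dir, commit_messages):
    """Driver side: publish every task's file, then mark the job successful."""
    for path in commit_messages:
        os.rename(path, os.path.join(job_dir, os.path.basename(path)))
    shutil.rmtree(os.path.join(job_dir, "_temporary"))
    open(os.path.join(job_dir, "_SUCCESS"), "w").close()


def run_job(partitions):
    """Run one task per partition and commit only after all have finished."""
    job_dir = tempfile.mkdtemp()
    messages = [write_task(job_dir, i, rows) for i, rows in enumerate(partitions)]
    commit_job(job_dir, messages)
    return job_dir, sorted(os.listdir(job_dir))


job_dir, files = run_job([[1, 2], [3]])
```

If any task raised before `commit_job` ran, the destination would contain only the `_temporary` directory, so readers would never see a partial result. Object stores without atomic rename need a different committer, which this sketch does not attempt to model.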

7. External interfaces (summary)

| Interface | Protocol / format | Entry |
| --- | --- | --- |
| CLI | shell scripts | `bin/*` |
| Scala/Java API | in-process | `SparkSession`, `SparkContext` |
| Python API | Py4J TCP | `pyspark` |
| Spark Connect | gRPC + Protobuf + Arrow | `SparkConnectClient` |
| JDBC/ODBC | Thrift JDBC | `sql/hive-thriftserver` |
| REST submission | JSON | `core/.../deploy/rest/` |
| Filesystem | Hadoop FileSystem API | DataSources, checkpoints |
| Plugin SPI | `ServiceLoader`, configured class names | `ExternalClusterManager`, DataSource V2, catalog plugins (`spark.sql.catalog.*`) |

Dependencies and architecture patterns

Dependency direction

python/R APIs → sql/api → sql/core → sql/catalyst
↓
core → common/*
↓
hadoop-client, netty
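
One way to sanity-check a dependency direction like this is to encode it as a graph and confirm it has a valid topological order. The edge set below is one reading of the diagram combined with the "Depends on" notes earlier in the document, so treat it as an assumption rather than the actual build graph. `build_order` is a hypothetical helper.

```python
from graphlib import TopologicalSorter

# module -> modules it depends on (assumed reading of the diagram)
DEPENDS_ON = {
    "python/R APIs": {"sql/api"},
    "sql/api": {"sql/core"},
    "sql/core": {"sql/catalyst", "core"},
    "sql/catalyst": {"core"},
    "core": {"common/*", "hadoop-client"},
    "common/*": {"netty"},
}


def build_order(depends_on):
    """Return modules with dependencies first; raises CycleError on a cycle."""
    return list(TopologicalSorter(depends_on).static_order())


order = build_order(DEPENDS_ON)
```

A cycle, such as `core` depending back on `sql/core`, would make `static_order()` raise `graphlib.CycleError`, which is the property a layered architecture is meant to rule out.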

Patterns in use

Lineage-based DAG scheduler — not a general-purpose workflow engine
Rule-based query compiler (Catalyst) — similar to functional compiler passes
Volcano-style iterator model — SparkPlan.execute() produces RDD pipelines
Whole-stage codegen — fuses operators into generated Java for hot paths
Adaptive Query Execution (AQE) — re-optimizes mid-query based on runtime stats
Plugin / SPI (cluster managers and data sources via Java ServiceLoader; catalogs via class names configured under spark.sql.catalog.*)
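
The iterator model and the motivation for whole-stage codegen can be illustrated with Python generators. In the sketch below each operator pulls rows from its child one at a time, and `fused` shows the kind of single loop that operator fusion aims for. All names are hypothetical, and this says nothing about how Spark's generated Java actually looks.

```python
def scan(rows):
    """Leaf operator: yield rows from the source."""
    for row in rows:
        yield row


def filter_op(child, predicate):
    """Pull rows from the child and pass on those matching the predicate."""
    for row in child:
        if predicate(row):
            yield row


def project(child, columns):
    """Keep only the requested columns of each row."""
    for row in child:
        yield {c: row[c] for c in columns}


def fused(rows):
    """The same pipeline collapsed into one loop, with no per-operator calls."""
    out = []
    for row in rows:
        if row["amount"] > 50:
            out.append({"id": row["id"]})
    return out


rows = [
    {"id": 1, "amount": 40, "region": "eu"},
    {"id": 2, "amount": 75, "region": "us"},
    {"id": 3, "amount": 90, "region": "eu"},
]

plan = project(filter_op(scan(rows), lambda r: r["amount"] > 50), ["id"])
result = list(plan)
```

Both forms produce the same rows; the fused loop simply avoids the per-row handoff between operators, which is where the iterator model spends much of its overhead on hot paths.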

Coupling vs separation

| Clean separation | Tight coupling |
| --- | --- |
| Catalyst logical vs physical plans | `SparkEnv` global singleton |
| SchedulerBackend trait vs implementations | SQL ↔ Core via RDD bridge (necessary) |
| DataSource V2 connector API | Hive module patches SessionState builder |
| Launcher vs core classpath | PySpark requires matching JVM/Python versions |
