
[SPARK-56698][PYTHON] Add Spark MCP (Model Context Protocol) server #55648

Open

viirya wants to merge 6 commits into apache:master from viirya:spark-mcp-server

Conversation

@viirya
Member

@viirya viirya commented May 1, 2026

What changes were proposed in this pull request?

Adds an Apache Spark MCP server that acts as a thin client over Spark Connect, exposing Spark capabilities as MCP tools for LLM consumption. Catalog browsing, SQL execution, and query plan tools are read-only by default.

Module layout (python/pyspark/sql/mcp/):

  • server.py CLI entry point and MCP tool registration
  • config.py ServerConfig dataclass (env + CLI sources)
  • session.py Lazy SparkSession holder over Spark Connect
  • safety.py Read-only SQL guardrail
  • tools/registry.py Tool spec / handler abstraction
  • tools/session.py get_session_info (with config redaction)
  • tools/catalog.py list_catalogs, list_databases, list_tables, describe_table
  • tools/query.py list_functions, execute_sql, preview_table, explain_query, analyze_query

Tool handlers are MCP-SDK-agnostic and Connect-import-free at module load time, so the unit tests run without grpcio or the mcp SDK installed. 14 unit tests in python/pyspark/sql/tests/mcp/test_mcp_tools.py exercise the full tool surface against an in-memory fake session.
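
Since the read-only guardrail is the core of the default safety posture, here is a minimal sketch of what a statement-level filter of this kind can look like. The function name `is_read_only_sql`, the keyword list, and the comment stripping are illustrative assumptions, not the actual contents of safety.py:

```python
import re

# Statements that mutate data or metadata; anything matching is rejected
# when the server runs in read-only mode. The keyword list is illustrative.
_WRITE_KEYWORDS = re.compile(
    r"^\s*(INSERT|UPDATE|DELETE|MERGE|DROP|CREATE|ALTER|TRUNCATE|SET)\b",
    re.IGNORECASE,
)

def is_read_only_sql(sql: str) -> bool:
    """Return True if the statement looks like a read-only query."""
    # Strip leading line comments so "-- note\nDROP TABLE t" is still caught.
    stripped = re.sub(r"\A(\s*--[^\n]*\n)+", "", sql)
    return _WRITE_KEYWORDS.match(stripped) is None
```

In a design like this, execute_sql would consult the check before submitting the statement whenever the read-only flag is set.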

Why are the changes needed?

LLM clients can already talk to MCP servers; Spark Connect already separates client from cluster. This module connects the two: a Spark cluster shows up to an LLM as a set of safe, paginated tools — list_tables, describe_table, execute_sql, explain_query, etc. Users can interact with Spark using natural language.

Does this PR introduce any user-facing change?

Yes. Users can run Spark queries in natural language from LLM clients such as Claude Code using these MCP tools.

How was this patch tested?

Unit tests, plus manual testing in Claude Code.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code

viirya added 5 commits April 30, 2026 19:57
… server

Adds an Apache Spark MCP server that acts as a thin client over Spark
Connect, exposing Spark capabilities as MCP tools for LLM consumption.
Catalog browsing, SQL execution, and query plan tools are read-only by
default.

Module layout (python/pyspark/sql/mcp/):
- server.py         CLI entry point and MCP tool registration
- config.py         ServerConfig dataclass (env + CLI sources)
- session.py        Lazy SparkSession holder over Spark Connect
- safety.py         Read-only SQL guardrail
- tools/registry.py Tool spec / handler abstraction
- tools/session.py  get_session_info (with config redaction)
- tools/catalog.py  list_catalogs, list_databases, list_tables, describe_table
- tools/query.py    list_functions, execute_sql, preview_table,
                    explain_query, analyze_query

Tool handlers are MCP-SDK-agnostic and Connect-import-free at module
load time, so the unit tests run without grpcio or the mcp SDK
installed. 14 unit tests in python/pyspark/sql/tests/mcp/test_mcp_tools.py
exercise the full tool surface against an in-memory fake session.

Co-authored-by: Claude Code
…t real Spark Connect

Adds python/pyspark/sql/tests/mcp/test_mcp_integration.py: an
end-to-end test that boots an in-process Spark Connect session via
SparkSession.builder.remote("local[2]") and exercises every MCP tool
handler against real Spark.

The integration test reuses the same registry and handlers as the
unit suite, swapping only the SessionHolder via a small _PreboundHolder
subclass. This validates that handler code agrees with the real
Connect client for catalog browsing (list_catalogs, list_databases,
list_tables, describe_table), SQL execution (execute_sql with paging
+ truncation, preview_table, read-only filter), and plan inspection
(explain_query, analyze_query).

setUpClass applies a few Spark configs that local in-process Connect
needs but the MCP server itself does not impose:

- spark.ui.enabled=false: avoid a Jetty classpath mismatch.
- spark.driver.bindAddress=127.0.0.1 and spark.driver.host=127.0.0.1:
  force loopback so executors can reach the REPL artifact server.

The test is gated by pyspark.testing.utils.should_test_connect, so it
skips automatically when Connect dependencies (grpcio, mcp SDK,
zstandard, etc.) are missing.

Run with:
  SPARK_HOME=<spark-build> PYTHONPATH=python python -m unittest \
    pyspark.sql.tests.mcp.test_mcp_integration

Verified locally: 11/11 integration + 14/14 unit tests pass in 5.3s.

Co-authored-by: Claude Code
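
For reference, a rough sketch of the bootstrap this commit describes, assuming only the `SparkSession.builder.remote` API and the configs named above; the test class name and the assertion body are placeholders for the real registry-driven handler round-trips:

```python
import unittest
from pyspark.sql import SparkSession

class MCPIntegrationSketch(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        # Boot an in-process Spark Connect session; the driver configs
        # force loopback and the UI is disabled, as described above.
        cls.spark = (
            SparkSession.builder
            .config("spark.ui.enabled", "false")
            .config("spark.driver.bindAddress", "127.0.0.1")
            .config("spark.driver.host", "127.0.0.1")
            .remote("local[2]")
            .getOrCreate()
        )

    @classmethod
    def tearDownClass(cls):
        cls.spark.stop()

    def test_sql_roundtrip(self):
        # Placeholder for driving the MCP handlers through the registry
        # with a prebound session (the PR's _PreboundHolder).
        rows = self.spark.sql("SELECT 1 AS one").collect()
        self.assertEqual(rows[0]["one"], 1)
```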
…o MCP server README

Captures a representative interaction: a user asks an LLM to interpret
the plan for a moderate aggregation query without executing it. The
LLM calls analyze_query, then produces a step-by-step explanation
grounded in operator semantics — Range splits, filter ordering /
PushDownPredicates, partial vs final HashAggregate, hash-partitioned
Exchange with skewed group keys, and AdaptiveSparkPlan
isFinalPlan=false. Used as the canonical motivating example for why
the plan tools are high-leverage.

Co-authored-by: Claude Code
Minor doc-only cleanups:
- README: drop status banner and "(planned)" qualifier on the tools
  list; tighten the plan-tools intro.
- session.py / tools/registry.py: shorten internal docstrings.

No behavioural change. Unit tests still pass.

Co-authored-by: Claude Code
…r README

Adds a "Configuring an MCP client" section covering:
- The general shape (stdio command + SPARK_REMOTE env var).
- Claude Code (`claude mcp add` invocation, scope options).
- Claude Desktop (claude_desktop_config.json snippet).
- Generic stdio MCP clients.
- Recognized environment variables (SPARK_REMOTE,
  SPARK_MCP_READ_ONLY, SPARK_MCP_USER_ID, SPARK_MCP_TRANSPORT).
- A quick verification suggestion (call get_session_info).

Co-authored-by: Claude Code
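
The Claude Desktop entry from that section presumably looks something like the following; the launch command (`python -m pyspark.sql.mcp.server`) and the sc:// URL are assumptions based on the module layout, while the env var names come from the list above:

```json
{
  "mcpServers": {
    "spark": {
      "command": "python",
      "args": ["-m", "pyspark.sql.mcp.server"],
      "env": {
        "SPARK_REMOTE": "sc://localhost:15002",
        "SPARK_MCP_READ_ONLY": "true"
      }
    }
  }
}
```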
@viirya viirya changed the title [SPARK-XXXXX][PYTHON] Add Spark MCP (Model Context Protocol) server [SPARK-56698][PYTHON] Add Spark MCP (Model Context Protocol) server May 1, 2026
@gaogaotiantian
Contributor

Personally I think this requires an SPIP. This is a completely new feature that requires some maintenance effort.

…catalog/db allow-list

Two related changes to the Spark MCP server:

1. Drop allowed_catalogs / allowed_databases. The fields existed on
   ServerConfig and were partially honored by catalog tools, but were
   never exposed via CLI or env vars and could not enforce isolation
   anyway -- execute_sql trivially bypasses them with fully-qualified
   names or USE. Catalog/database isolation belongs at the Spark
   Connect endpoint via the identity the MCP server authenticates as.

2. Wire up max_rows, query_timeout_seconds, and user_id from CLI and
   env vars (--max-rows / SPARK_MCP_MAX_ROWS, --query-timeout-seconds /
   SPARK_MCP_QUERY_TIMEOUT_SECONDS, --user-id / SPARK_MCP_USER_ID).
   Previously these dataclass fields could only be set by code, so the
   defaults were unreachable from the CLI. query_timeout_seconds is now
   actually enforced: execute_sql / explain_query / analyze_query run
   their blocking Spark Connect calls via asyncio.to_thread under
   asyncio.wait_for, raising QueryTimeoutError on timeout. A timeout of
   0 disables the cap.

Tests cover the new env / CLI parsing and both the timeout-fires and
timeout-disabled paths. README env-var table updated.

Co-authored-by: Claude Code
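
A minimal sketch of the timeout pattern described in item 2; the QueryTimeoutError and the wrapper shape are paraphrased assumptions, and only the asyncio.to_thread + asyncio.wait_for combination and the 0-disables semantics are taken from the commit message:

```python
import asyncio

class QueryTimeoutError(Exception):
    """Raised when a blocking Spark Connect call exceeds the cap."""

async def run_with_timeout(blocking_call, timeout_seconds: float):
    # A timeout of 0 disables the cap, matching the PR's semantics.
    cap = None if timeout_seconds == 0 else timeout_seconds
    try:
        # Run the blocking Connect call off the event loop, bounded by cap.
        return await asyncio.wait_for(asyncio.to_thread(blocking_call), cap)
    except asyncio.TimeoutError as exc:
        raise QueryTimeoutError(f"query exceeded {timeout_seconds}s") from exc
```

A handler such as execute_sql could then, for example, do `await run_with_timeout(lambda: session.sql(stmt).collect(), config.query_timeout_seconds)`.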
