[SPARK-56698][PYTHON] Add Spark MCP (Model Context Protocol) server#55648
Open
viirya wants to merge 6 commits into apache:master from
Conversation
… server
Adds an Apache Spark MCP server that acts as a thin client over Spark
Connect, exposing Spark capabilities as MCP tools for LLM consumption.
Catalog browsing, SQL execution, and query plan tools are read-only by
default.
Module layout (python/pyspark/sql/mcp/):
- server.py CLI entry point and MCP tool registration
- config.py ServerConfig dataclass (env + CLI sources)
- session.py Lazy SparkSession holder over Spark Connect
- safety.py Read-only SQL guardrail
- tools/registry.py Tool spec / handler abstraction
- tools/session.py get_session_info (with config redaction)
- tools/catalog.py list_catalogs, list_databases, list_tables, describe_table
- tools/query.py list_functions, execute_sql, preview_table,
explain_query, analyze_query
Tool handlers are MCP-SDK-agnostic and Connect-import-free at module
load time, so the unit tests run without grpcio or the mcp SDK
installed. 14 unit tests in python/pyspark/sql/tests/mcp/test_mcp_tools.py
exercise the full tool surface against an in-memory fake session.
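The tool spec / handler abstraction described above could look roughly like the following sketch. The names here (ToolSpec, TOOLS, register, call_tool) are illustrative, not the PR's actual API; the point is that handlers are plain callables over a duck-typed session, with no MCP SDK or Spark Connect imports at module load time.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict

# Hypothetical registry sketch: handlers are plain callables, so tests
# can drive them with an in-memory fake session and no grpcio / mcp SDK.
@dataclass(frozen=True)
class ToolSpec:
    name: str
    description: str
    handler: Callable[..., Dict[str, Any]]  # no MCP SDK types in the signature

TOOLS: Dict[str, ToolSpec] = {}

def register(spec: ToolSpec) -> None:
    TOOLS[spec.name] = spec

def call_tool(name: str, **kwargs: Any) -> Dict[str, Any]:
    return TOOLS[name].handler(**kwargs)

# Example handler: only needs a session-like object with the attributes it reads.
def get_session_info(session: Any) -> Dict[str, Any]:
    return {"version": session.version}

register(ToolSpec("get_session_info", "Describe the Spark session", get_session_info))
```

Under this shape, the MCP server layer is the only place that translates ToolSpec entries into SDK tool registrations, and the unit suite calls the handlers directly with a fake session.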
Co-authored-by: Claude Code
…t real Spark Connect
Adds python/pyspark/sql/tests/mcp/test_mcp_integration.py: an
end-to-end test that boots an in-process Spark Connect session via
SparkSession.builder.remote("local[2]") and exercises every MCP tool
handler against real Spark.
The integration test reuses the same registry and handlers as the
unit suite, swapping only the SessionHolder via a small _PreboundHolder
subclass. This validates that handler code agrees with the real
Connect client for catalog browsing (list_catalogs, list_databases,
list_tables, describe_table), SQL execution (execute_sql with paging
+ truncation, preview_table, read-only filter), and plan inspection
(explain_query, analyze_query).
setUpClass applies two Spark configs that local in-process Connect
needs but the MCP server itself does not impose:
- spark.ui.enabled=false: avoid a Jetty classpath mismatch
- spark.driver.bindAddress=127.0.0.1 and spark.driver.host=127.0.0.1: force loopback so executors can reach the REPL artifact server
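A setUpClass applying those configs might look like the fragment below. This is a sketch only: it needs a local Spark build with Connect dependencies installed, and the exact builder chain is illustrative.

```python
# Config fragment (not runnable without a Spark build): boot an in-process
# Spark Connect session with the two configs the integration test needs.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.remote("local[2]")
    .config("spark.ui.enabled", "false")            # avoid Jetty classpath mismatch
    .config("spark.driver.bindAddress", "127.0.0.1")
    .config("spark.driver.host", "127.0.0.1")       # loopback for REPL artifact server
    .getOrCreate()
)
```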
The test is gated by pyspark.testing.utils.should_test_connect, so it
skips automatically when Connect dependencies (grpcio, mcp SDK,
zstandard, etc.) are missing.
Run with:
SPARK_HOME=<spark-build> PYTHONPATH=python python -m unittest \
pyspark.sql.tests.mcp.test_mcp_integration
Verified locally: 11/11 integration + 14/14 unit tests pass in 5.3s.
Co-authored-by: Claude Code
…o MCP server README

Captures a representative interaction: a user asks an LLM to interpret the plan for a moderate aggregation query without executing it. The LLM calls analyze_query, then produces a step-by-step explanation grounded in operator semantics: Range splits, filter ordering / PushDownPredicates, partial vs. final HashAggregate, hash-partitioned Exchange with skewed group keys, and AdaptiveSparkPlan isFinalPlan=false. Used as the canonical motivating example for why the plan tools are high-leverage.

Co-authored-by: Claude Code
Minor doc-only cleanups:
- README: drop the status banner and the "(planned)" qualifier on the tools list; tighten the plan-tools intro.
- session.py / tools/registry.py: shorten internal docstrings.
No behavioural change. Unit tests still pass.
Co-authored-by: Claude Code
…r README

Adds a "Configuring an MCP client" section covering:
- The general shape (stdio command + SPARK_REMOTE env var).
- Claude Code (`claude mcp add` invocation, scope options).
- Claude Desktop (claude_desktop_config.json snippet).
- Generic stdio MCP clients.
- Recognized environment variables (SPARK_REMOTE, SPARK_MCP_READ_ONLY, SPARK_MCP_USER_ID, SPARK_MCP_TRANSPORT).
- A quick verification suggestion (call get_session_info).

Co-authored-by: Claude Code
Contributor
Personally I think this requires a SPIP. This is a completely new feature that requires some maintenance effort.
…catalog/db allow-list

Two related changes to the Spark MCP server:

1. Drop allowed_catalogs / allowed_databases. The fields existed on ServerConfig and were partially honored by catalog tools, but were never exposed via CLI or env vars and could not enforce isolation anyway: execute_sql trivially bypasses them with fully-qualified names or USE. Catalog/database isolation belongs at the Spark Connect endpoint, via the identity the MCP server authenticates as.

2. Wire up max_rows, query_timeout_seconds, and user_id from CLI and env vars (--max-rows / SPARK_MCP_MAX_ROWS, --query-timeout-seconds / SPARK_MCP_QUERY_TIMEOUT_SECONDS, --user-id / SPARK_MCP_USER_ID). Previously these dataclass fields could only be set by code, so the defaults were unreachable from the CLI. query_timeout_seconds is now actually enforced: execute_sql / explain_query / analyze_query run their blocking Spark Connect calls via asyncio.to_thread under asyncio.wait_for, raising QueryTimeoutError on timeout. A timeout of 0 disables the cap.

Tests cover the new env / CLI parsing and both the timeout-fires and timeout-disabled paths. README env-var table updated.

Co-authored-by: Claude Code
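The timeout enforcement described above can be sketched as a small helper. The helper name run_bounded is hypothetical; the mechanism (asyncio.to_thread under asyncio.wait_for, with 0 disabling the cap) is what the commit message states.

```python
import asyncio


class QueryTimeoutError(Exception):
    """Raised when a blocking call exceeds the configured cap."""


# Illustrative sketch: run a blocking function off the event loop,
# bounded by query_timeout_seconds; a value of 0 disables the cap.
async def run_bounded(blocking_fn, query_timeout_seconds: float):
    call = asyncio.to_thread(blocking_fn)
    if query_timeout_seconds <= 0:
        return await call
    try:
        return await asyncio.wait_for(call, timeout=query_timeout_seconds)
    except asyncio.TimeoutError as exc:
        raise QueryTimeoutError(
            f"query exceeded {query_timeout_seconds}s"
        ) from exc
```

Note that asyncio.wait_for cancels the awaiting task on timeout, but the worker thread running the blocking Spark Connect call is not interrupted; the timeout bounds how long the handler waits, not the call itself.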
What changes were proposed in this pull request?
Adds an Apache Spark MCP server that acts as a thin client over Spark Connect, exposing Spark capabilities as MCP tools for LLM consumption. Catalog browsing, SQL execution, and query plan tools are read-only by default.
Module layout (python/pyspark/sql/mcp/):
Tool handlers are MCP-SDK-agnostic and Connect-import-free at module load time, so the unit tests run without grpcio or the mcp SDK installed. 14 unit tests in python/pyspark/sql/tests/mcp/test_mcp_tools.py exercise the full tool surface against an in-memory fake session.
Why are the changes needed?
LLM clients can already talk to MCP servers; Spark Connect already separates client from cluster. This module connects the two: a Spark cluster shows up to an LLM as a set of safe, paginated tools (list_tables, describe_table, execute_sql, explain_query, etc.), and users can interact with Spark using natural language.

Does this PR introduce any user-facing change?
Yes. Users can run Spark queries in natural language from LLM clients like Claude Code using these MCP tools.
How was this patch tested?
Unit tests, plus manual testing in Claude Code.
Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude Code