[SPARK-56698][PYTHON] Add Spark MCP (Model Context Protocol) server#55648
Open
viirya wants to merge 6 commits into apache:master from
Conversation
… server
Adds an Apache Spark MCP server that acts as a thin client over Spark
Connect, exposing Spark capabilities as MCP tools for LLM consumption.
Catalog browsing, SQL execution, and query plan tools are read-only by
default.
Module layout (python/pyspark/sql/mcp/):
- server.py CLI entry point and MCP tool registration
- config.py ServerConfig dataclass (env + CLI sources)
- session.py Lazy SparkSession holder over Spark Connect
- safety.py Read-only SQL guardrail
- tools/registry.py Tool spec / handler abstraction
- tools/session.py get_session_info (with config redaction)
- tools/catalog.py list_catalogs, list_databases, list_tables, describe_table
- tools/query.py list_functions, execute_sql, preview_table,
explain_query, analyze_query
Tool handlers are MCP-SDK-agnostic and Connect-import-free at module
load time, so the unit tests run without grpcio or the mcp SDK
installed. 14 unit tests in python/pyspark/sql/tests/mcp/test_mcp_tools.py
exercise the full tool surface against an in-memory fake session.
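The tool spec / handler abstraction described above could look roughly like the following sketch. The names here (ToolSpec, TOOLS, register, call_tool) are illustrative, not the PR's actual API; the point is that handlers are plain callables over a duck-typed session, with no MCP SDK or Spark Connect imports at module load time.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict

# Hypothetical registry sketch: handlers are plain callables, so tests
# can drive them with an in-memory fake session and no grpcio / mcp SDK.
@dataclass(frozen=True)
class ToolSpec:
    name: str
    description: str
    handler: Callable[..., Dict[str, Any]]  # no MCP SDK types in the signature

TOOLS: Dict[str, ToolSpec] = {}

def register(spec: ToolSpec) -> None:
    TOOLS[spec.name] = spec

def call_tool(name: str, **kwargs: Any) -> Dict[str, Any]:
    return TOOLS[name].handler(**kwargs)

# Example handler: only needs a session-like object with the attributes it reads.
def get_session_info(session: Any) -> Dict[str, Any]:
    return {"version": session.version}

register(ToolSpec("get_session_info", "Describe the Spark session", get_session_info))
```

Under this shape, the MCP server layer is the only place that translates ToolSpec entries into SDK tool registrations, and the unit suite calls the handlers directly with a fake session.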
Co-authored-by: Claude Code
…t real Spark Connect
Adds python/pyspark/sql/tests/mcp/test_mcp_integration.py: an
end-to-end test that boots an in-process Spark Connect session via
SparkSession.builder.remote("local[2]") and exercises every MCP tool
handler against real Spark.
The integration test reuses the same registry and handlers as the
unit suite, swapping only the SessionHolder via a small _PreboundHolder
subclass. This validates that handler code agrees with the real
Connect client for catalog browsing (list_catalogs, list_databases,
list_tables, describe_table), SQL execution (execute_sql with paging
+ truncation, preview_table, read-only filter), and plan inspection
(explain_query, analyze_query).
setUpClass applies two Spark configs that local in-process Connect
needs but the MCP server itself does not impose:
- spark.ui.enabled=false: avoid a Jetty classpath mismatch
- spark.driver.bindAddress=127.0.0.1 and spark.driver.host=127.0.0.1: force loopback so executors can reach the REPL artifact server
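A setUpClass applying those configs might look like the fragment below. This is a sketch only: it needs a local Spark build with Connect dependencies installed, and the exact builder chain is illustrative.

```python
# Config fragment (not runnable without a Spark build): boot an in-process
# Spark Connect session with the two configs the integration test needs.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.remote("local[2]")
    .config("spark.ui.enabled", "false")            # avoid Jetty classpath mismatch
    .config("spark.driver.bindAddress", "127.0.0.1")
    .config("spark.driver.host", "127.0.0.1")       # loopback for REPL artifact server
    .getOrCreate()
)
```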
The test is gated by pyspark.testing.utils.should_test_connect, so it
skips automatically when Connect dependencies (grpcio, mcp SDK,
zstandard, etc.) are missing.
Run with:
SPARK_HOME=<spark-build> PYTHONPATH=python python -m unittest \
pyspark.sql.tests.mcp.test_mcp_integration
Verified locally: 11/11 integration + 14/14 unit tests pass in 5.3s.
Co-authored-by: Claude Code
…o MCP server README

Captures a representative interaction: a user asks an LLM to interpret the plan for a moderate aggregation query without executing it. The LLM calls analyze_query, then produces a step-by-step explanation grounded in operator semantics: Range splits, filter ordering / PushDownPredicates, partial vs. final HashAggregate, hash-partitioned Exchange with skewed group keys, and AdaptiveSparkPlan isFinalPlan=false. Used as the canonical motivating example for why the plan tools are high-leverage.

Co-authored-by: Claude Code
Minor doc-only cleanups:
- README: drop the status banner and the "(planned)" qualifier on the tools list; tighten the plan-tools intro.
- session.py / tools/registry.py: shorten internal docstrings.
No behavioural change. Unit tests still pass.
Co-authored-by: Claude Code
…r README

Adds a "Configuring an MCP client" section covering:
- The general shape (stdio command + SPARK_REMOTE env var).
- Claude Code (`claude mcp add` invocation, scope options).
- Claude Desktop (claude_desktop_config.json snippet).
- Generic stdio MCP clients.
- Recognized environment variables (SPARK_REMOTE, SPARK_MCP_READ_ONLY, SPARK_MCP_USER_ID, SPARK_MCP_TRANSPORT).
- A quick verification suggestion (call get_session_info).

Co-authored-by: Claude Code
Contributor
Personally I think this requires a SPIP. This is a completely new feature that requires some maintenance effort.
…catalog/db allow-list

Two related changes to the Spark MCP server:

1. Drop allowed_catalogs / allowed_databases. The fields existed on ServerConfig and were partially honored by catalog tools, but were never exposed via CLI or env vars and could not enforce isolation anyway: execute_sql trivially bypasses them with fully-qualified names or USE. Catalog/database isolation belongs at the Spark Connect endpoint, via the identity the MCP server authenticates as.

2. Wire up max_rows, query_timeout_seconds, and user_id from CLI and env vars (--max-rows / SPARK_MCP_MAX_ROWS, --query-timeout-seconds / SPARK_MCP_QUERY_TIMEOUT_SECONDS, --user-id / SPARK_MCP_USER_ID). Previously these dataclass fields could only be set by code, so the defaults were unreachable from the CLI. query_timeout_seconds is now actually enforced: execute_sql / explain_query / analyze_query run their blocking Spark Connect calls via asyncio.to_thread under asyncio.wait_for, raising QueryTimeoutError on timeout. A timeout of 0 disables the cap.

Tests cover the new env / CLI parsing and both the timeout-fires and timeout-disabled paths. README env-var table updated.

Co-authored-by: Claude Code
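The timeout enforcement described above can be sketched as a small helper. The helper name run_bounded is hypothetical; the mechanism (asyncio.to_thread under asyncio.wait_for, with 0 disabling the cap) is what the commit message states.

```python
import asyncio


class QueryTimeoutError(Exception):
    """Raised when a blocking call exceeds the configured cap."""


# Illustrative sketch: run a blocking function off the event loop,
# bounded by query_timeout_seconds; a value of 0 disables the cap.
async def run_bounded(blocking_fn, query_timeout_seconds: float):
    call = asyncio.to_thread(blocking_fn)
    if query_timeout_seconds <= 0:
        return await call
    try:
        return await asyncio.wait_for(call, timeout=query_timeout_seconds)
    except asyncio.TimeoutError as exc:
        raise QueryTimeoutError(
            f"query exceeded {query_timeout_seconds}s"
        ) from exc
```

Note that asyncio.wait_for cancels the awaiting task on timeout, but the worker thread running the blocking Spark Connect call is not interrupted; the timeout bounds how long the handler waits, not the call itself.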
What changes were proposed in this pull request?
Adds an Apache Spark MCP server that acts as a thin client over Spark Connect, exposing Spark capabilities as MCP tools for LLM consumption. Catalog browsing, SQL execution, and query plan tools are read-only by default.
Module layout (python/pyspark/sql/mcp/):
Tool handlers are MCP-SDK-agnostic and Connect-import-free at module load time, so the unit tests run without grpcio or the mcp SDK installed. 14 unit tests in python/pyspark/sql/tests/mcp/test_mcp_tools.py exercise the full tool surface against an in-memory fake session.
Why are the changes needed?
LLM clients can already talk to MCP servers; Spark Connect already separates client from cluster. This module connects the two: a Spark cluster shows up to an LLM as a set of safe, paginated tools (list_tables, describe_table, execute_sql, explain_query, etc.), and users can interact with Spark using natural language.

Does this PR introduce any user-facing change?
Yes. Users can run Spark queries in natural language from LLM clients like Claude Code using these MCP tools.
How was this patch tested?
Unit tests, plus manual testing in Claude Code.
Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude Code