feat: Arrow-direct codegen dispatcher for Spark expressions and Scala UDFs#4267
Draft
mbutrovich wants to merge 24 commits intoapache:mainfrom
Draft
feat: Arrow-direct codegen dispatcher for Spark expressions and Scala UDFs#4267mbutrovich wants to merge 24 commits intoapache:mainfrom
mbutrovich wants to merge 24 commits intoapache:mainfrom
Conversation
This was referenced May 8, 2026
Contributor
Author
|
There are like 4 Spark SQL test failures that look like they might need updating, but otherwise it's looking good. Not gonna worry about them until we discuss moving forward. |
…ted body" on Spark 3.5
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Draft while we discuss with #4233 and #4239.
Which issue does this PR close?
Closes #.
Rationale for this change
#4232 merged the JVM UDF bridge. This PR adds a codegen dispatcher on top: one generic
CometUDFthat compiles a specialized batch kernel per bound CatalystExpression+ input schema via Janino.Payoff:
ScalaUDFor Catalyst expression routes through native without a hand-writtenCometUDF.ScalaUDFs and Catalyst expressions share one expression tree, so Comet keeps surrounding native operators in place.Opt-in via
spark.comet.exec.codegenDispatch.mode = auto | force | disabled. Primary targets: string expressions and userScalaUDFs.What changes are included in this PR?
CometBatchKernelCodegen(orchestrator) +CometBatchKernelCodegenInput/CometBatchKernelCodegenOutput(per-side emission) +CometCodegenDispatchUDF(bridge entry, three-layer cache).ArrayTypeandStructTypeas both input and output, including arbitrary nesting (Array<Array>,Array<Struct>,Struct<Array>,Struct<Struct>, and deeper). SealedArrowColumnSpec+ recursive nested-class emission.RegExpReplaceemitter bypassing theUTF8Stringround-trip Spark'sdoGenCodeforces.(expression, input schema): zero-copy UTF8 reads (VarCharVectorandViewVarCharVector), non-nullableisNullAtelision, decimal short-precision fast path on both input and output, UTF8 on-heap write shortcut, pre-sized variable-length output buffers,NullIntolerantshort-circuit, non-nullable output short-circuit, subexpression elimination.numRowsparameter (zero-column expressions);TaskContextpropagation across JNI so partition-sensitive expressions (Rand,Uuid,MonotonicallyIncreasingID, user UDFs readingTaskContext.get()) see the Spark task context from the Tokio worker.CometScalaUDFroutes anyScalaUDF; the regex family (rlike,regexp_replace,regexp_extract,regexp_extract_all,regexp_instr,split) gets a uniformpickWithModeswitch; native Rust paths preserved where they exist.docs/source/contributor-guide/jvm_udf_dispatch.mdcovers the architecture, the regex routing matrix, and the opt-in steps for adding an expression to codegen dispatch.How are these changes tested?
CometCodegenSourceSuite- generated-source assertions for every optimization and every complex-type shape.CometCodegenDispatchSmokeSuite- end-to-end correctness across the type surface, composed-UDF trees, decimal boundaries, subquery reuse,TaskContextpropagation, array input/output.CometCodegenDispatchFuzzSuite- randomized string fuzz + decimal identity fuzz at several null densities.CometRegExpJvmSuite- SQL-level Spark-vs-Comet correctness for the regex family.CometScalaUDFCompositionBenchmark- Spark vs Comet native built-ins vs dispatcherdisabledvs dispatcherforceover three shapes.