Skip to content

feat: Arrow-direct codegen dispatcher for Spark expressions and Scala UDFs#4267

Draft
mbutrovich wants to merge 24 commits intoapache:mainfrom
mbutrovich:codegen_scala_udf
Draft

feat: Arrow-direct codegen dispatcher for Spark expressions and Scala UDFs#4267
mbutrovich wants to merge 24 commits intoapache:mainfrom
mbutrovich:codegen_scala_udf

Conversation

@mbutrovich
Copy link
Copy Markdown
Contributor

@mbutrovich mbutrovich commented May 8, 2026

Draft while we discuss with #4233 and #4239.

Which issue does this PR close?

Closes #.

Rationale for this change

#4232 merged the JVM UDF bridge. This PR adds a codegen dispatcher on top: one generic CometUDF that compiles a specialized batch kernel per bound Catalyst Expression + input schema via Janino.

Payoff:

  • Any supported ScalaUDF or Catalyst expression routes through native without a hand-written CometUDF.
  • UDFs stop being opaque operator boundaries - ScalaUDFs and Catalyst expressions share one expression tree, so Comet keeps surrounding native operators in place.
  • An entire expression subtree compiles into one per-row loop with stack-local intermediates instead of per-subexpression Arrow batches.

Opt-in via spark.comet.exec.codegenDispatch.mode = auto | force | disabled. Primary targets: string expressions and user ScalaUDFs.

What changes are included in this PR?

  • Codegen dispatcher: CometBatchKernelCodegen (orchestrator) + CometBatchKernelCodegenInput / CometBatchKernelCodegenOutput (per-side emission) + CometCodegenDispatchUDF (bridge entry, three-layer cache).
  • Complex type support: ArrayType and StructType as both input and output, including arbitrary nesting (Array<Array>, Array<Struct>, Struct<Array>, Struct<Struct>, and deeper). Sealed ArrowColumnSpec + recursive nested-class emission.
  • Per-expression specialization: direct-bytes RegExpReplace emitter bypassing the UTF8String round-trip Spark's doGenCode forces.
  • Optimization set applied per (expression, input schema): zero-copy UTF8 reads (VarCharVector and ViewVarCharVector), non-nullable isNullAt elision, decimal short-precision fast path on both input and output, UTF8 on-heap write shortcut, pre-sized variable-length output buffers, NullIntolerant short-circuit, non-nullable output short-circuit, subexpression elimination.
  • Bridge contract additions: numRows parameter (zero-column expressions); TaskContext propagation across JNI so partition-sensitive expressions (Rand, Uuid, MonotonicallyIncreasingID, user UDFs reading TaskContext.get()) see the Spark task context from the Tokio worker.
  • Serde routing: CometScalaUDF routes any ScalaUDF; the regex family (rlike, regexp_replace, regexp_extract, regexp_extract_all, regexp_instr, split) gets a uniform pickWithMode switch; native Rust paths preserved where they exist.
  • Docs: docs/source/contributor-guide/jvm_udf_dispatch.md covers the architecture, the regex routing matrix, and the opt-in steps for adding an expression to codegen dispatch.

How are these changes tested?

  • CometCodegenSourceSuite - generated-source assertions for every optimization and every complex-type shape.
  • CometCodegenDispatchSmokeSuite - end-to-end correctness across the type surface, composed-UDF trees, decimal boundaries, subquery reuse, TaskContext propagation, array input/output.
  • CometCodegenDispatchFuzzSuite - randomized string fuzz + decimal identity fuzz at several null densities.
  • CometRegExpJvmSuite - SQL-level Spark-vs-Comet correctness for the regex family.
  • CometScalaUDFCompositionBenchmark - Spark vs Comet native built-ins vs dispatcher disabled vs dispatcher force over three shapes.

@mbutrovich
Copy link
Copy Markdown
Contributor Author

There are like 4 Spark SQL test failures that look like they might need updating, but otherwise it's looking good. Not gonna worry about them until we discuss moving forward.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant