feat: surface native opt-in expressions as compatible-by-default with a COMET-INFO plan hint #4721
Merged
Wire two dispatch branches in QueryPlanSerde.convert to emit [COMET-INFO] hints via CometSparkSessionExtensions.withInfo when a faster native implementation is available but not currently selected.
- Compatible branch: emit hint when nativeOptIn field is set
- CodegenDispatchFallback branch: emit hint after successful dispatch, using nativeOptInConfigKeyOverride or the per-expression allowIncompatible key
Test: hour(timestamp_ntz) routes through CodegenDispatchFallback and the plan explain now carries [COMET-INFO: A native implementation of Hour...].
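The Compatible branch described in this commit can be pictured with a small, self-contained model. This is only a sketch: the types below are simplified stand-ins rather than Comet's actual definitions, and the message text is invented for illustration (the real wording is truncated in the commit message above).

```scala
object HintDispatchSketch {
  // Hypothetical, simplified stand-ins for the types named in the commit message.
  final case class NativeOptIn(configKey: String)

  sealed trait SupportLevel
  final case class Compatible(nativeOptIn: Option[NativeOptIn] = None) extends SupportLevel
  final case class Incompatible(notes: Option[String] = None) extends SupportLevel

  // One shared builder, so docs and the runtime hint would use the same text.
  // The wording here is an assumption, not Comet's real message.
  def optInMessage(exprName: String, optIn: NativeOptIn): String =
    s"native opt-in available for $exprName via ${optIn.configKey}"

  // Returns a hint only when the serde reported Compatible with nativeOptIn set.
  def hintFor(exprName: String, level: SupportLevel): Option[String] = level match {
    case Compatible(Some(optIn)) => Some(optInMessage(exprName, optIn))
    case _                       => None
  }
}
```

Keeping the decision in one central function, rather than in each serde, is what lets a serde signal the opt-in purely through its returned support level.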
Add NativeOptInAvailable to CometLengthOfJsonArray (json.scala), CometGetJsonObject (strings.scala), CometStructsToJson and CometJsonToStructs (structs.scala), and CometDateFormat (datetime.scala). Each serde now returns Compatible(nativeOptIn = Some(...)) when the native path is applicable but the user has not yet enabled allowIncompatible, surfacing a COMET-INFO hint in EXPLAIN output. CometDateFormat factors a nativeApplicable() predicate shared between getSupportLevel() and convert() to avoid duplicating the format-whitelist check.
When the session timezone is UTC, date_format already routes to the native to_char path regardless of the per-expression allowIncompatible config. The previous getSupportLevel hint condition (!isExprAllowIncompat && nativeApplicable) omitted the isUtc term, so it wrongly told the user to flip a config that would have no effect. Factor isUtc into a shared private helper and tighten the hint condition to nativeApplicable && !isUtc && !isExprAllowIncompat. Extend the regression test to assert the UTC true-negative (no hint) alongside the existing non-UTC positive case.
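The tightened condition from this commit is easiest to check as a truth table. The sketch below is an assumption-laden model: the format whitelist and the string-based UTC check are hypothetical placeholders, and only the boolean expression itself comes from the commit message.

```scala
object DateFormatHintSketch {
  // Hypothetical whitelist; the real format list lives in CometDateFormat.
  private val nativeFormats: Set[String] = Set("yyyy-MM-dd", "HH:mm:ss")

  // Shared predicate used by both the support-level check and conversion.
  def nativeApplicable(format: String): Boolean = nativeFormats.contains(format)

  // Simplified UTC check (assumption: the real helper may normalize zone IDs).
  def isUtc(sessionTimeZone: String): Boolean = sessionTimeZone.equalsIgnoreCase("UTC")

  // Hint only when flipping the config would actually change the execution path.
  def shouldHint(format: String, sessionTimeZone: String, isExprAllowIncompat: Boolean): Boolean =
    nativeApplicable(format) && !isUtc(sessionTimeZone) && !isExprAllowIncompat
}
```

With this shape, a UTC session never produces a hint because the native path is already in use there, which is the true-negative the extended regression test asserts.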
Replace the `codegenDispatch: Boolean` field on `ExprNotes` with `nativeOptIn: Boolean` + `nativeOptInConfigKey: String`. Detect native opt-in via `NativeOptInAvailable` (which `CodegenDispatchFallback` extends), covering both codegen-dispatch expressions and Pattern B serdes. Derive the config key from `nativeOptInConfigKeyOverride` when present. Update the incompatibilities header to read "By default, Comet runs a Spark-compatible implementation of X. Set <key>=true to use Comet's faster native implementation instead..." instead of the old JVM-codegen-dispatch-specific wording. Clarify `getIncompatibleReasons` scaladoc to note these reasons describe the native path only; for NativeOptInAvailable expressions the default path is always Spark-compatible. Regenerate docs/source/user-guide/latest/compatibility/expressions/spark-3.5/ and configs.md for Spark 3.5. Other Spark version directories (3.4, 4.0, 4.1) are left for CI to regenerate on publish.
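A minimal sketch of the key derivation and header rendering described in this commit follows. The function names are hypothetical, the default key pattern is the `spark.comet.expression.<Name>.allowIncompatible` form mentioned elsewhere in this PR, and the header text stops where the commit message elides it.

```scala
object OptInDocsSketch {
  // Use the override when the serde supplies one, else the per-expression key.
  def configKey(exprName: String, overrideKey: Option[String]): String =
    overrideKey.getOrElse(s"spark.comet.expression.$exprName.allowIncompatible")

  // Header wording follows the commit message; the trailing "..." is elided there too.
  def header(exprName: String, key: String): String =
    s"By default, Comet runs a Spark-compatible implementation of $exprName. " +
      s"Set $key=true to use Comet's faster native implementation instead..."
}
```

For example, Upper would resolve to `spark.comet.caseConversion.enabled` through the override, while an expression without an override falls back to its own `allowIncompatible` key.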
Add getIncompatibleReasons() to 11 NativeOptInAvailable serdes so the compatibility guide documents the behavioral differences users accept when enabling the native path: RLike, RegExpReplace, StringSplit, InitCap, Upper/Lower (CometCaseConversionBase), StringReplace, GetJsonObject, LengthOfJsonArray, StructsToJson, JsonToStructs, and DateFormat. Regenerate spark-3.5 compatibility markdown.
Replace generic "Native and JVM results may differ for some inputs" with the specific "Produces different results from Spark when the search string is empty", matching the inline comment in convert().
…ty golden files

Add a "Compatible by default, opt in to native" section to the compatibility index page explaining that Comet runs a Spark-compatible path by default, some expressions have a faster native opt-in via spark.comet.expression.<Name>.allowIncompatible=true, and how the [COMET-INFO: ...] plan segment (distinct from [COMET: ...] fallback) surfaces available opt-in paths in verbose extended explain output.

Fix a compilation error in the withInfo test: withSQLConf returns Unit in Spark 3.5 test infrastructure, so assign the rendered string inside the closure rather than capturing the closure's return value.

Regenerate Spark 3.5 plan-stability golden files: four queries (q24a, q24b, q9 for v1.4; q24 for v2.7) now have spark3.5-specific extended.txt files because they contain [COMET-INFO] hints for Upper (spark.comet.caseConversion.enabled). Seven previously redundant spark3.5 overrides (q28, q61, q77, q88, q90 for v1.4; q22, q77a for v2.7) are removed because their content was identical to the base approved-plans directory. Both CometTPCDSV1_4_PlanStabilitySuite and CometTPCDSV2_7_PlanStabilitySuite pass with the regenerated files.
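The withSQLConf fix mentioned in this commit follows a common pattern for helpers that return Unit. In the sketch below, `withConf` is a hypothetical stand-in for the Spark test helper and the plan string is a placeholder; only the capture pattern is the point.

```scala
object ClosureCaptureSketch {
  // Hypothetical stand-in for a test helper whose result type is Unit.
  def withConf(pairs: (String, String)*)(body: => Unit): Unit = body

  def renderedPlan(): String = {
    // `val rendered: String = withConf(...) { ... }` would not compile,
    // because the helper returns Unit. Assign inside the closure instead.
    var rendered = ""
    withConf("spark.sql.session.timeZone" -> "America/Denver") {
      rendered = "placeholder plan text"
    }
    rendered
  }
}
```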
… [skip ci] Render the [COMET-INFO] segment additively without changing how the existing [COMET: ...] fallback segment renders, so plans that carry no info message are unchanged. Drops an incidental empty-[COMET: ] cleanup that perturbed an unrelated plan-stability golden file.
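One way to read "additively" here is that the info segment is appended to whatever the existing renderer already produced, and nothing is emitted when there are no info messages. The sketch below assumes a comma separator and a single leading space, neither of which is confirmed by the source.

```scala
object AdditiveRenderSketch {
  // `existing` is the node text as already rendered, including any [COMET: ...] segment.
  // It is passed through untouched so plans without info messages stay byte-identical.
  def render(existing: String, info: Seq[String]): String =
    if (info.isEmpty) existing
    else s"$existing [COMET-INFO: ${info.mkString(", ")}]"
}
```

Because the existing text is treated as opaque, even an empty `[COMET: ]` segment would survive unchanged, which matches dropping the incidental cleanup.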
mbutrovich
approved these changes
Jun 24, 2026
mbutrovich
left a comment
Contributor
Seems helpful, thanks @andygrove!
Member
Author
Merged. Thanks @mbutrovich
Which issue does this PR close?
Closes #4006.
Rationale for this change
Several Comet expressions have two execution paths: a JVM codegen-dispatch path that runs Spark's own generated code inside the Comet pipeline and is always Spark-compatible, and a faster native (Rust) implementation that differs from Spark for some inputs and is gated behind a config. Comet runs the compatible path by default, but a user has no way to discover, for a given query, that a faster native opt-in exists short of reading the compatibility guide. Comet also had only one way to annotate a plan node (a fallback reason), so there was no channel for a purely informational hint.
This PR frames these expressions as "compatible by default, opt into native if you accept the documented differences" and surfaces that opt-in directly in the query plan.
What changes are included in this PR?
- `CometSparkSessionExtensions.withInfo` records messages on a new `EXTENSION_INFO` tag, and verbose extended explain renders them as a `[COMET-INFO: ...]` segment, distinct from the `[COMET: ...]` fallback segment. `CometExecRule` rolls expression-level info up onto the operator node so it appears in explain output.
- `SupportLevel.Compatible` gains an optional `nativeOptIn` field (`NativeOptIn(configKey)`) with a shared message builder so the docs and the runtime hint cannot drift. A `NativeOptInAvailable` marker trait (extended by `CodegenDispatchFallback`) is the single static signal used for docs detection.
- `QueryPlanSerde` emits the hint centrally in the two existing dispatch branches, with no per-serde imperative calls. The hint shows only when enabling the config would actually switch that specific expression instance to native: RLike only with a literal pattern, RegExpReplace only with offset 1, date_format only for non-UTC sessions, and so on. A single shared predicate per input-dependent serde drives both `getSupportLevel` and `convert()`.
- `getIncompatibleReasons` are restored so the compatibility guide documents the native differences users accept when they opt in. `GenerateDocs` renders a compatible-by-default header for these expressions, keyed off the marker, and derives the gating config key (so Upper/Lower show `spark.comet.caseConversion.enabled`). The `getIncompatibleReasons` scaladoc is clarified, and a short "compatible by default, opt in to native" narrative is added to the compatibility index page.
How are these changes tested?
`CometExpressionSuite`: the info channel renders `[COMET-INFO]` without setting a fallback reason and accumulates rather than overwriting; the hint appears on the codegen-dispatch path (Hour on TimestampNTZ); per-instance precision is verified (RLike with a literal pattern shows the hint, a non-literal pattern does not; date_format shows it for a non-UTC session and is suppressed for UTC, where native already runs); the hint uses the dedicated `spark.comet.caseConversion.enabled` key for Upper.
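The first property tested above (info accumulates and never sets a fallback reason) can be illustrated with an immutable model. This is a sketch under stated assumptions: `Node` and this `withInfo` are hypothetical, and the real implementation records messages on a plan-node tag rather than copying a case class.

```scala
object InfoChannelSketch {
  // Hypothetical plan node carrying two independent annotation channels.
  final case class Node(
      name: String,
      fallbackReasons: Vector[String] = Vector.empty,
      info: Vector[String] = Vector.empty)

  // Appends to the info channel only; repeated messages are kept once,
  // and the fallback channel is left untouched.
  def withInfo(node: Node, message: String): Node =
    node.copy(info = (node.info :+ message).distinct)
}
```

Keeping the two channels separate is what allows a node to carry a purely informational hint without being reported as a fallback.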