feat: surface native opt-in expressions as compatible-by-default with a COMET-INFO plan hint #4721

Merged
andygrove merged 15 commits into apache:main from andygrove:native-opt-in-info
Jun 25, 2026

@andygrove andygrove commented Jun 24, 2026

Member

Which issue does this PR close?

Closes #4006.

Rationale for this change

Several Comet expressions have two execution paths: a JVM codegen-dispatch path that runs Spark's own generated code inside the Comet pipeline and is always Spark-compatible, and a faster native (Rust) implementation that differs from Spark for some inputs and is gated behind a config. Comet runs the compatible path by default, but a user has no way to discover, for a given query, that a faster native opt-in exists short of reading the compatibility guide. Comet also had only one way to annotate a plan node (a fallback reason), so there was no channel for a purely informational hint.

This PR frames these expressions as "compatible by default, opt into native if you accept the documented differences" and surfaces that opt-in directly in the query plan.

What changes are included in this PR?

  • A non-fallback informational channel: CometSparkSessionExtensions.withInfo records messages on a new EXTENSION_INFO tag, and verbose extended explain renders them as a [COMET-INFO: ...] segment, distinct from the [COMET: ...] fallback segment. CometExecRule rolls expression-level info up onto the operator node so it appears in explain output.
  • SupportLevel.Compatible gains an optional nativeOptIn field (NativeOptIn(configKey)) with a shared message builder so the docs and the runtime hint cannot drift. A NativeOptInAvailable marker trait (extended by CodegenDispatchFallback) is the single static signal used for docs detection.
  • QueryPlanSerde emits the hint centrally in the two existing dispatch branches, with no per-serde imperative calls. The hint shows only when enabling the config would actually switch that specific expression instance to native: RLike only with a literal pattern, RegExpReplace only with offset 1, date_format only for non-UTC sessions, and so on. A single shared predicate per input-dependent serde drives both getSupportLevel and convert().
  • 12 expressions (11 serdes, since Upper and Lower share CometCaseConversionBase) now advertise the opt-in: RLike, RegExpReplace, StringSplit, InitCap, Upper, Lower, StringReplace, GetJsonObject, LengthOfJsonArray, StructsToJson, JsonToStructs, and DateFormat. Their getIncompatibleReasons are restored so the compatibility guide documents the native differences users accept when they opt in.
  • GenerateDocs renders a compatible-by-default header for these expressions, keyed off the marker, and derives the gating config key (so Upper/Lower show spark.comet.caseConversion.enabled). The getIncompatibleReasons scaladoc is clarified, and a short "compatible by default, opt in to native" narrative is added to the compatibility index page.
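
To make the first bullet concrete, here is a minimal sketch of an info channel that lives next to, and never touches, the fallback channel. It is a toy model, not Comet's code: `PlanNode`, `InfoChannel`, `withFallback`, the use of ordered sets, and the `", "` separator are all assumptions made for illustration. Only the `withInfo` name and the `[COMET: ...]` / `[COMET-INFO: ...]` segment labels come from the description above.

```scala
import scala.collection.mutable

// Hypothetical stand-in for a plan node; the real code stores tags on Spark tree nodes.
final class PlanNode(val name: String) {
  // Two independent channels, so an info message never implies a fallback.
  val fallbackReasons: mutable.LinkedHashSet[String] = mutable.LinkedHashSet.empty
  val infoMessages: mutable.LinkedHashSet[String] = mutable.LinkedHashSet.empty
}

object InfoChannel {
  // Record an informational message; repeated calls accumulate instead of overwriting.
  def withInfo(node: PlanNode, message: String): PlanNode = {
    node.infoMessages += message
    node
  }

  // Record a fallback reason on the separate fallback channel.
  def withFallback(node: PlanNode, reason: String): PlanNode = {
    node.fallbackReasons += reason
    node
  }

  // Render each segment only when it has content, so nodes without info are unchanged.
  def render(node: PlanNode): String = {
    val fallback =
      if (node.fallbackReasons.isEmpty) ""
      else node.fallbackReasons.mkString(" [COMET: ", ", ", "]")
    val info =
      if (node.infoMessages.isEmpty) ""
      else node.infoMessages.mkString(" [COMET-INFO: ", ", ", "]")
    node.name + fallback + info
  }
}

val project = new PlanNode("Project")
InfoChannel.withInfo(project, "native Upper available")
InfoChannel.withInfo(project, "native RLike available")
val renderedProject = InfoChannel.render(project)
```

In this sketch `renderedProject` carries both info messages in one `[COMET-INFO: ...]` segment while `project.fallbackReasons` stays empty, which mirrors the "accumulates rather than overwriting" and "without setting a fallback reason" properties listed under testing.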

How are these changes tested?

  • New unit and explain tests in CometExpressionSuite: the info channel renders [COMET-INFO] without setting a fallback reason and accumulates rather than overwriting; the hint appears on the codegen-dispatch path (Hour on TimestampNTZ); per-instance precision is verified (RLike with a literal pattern shows the hint, a non-literal pattern does not; date_format shows it for a non-UTC session and is suppressed for UTC, where native already runs); the hint uses the dedicated spark.comet.caseConversion.enabled key for Upper.
  • Plan-stability golden files regenerated for Spark 3.5.
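
The per-instance precision checked for RLike can be pictured with a small sketch. The expression classes below are hypothetical stand-ins for Spark's, and `RLikeHint` is an invented name; the only rule taken from this PR is that the hint shows for a literal pattern and not for a non-literal one.

```scala
// Hypothetical mini expression tree; Spark's real classes are richer.
sealed trait Expr
final case class Literal(value: String) extends Expr
final case class ColumnRef(name: String) extends Expr
final case class RLike(subject: Expr, pattern: Expr) extends Expr

object RLikeHint {
  // One shared predicate: would the native path handle this specific instance?
  def nativeApplicable(e: RLike): Boolean = e.pattern match {
    case Literal(_) => true
    case _          => false
  }

  // Show the hint only if flipping the config would actually change the path taken.
  def showHint(e: RLike, allowIncompat: Boolean): Boolean =
    nativeApplicable(e) && !allowIncompat
}

val literalPattern = RLike(ColumnRef("s"), Literal("^a.*"))
val columnPattern = RLike(ColumnRef("s"), ColumnRef("p"))
val hintForLiteral = RLikeHint.showHint(literalPattern, allowIncompat = false)
val hintForColumn = RLikeHint.showHint(columnPattern, allowIncompat = false)
```

Keeping `nativeApplicable` as the single source of truth is the point: if support-level reporting and conversion each had their own copy of the check, the hint could promise a switch that conversion would not make.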

andygrove added 15 commits June 24, 2026 10:38
Wire two dispatch branches in QueryPlanSerde.convert to emit [COMET-INFO]
hints via CometSparkSessionExtensions.withInfo when a faster native
implementation is available but not currently selected.

- Compatible branch: emit hint when nativeOptIn field is set
- CodegenDispatchFallback branch: emit hint after successful dispatch,
  using nativeOptInConfigKeyOverride or the per-expression allowIncompatible key

Test: hour(timestamp_ntz) routes through CodegenDispatchFallback and the
plan explain now carries [COMET-INFO: A native implementation of Hour...].
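
A rough sketch of the Compatible branch described in this commit: the support level carries an optional opt-in, and a central dispatcher turns it into a hint through one shared message builder. This is a simplified model under stated assumptions. `Dispatch`, `hintsFor`, `Other`, and the message wording are invented for illustration; the real `SupportLevel` has more cases and the real message text is not reproduced here.

```scala
// Opt-in descriptor; NativeOptIn(configKey) is named in the PR, the message text is illustrative.
final case class NativeOptIn(configKey: String) {
  // Single builder, so docs and runtime hint are produced from the same template.
  def message(exprName: String): String =
    s"$exprName has a faster native implementation; set $configKey=true to use it"
}

// Simplified model: only the case relevant to the hint, plus a placeholder for the rest.
sealed trait SupportLevel
final case class Compatible(nativeOptIn: Option[NativeOptIn] = None) extends SupportLevel
case object Other extends SupportLevel // hypothetical stand-in for all remaining levels

object Dispatch {
  // Central emission: serdes only declare the opt-in, they never call the hint API themselves.
  def hintsFor(exprName: String, level: SupportLevel): Seq[String] = level match {
    case Compatible(Some(optIn)) => Seq(optIn.message(exprName))
    case _                       => Seq.empty
  }
}

val hourLevel = Compatible(Some(NativeOptIn("spark.comet.expression.Hour.allowIncompatible")))
val hourHints = Dispatch.hintsFor("Hour", hourLevel)
```

The config key follows the `spark.comet.expression.<Name>.allowIncompatible` pattern mentioned elsewhere in this PR; whether `Hour` uses exactly that key is an assumption.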

Add NativeOptInAvailable to CometLengthOfJsonArray (json.scala),
CometGetJsonObject (strings.scala), CometStructsToJson and
CometJsonToStructs (structs.scala), and CometDateFormat (datetime.scala).

Each serde now returns Compatible(nativeOptIn = Some(...)) when the
native path is applicable but the user has not yet enabled
allowIncompatible, surfacing a COMET-INFO hint in EXPLAIN output.
CometDateFormat factors a nativeApplicable() predicate shared between
getSupportLevel() and convert() to avoid duplicating the format-whitelist
check.

When the session timezone is UTC, date_format already routes to the native
to_char path regardless of the per-expression allowIncompatible config. The
previous getSupportLevel hint condition (!isExprAllowIncompat && nativeApplicable)
omitted the isUtc term, so it wrongly told the user to flip a config that would
have no effect. Factor isUtc into a shared private helper and tighten the hint
condition to nativeApplicable && !isUtc && !isExprAllowIncompat. Extend the
regression test to assert the UTC true-negative (no hint) alongside the existing
non-UTC positive case.
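
The tightened condition is easy to check as plain boolean logic. The sketch below is a model, not the serde itself: the format whitelist, the string-based `isUtc`, and the `usesNative` routing rule are assumptions inferred from this commit message (native runs when the format is applicable and either the session is UTC or the opt-in is set).

```scala
object DateFormatHint {
  // Hypothetical whitelist; the real set of natively supported formats is not listed here.
  private val nativeFormats: Set[String] = Set("yyyy-MM-dd", "yyyy-MM-dd HH:mm:ss")

  // Simplified UTC check on the session timezone string.
  def isUtc(sessionTimeZone: String): Boolean = sessionTimeZone.equalsIgnoreCase("UTC")

  def nativeApplicable(format: String): Boolean = nativeFormats.contains(format)

  // Assumed routing: UTC sessions go native regardless of the opt-in config.
  def usesNative(format: String, tz: String, allowIncompat: Boolean): Boolean =
    nativeApplicable(format) && (isUtc(tz) || allowIncompat)

  // Fixed hint condition: nativeApplicable && !isUtc && !isExprAllowIncompat.
  def showHint(format: String, tz: String, allowIncompat: Boolean): Boolean =
    nativeApplicable(format) && !isUtc(tz) && !allowIncompat
}

// Property the fix restores: whenever the hint shows, flipping the config changes the path.
val hintIsAlwaysActionable: Boolean = (for {
  format <- Seq("yyyy-MM-dd", "EEEE")
  tz <- Seq("UTC", "America/Denver")
  allow <- Seq(true, false)
  if DateFormatHint.showHint(format, tz, allow)
} yield !DateFormatHint.usesNative(format, tz, allow) &&
  DateFormatHint.usesNative(format, tz, allowIncompat = true)).forall(identity)
```

With the old condition (no `!isUtc` term) the same property would fail for a UTC session, because native is already in use there and the config flip changes nothing.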

Replace the `codegenDispatch: Boolean` field on `ExprNotes` with
`nativeOptIn: Boolean` + `nativeOptInConfigKey: String`. Detect
native opt-in via `NativeOptInAvailable` (which `CodegenDispatchFallback`
extends), covering both codegen-dispatch expressions and Pattern B serdes.
Derive the config key from `nativeOptInConfigKeyOverride` when present.

Update the incompatibilities header to read "By default, Comet runs a
Spark-compatible implementation of X. Set <key>=true to use Comet's
faster native implementation instead..." instead of the old
JVM-codegen-dispatch-specific wording.

Clarify `getIncompatibleReasons` scaladoc to note these reasons describe
the native path only; for NativeOptInAvailable expressions the default
path is always Spark-compatible.

Regenerate docs/source/user-guide/latest/compatibility/expressions/spark-3.5/
and configs.md for Spark 3.5. Other Spark version directories (3.4, 4.0, 4.1)
are left for CI to regenerate on publish.
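
As an illustration of the docs side, here is a small sketch of deriving the gating key and rendering the header. `DocsHeader`, `configKey`, and `header` are hypothetical names, and `ExprNotes` is reduced to the two fields this commit mentions plus a name. The header text is the portion quoted above; the commit message elides the rest of the sentence, so it is not reproduced.

```scala
// Reduced model of ExprNotes: only the fields named in this commit, plus the expression name.
final case class ExprNotes(name: String, nativeOptIn: Boolean, nativeOptInConfigKey: String)

object DocsHeader {
  // An explicit override wins; otherwise fall back to the per-expression key pattern.
  def configKey(exprName: String, keyOverride: Option[String]): String =
    keyOverride.getOrElse(s"spark.comet.expression.$exprName.allowIncompatible")

  // Only expressions flagged as native opt-in get the compatible-by-default header.
  def header(notes: ExprNotes): Option[String] =
    if (notes.nativeOptIn)
      Some(
        s"By default, Comet runs a Spark-compatible implementation of ${notes.name}. " +
          s"Set ${notes.nativeOptInConfigKey}=true to use Comet's faster native implementation instead.")
    else None
}

val upperKey = DocsHeader.configKey("Upper", Some("spark.comet.caseConversion.enabled"))
val upperHeader = DocsHeader.header(ExprNotes("Upper", nativeOptIn = true, upperKey))
```

This is how Upper and Lower can end up documented with `spark.comet.caseConversion.enabled` rather than a per-expression key.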

Add getIncompatibleReasons() to 11 NativeOptInAvailable serdes so the
compatibility guide documents the behavioral differences users accept
when enabling the native path: RLike, RegExpReplace, StringSplit,
InitCap, Upper/Lower (CometCaseConversionBase), StringReplace,
GetJsonObject, LengthOfJsonArray, StructsToJson, JsonToStructs, and
DateFormat. Regenerate spark-3.5 compatibility markdown.

Replace generic "Native and JVM results may differ for some inputs" with
the specific "Produces different results from Spark when the search string
is empty", matching the inline comment in convert().

…ty golden files

Add a "Compatible by default, opt in to native" section to the compatibility
index page explaining that Comet runs a Spark-compatible path by default,
some expressions have a faster native opt-in via
spark.comet.expression.<Name>.allowIncompatible=true, and how the
[COMET-INFO: ...] plan segment (distinct from [COMET: ...] fallback) surfaces
available opt-in paths in verbose extended explain output.

Fix a compilation error in the withInfo test: withSQLConf returns Unit in
Spark 3.5 test infrastructure, so assign the rendered string inside the closure
rather than capturing the closure's return value.
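
The compilation fix follows a common pattern for helpers whose body is typed `=> Unit`. The sketch below uses a toy `withSQLConf` with a matching shape (`ConfTestSupport` and its in-memory map are assumptions; the real helper lives in Spark's test infrastructure) to show why the result has to be assigned inside the closure.

```scala
object ConfTestSupport {
  // Toy config store standing in for the session's SQL conf.
  private var conf: Map[String, String] = Map.empty

  def getConf(key: String): Option[String] = conf.get(key)

  // Same shape as the helper described above: the body returns Unit, and so does the call.
  def withSQLConf(pairs: (String, String)*)(body: => Unit): Unit = {
    val saved = conf
    conf = conf ++ pairs
    try {
      body
    } finally {
      conf = saved // restore previous settings even if the body throws
    }
  }
}

// `val rendered = withSQLConf(...) { ... }` would have type Unit, so assign inside instead.
var rendered: String = ""
ConfTestSupport.withSQLConf("spark.comet.caseConversion.enabled" -> "false") {
  val value = ConfTestSupport.getConf("spark.comet.caseConversion.enabled").getOrElse("unset")
  rendered = "caseConversion=" + value
}
```

After the block runs, `rendered` holds the value computed under the temporary setting, and the setting itself has been restored.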

Regenerate Spark 3.5 plan-stability golden files: four queries (q24a, q24b, q9
for v1.4; q24 for v2.7) now have spark3.5-specific extended.txt files because
they contain [COMET-INFO] hints for Upper (spark.comet.caseConversion.enabled).
Seven previously redundant spark3.5 overrides (q28, q61, q77, q88, q90 for
v1.4; q22, q77a for v2.7) are removed because their content was identical to
the base approved-plans directory.

Both CometTPCDSV1_4_PlanStabilitySuite and CometTPCDSV2_7_PlanStabilitySuite
pass with the regenerated files.

… [skip ci]

Render the [COMET-INFO] segment additively without changing how the existing
[COMET: ...] fallback segment renders, so plans that carry no info message are
unchanged. Drops an incidental empty-[COMET: ] cleanup that perturbed an
unrelated plan-stability golden file.

@mbutrovich mbutrovich left a comment

Contributor

Seems helpful, thanks @andygrove!

@andygrove andygrove merged commit 46b5841 into apache:main Jun 25, 2026
70 checks passed
@andygrove andygrove deleted the native-opt-in-info branch June 25, 2026 13:50

Member Author

Merged. Thanks @mbutrovich

Development

Successfully merging this pull request may close these issues.

Add distinction between "info" and "fallback" messages
