65 commits
9cd1566
docs: add implement-comet-expression Claude skill
andygrove Apr 30, 2026
953cb86
docs: reference PR template and add skill-acknowledgement note
andygrove Apr 30, 2026
422d2b3
docs: check datafusion-spark crate before writing native code
andygrove Apr 30, 2026
88f2331
Merge branch 'add-implement-expression-skill'
andygrove Apr 30, 2026
eb8aa14
feat: add CometUDF trait for JVM-side scalar UDFs
andygrove May 1, 2026
60a2ecd
feat: add RegExpLikeUDF using java.util.regex.Pattern
andygrove May 1, 2026
633b75e
feat: add CometUdfBridge JNI entry point for native UDF dispatch
andygrove May 1, 2026
1c64070
feat: add JvmScalarUdf proto message for JVM UDF dispatch
andygrove May 1, 2026
8f78436
feat: register CometUdfBridge in JVMClasses for native UDF dispatch
andygrove May 1, 2026
cf233d5
feat: add JvmScalarUdfExpr PhysicalExpr that dispatches to JVM via JNI
andygrove May 1, 2026
d8ab411
feat: wire JvmScalarUdf proto into native planner
andygrove May 1, 2026
4970c9c
feat: add spark.comet.exec.regexp.useJVM config
andygrove May 1, 2026
54ddd50
feat: route RLike through JVM UDF when spark.comet.exec.regexp.useJVM…
andygrove May 1, 2026
0a942ad
test: add end-to-end suite for JVM-backed RLike
andygrove May 1, 2026
fbfc158
fix: use project-wide CometArrowAllocator in RegExpLikeUDF
andygrove May 1, 2026
909ab91
docs: correct CometUdfBridge thread cache lifetime comment
andygrove May 1, 2026
862ed2e
docs: document from_ffi consumption invariant in JvmScalarUdfExpr
andygrove May 1, 2026
a943de5
style: apply make format
andygrove May 1, 2026
e1b9b2a
docs: mark spark.comet.exec.regexp.useJVM experimental and generalize…
andygrove May 1, 2026
76418c6
test: add CometRegExpBenchmark covering all rlike modes
andygrove May 1, 2026
8ac45be
ci: register new RLike JVM-bridge test suites in PR workflows
andygrove May 1, 2026
a1f8ecf
build: exclude docs/superpowers from rat and git
andygrove May 1, 2026
23a9e52
remove skill
andygrove May 1, 2026
1c66f44
refactor: rename regexp.useJVM boolean to regexp.engine enum (rust|java)
andygrove May 1, 2026
56327ed
fix: ensure UDF bridge inputs/result close on every path and resolve …
andygrove May 1, 2026
fee5ab2
fix: validate regex pattern at convert time so invalid or null patter…
andygrove May 1, 2026
7d0f25c
fix: tolerate missing CometUdfBridge class at JVMClasses init
andygrove May 1, 2026
2a43867
refactor: introduce REGEXP_ENGINE_RUST/REGEXP_ENGINE_JAVA constants
andygrove May 1, 2026
760cd94
perf: send scalar UDF arguments as length-1 vectors
andygrove May 1, 2026
85029c5
test: cover empty and all-null subject vectors in RegExpLikeUDF unit …
andygrove May 1, 2026
a16f336
feat: propagate result nullability through JvmScalarUdf proto
andygrove May 1, 2026
5937650
fix: validate UDF result row count matches longest input
andygrove May 1, 2026
1dd81fb
fix: qualify CometRLike incompat reasons by engine config
andygrove May 1, 2026
42462c3
fix: bound UDF and pattern caches with LRU eviction
andygrove May 1, 2026
8073cf3
test: stop using per-test RootAllocator in RegExpLikeUDFSuite
andygrove May 1, 2026
ce01339
test: remove RegExpLikeUDFSuite due to shading boundary
andygrove May 1, 2026
eb544d6
Merge remote-tracking branch 'apache/main' into prototype-jvm-scalar-udf
andygrove May 6, 2026
4683199
feat: add all Spark regexp expressions via JVM UDF framework
andygrove May 6, 2026
6cac094
docs: update regexp compatibility guide for java vs rust engine
andygrove May 6, 2026
1ad838b
Merge remote-tracking branch 'apache/main' into java-regexp
andygrove May 8, 2026
250b469
fix: use ConcurrentHashMap for pattern cache in regexp UDFs
andygrove May 8, 2026
941d9c7
refactor: use computeIfAbsent for pattern cache lookup
andygrove May 8, 2026
336ec6e
Merge remote-tracking branch 'apache/main' into worktree-pr-4239-rege…
andygrove May 12, 2026
ea939ce
fix: default regexp engine back to rust, mark java engine experimental
andygrove May 12, 2026
5e18c62
style: prettier format regex compatibility docs
andygrove May 12, 2026
8b92370
style: drop unused idx binding in RegExpInStrUDF to fix scalafix lint
andygrove May 12, 2026
ca6628b
style: drop unused idx bindings in regexp serde to fix scalafix lint
andygrove May 12, 2026
c4e88fb
test: set regexp engine to java in SQL tests that need it
andygrove May 13, 2026
b55adb0
Merge remote-tracking branch 'apache/main' into java-regexp
andygrove May 13, 2026
0fa237f
fix: update regexp UDFs to new CometUDF.evaluate(inputs, numRows) sig…
andygrove May 14, 2026
f6b4096
Merge branch 'main' of github.com:apache/datafusion-comet into worktr…
andygrove May 19, 2026
2eb06c9
feat: gate JVM UDF framework behind spark.comet.jvmUdf.enabled
andygrove May 19, 2026
5dd2398
refactor: simplify regexp engine config to {rust, java}, default java
andygrove May 20, 2026
be487f1
refactor: surface engine=rust as the optedInBy opt-in for regex
andygrove May 20, 2026
0f21e19
fix: address CI failures for java-regexp PR
andygrove May 21, 2026
29428e5
Merge apache/main into java-regexp
andygrove May 26, 2026
bec171e
refactor: route regex expressions through codegen dispatcher instead …
andygrove May 26, 2026
3a13aa7
test: route rlike non-scalar-pattern fallback test through engine=rust
andygrove May 27, 2026
8c9deeb
feat: default JVM UDF codegen dispatcher to enabled
andygrove May 28, 2026
4ead54d
feat: under engine=rust, fall through to JVM dispatcher for unimpleme…
andygrove May 28, 2026
bb2c641
style: prettier-format regex compatibility doc
andygrove May 28, 2026
ca61034
style: drop unused interpolator on CometConf regexp engine doc
andygrove May 28, 2026
7f22f92
Merge branch 'main' into java-regexp
andygrove May 28, 2026
57c471c
revert: default JVM UDF codegen dispatcher back to disabled
andygrove May 29, 2026
7be7783
docs: drop experimental language from regex compatibility guide
andygrove May 29, 2026
1 change: 1 addition & 0 deletions .github/workflows/pr_build_linux.yml
@@ -408,6 +408,7 @@ jobs:
org.apache.comet.expressions.conditional.CometIfSuite
org.apache.comet.expressions.conditional.CometCoalesceSuite
org.apache.comet.expressions.conditional.CometCaseWhenSuite
org.apache.comet.CometRegExpJvmSuite
org.apache.comet.CometCodegenSuite
org.apache.comet.CometCodegenSourceSuite
org.apache.comet.CometCodegenHOFSuite
1 change: 1 addition & 0 deletions .github/workflows/pr_build_macos.yml
@@ -248,6 +248,7 @@ jobs:
org.apache.comet.expressions.conditional.CometIfSuite
org.apache.comet.expressions.conditional.CometCoalesceSuite
org.apache.comet.expressions.conditional.CometCaseWhenSuite
org.apache.comet.CometRegExpJvmSuite
org.apache.comet.CometCodegenSuite
org.apache.comet.CometCodegenSourceSuite
org.apache.comet.CometCodegenHOFSuite
1 change: 1 addition & 0 deletions .gitignore
@@ -27,3 +27,4 @@ output
docs/comet-*/
docs/build/
docs/temp/
docs/superpowers/
117 changes: 114 additions & 3 deletions docs/source/user-guide/latest/compatibility/regex.md
@@ -19,6 +19,117 @@ under the License.

# Regular Expressions

Comet uses the Rust regexp crate for evaluating regular expressions, and this has different behavior from Java's
regular expression engine. Comet will fall back to Spark for patterns that are known to produce different results, but
this can be overridden by setting `spark.comet.expression.regexp.allowIncompatible=true`.
Comet provides two engines for evaluating regular expressions: a **Rust engine** that uses the Rust
[`regex`] crate natively, and a **Java engine** that runs Spark's own `doGenCode` for the
expression inside Comet's Arrow-direct codegen dispatcher (the same dispatcher used by Comet's
`ScalaUDF` codegen path). The engine is selected with `spark.comet.exec.regexp.engine`, which accepts:

- `java` (default) — route through the Java engine for full Spark compatibility. Requires
`spark.comet.exec.scalaUDF.codegen.enabled=true`; otherwise regex expressions fall back to Spark with
an explanatory message.
- `rust` — run the Rust engine when an expression has a native implementation. Setting this is itself
the opt-in for the semantic differences between Java and Rust regex (no separate `allowIncompatible`
flag needed). Expressions without a native Rust implementation (`regexp_extract`,
`regexp_extract_all`, `regexp_instr`) fall through to the Java engine so users still get Comet
acceleration with full Spark semantics.

With `engine=java` and `scalaUDF.codegen.enabled=true`, all regex expressions run on the Comet
path with full Spark compatibility.
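
The selection rules above can be summarized as a small decision function. The following is a minimal Java sketch, not Comet's implementation: the class `RegexEngineRouting` and its `route` method are hypothetical names, and the sketch assumes that the `engine=rust` fallthrough needs the codegen dispatcher enabled in the same way `engine=java` does, which this guide does not state explicitly.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical model of the routing rules in this guide; not Comet's actual code.
class RegexEngineRouting {
    // Expressions the guide lists as having a native Rust implementation.
    static final List<String> RUST_NATIVE = Arrays.asList("rlike", "regexp_replace", "split");

    // Returns where the expression is evaluated: "rust", "java", or "spark".
    static String route(String expression, String engine, boolean codegenEnabled) {
        String e = engine.toLowerCase(Locale.ROOT); // the config value is lower-cased before validation
        if (!e.equals("rust") && !e.equals("java")) {
            throw new IllegalArgumentException("engine must be rust or java: " + engine);
        }
        if (e.equals("rust") && RUST_NATIVE.contains(expression)) {
            return "rust";
        }
        // Either engine=java, or engine=rust falling through for an expression
        // without a native implementation.
        // ASSUMPTION: the fallthrough needs the dispatcher enabled, like engine=java.
        return codegenEnabled ? "java" : "spark";
    }

    public static void main(String[] args) {
        System.out.println(route("rlike", "rust", false));       // rust
        System.out.println(route("regexp_instr", "rust", true)); // java
        System.out.println(route("rlike", "java", true));        // java
        System.out.println(route("rlike", "java", false));       // spark
    }
}
```

The sketch ignores the per-expression `enabled` flags and any pattern-level checks; it only models the two settings discussed in this section.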

## Disabling Comet for individual regex expressions

Each regex expression has a per-class `spark.comet.expression.<ClassName>.enabled` flag (default
`true`) that disables Comet's serde for that expression and forces a Spark fallback. This is
useful for narrowing a regression or comparing performance on a single operator without changing
the engine selector:

| Expression | Config |
| -------------------- | ------------------------------------------------------- |
| `rlike` | `spark.comet.expression.RLike.enabled=false` |
| `regexp_extract` | `spark.comet.expression.RegExpExtract.enabled=false` |
| `regexp_extract_all` | `spark.comet.expression.RegExpExtractAll.enabled=false` |
| `regexp_instr` | `spark.comet.expression.RegExpInStr.enabled=false` |
| `regexp_replace` | `spark.comet.expression.RegExpReplace.enabled=false` |
| `split` | `spark.comet.expression.StringSplit.enabled=false` |
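
The keys in the table all follow one pattern, so they can be derived from the Spark expression class name. Here is a small Java sketch of that mapping; `RegexExpressionFlags` and `disableKey` are hypothetical helper names for illustration, and only the key strings themselves come from the table above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper; only the config key format comes from the guide.
class RegexExpressionFlags {
    // SQL function name -> Spark expression class name, as listed in the table.
    static final Map<String, String> CLASS_BY_FUNCTION = new LinkedHashMap<>();
    static {
        CLASS_BY_FUNCTION.put("rlike", "RLike");
        CLASS_BY_FUNCTION.put("regexp_extract", "RegExpExtract");
        CLASS_BY_FUNCTION.put("regexp_extract_all", "RegExpExtractAll");
        CLASS_BY_FUNCTION.put("regexp_instr", "RegExpInStr");
        CLASS_BY_FUNCTION.put("regexp_replace", "RegExpReplace");
        CLASS_BY_FUNCTION.put("split", "StringSplit");
    }

    // Builds the per-expression config key for a regex SQL function.
    static String disableKey(String sqlFunction) {
        String cls = CLASS_BY_FUNCTION.get(sqlFunction);
        if (cls == null) {
            throw new IllegalArgumentException("not a regex expression: " + sqlFunction);
        }
        return "spark.comet.expression." + cls + ".enabled";
    }

    public static void main(String[] args) {
        // Print one "key=false" setting per regex expression.
        for (String fn : CLASS_BY_FUNCTION.keySet()) {
            System.out.println(disableKey(fn) + "=false");
        }
    }
}
```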

## Choosing an engine

| | Rust engine | Java engine (default) |
| -------------------- | ------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------- |
| **Compatibility** | Differs from Java regex (see below) | 100% compatible with Spark |
| **Feature coverage** | `rlike`, `regexp_replace`, `split` natively; `regexp_extract`, `regexp_extract_all`, `regexp_instr` via fallthrough | All regexp expressions (`rlike`, `regexp_extract`, `regexp_extract_all`, `regexp_instr`, `regexp_replace`, `split`) |
| **Performance** | Fully native, no JNI overhead | One JNI round-trip per batch (Arrow vectors stay columnar) |
| **Pattern support** | Linear-time subset only | All Java regex features (backreferences, lookaround, etc.) |

The **Rust engine** is faster but cannot match Java regex semantics for every pattern. Because the engine
choice is itself the opt-in, setting `spark.comet.exec.regexp.engine=rust` declares acceptance of those
differences without a separate per-expression flag.

The **Java engine** is the default and is gated behind `spark.comet.exec.scalaUDF.codegen.enabled`
so the codegen dispatcher can be disabled globally without changing the regex engine selector.

## Why the engines differ

Java's [`java.util.regex`] is a backtracking engine in the Perl/PCRE family. It supports the full range of
features that style of engine provides, including some whose worst-case running time grows exponentially with
the length of the input.

Rust's [`regex`] crate is a finite-automaton engine in the [RE2] family. It deliberately omits features that
cannot be implemented with a guarantee of linear-time matching. In exchange, every pattern it does accept runs
in time linear in the size of the input. This is the same trade-off RE2, Go's `regexp`, and several other
engines make.

The practical consequence is that Java accepts a strictly larger set of patterns than the Rust engine, and
several constructs that look the same in source have different semantics on the two sides.

## Features supported by Java but not by the Rust engine

Patterns that use any of the following will not compile in Comet's Rust engine and must run on Spark (or use
the Java engine):

- **Backreferences** such as `\1`, `\2`, or `\k<name>`. The Rust engine has no backtracking and cannot match
a previously captured group.
- **Lookaround**, including lookahead (`(?=...)`, `(?!...)`) and lookbehind (`(?<=...)`, `(?<!...)`).
- **Atomic groups** (`(?>...)`).
- **Possessive quantifiers** (`*+`, `++`, `?+`, `{n,m}+`). Rust supports greedy and lazy quantifiers but not
possessive.
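- **Embedded code, conditionals, and recursion** such as `(?(cond)yes|no)` or `(?R)`. These are Perl/PCRE
extensions rather than `java.util.regex` features, so Java rejects them as well, and the Rust engine does not
implement them either.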
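
To make the list concrete, the sketch below exercises several of these constructs with `java.util.regex` only. It shows the Java side of the comparison; the class name `JavaOnlyRegexFeatures` and the sample inputs are illustrative assumptions, and nothing here runs the Rust engine.

```java
import java.util.regex.Pattern;

// Illustrative only: demonstrates Java regex constructs the guide lists as
// unavailable in the Rust engine.
class JavaOnlyRegexFeatures {
    // True if the pattern matches anywhere in the input.
    static boolean find(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        // Backreference: \1 must repeat whatever group 1 captured.
        System.out.println(find("(\\w)\\1", "hello"));          // true ("ll")
        System.out.println(find("(\\w)\\1", "world"));          // false

        // Lookahead: "foo" only when followed by "bar", without consuming "bar".
        System.out.println(find("foo(?=bar)", "foobar"));       // true
        System.out.println(find("foo(?=bar)", "foobaz"));       // false

        // Lookbehind: digits only when preceded by a dollar sign.
        System.out.println(find("(?<=\\$)\\d+", "cost: $42"));  // true

        // Atomic group and possessive quantifier: the a's are never given back,
        // so the trailing "ab" can no longer match.
        System.out.println(find("(?>a+)ab", "aaab"));           // false
        System.out.println(find("a++ab", "aaab"));              // false
        System.out.println(find("(?:a+)ab", "aaab"));           // true (ordinary greedy group backtracks)
    }
}
```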

## Features that exist on both sides but behave differently

Even where both engines accept a construct, the matching behavior is not always the same.

- **Unicode-aware character classes.** In the Rust engine, `\d`, `\w`, `\s`, and `.` are Unicode-aware by
default, so `\d` matches every digit codepoint defined by Unicode rather than only `0`-`9`. Java's defaults
match ASCII only and require the `UNICODE_CHARACTER_CLASS` flag (or `(?U)` inline) to switch to Unicode
semantics. The same pattern can therefore match a different set of characters on each side.
- **Line terminators.** In multiline mode, Java treats `\r`, `\n`, `\r\n`, and a few additional Unicode line
separators as line boundaries by default. The Rust engine treats only `\n` as a line boundary unless CRLF
mode is enabled. `^`, `$`, and `.` (with `(?s)` off) all depend on this definition.
- **Case-insensitive matching.** Both engines support `(?i)`, but Java's default is ASCII case folding while
the Rust engine uses full Unicode simple case folding when Unicode mode is on. Patterns that match characters
outside ASCII can produce different results.
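- **POSIX character classes.** The Rust engine supports `[[:alpha:]]` style POSIX classes inside bracket
expressions but does not implement Java's ASCII-only `\p{Alpha}` shorthand. Java has no POSIX bracket syntax:
it parses `[[:alpha:]]` as a nested character class of the literal characters `:`, `a`, `l`, `p`, and `h`, so
the pattern compiles on both sides but matches different input. Unicode property escapes (`\p{L}`,
`\p{Greek}`, etc.) are supported by both engines but cover slightly different sets of properties.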
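- **Octal and Unicode escapes.** Java accepts `\0nnn` for octal and `\uXXXX` for a UTF-16 code unit. The Rust
`regex` crate leaves octal escapes disabled by default. It accepts `\uXXXX` as well as `\x{...}` for arbitrary
codepoints, but it works on Unicode scalar values, so a surrogate pair written as two `\uXXXX` escapes is not
valid there.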
- **Empty matches in `split`.** Spark's `StringSplit`, which is built on Java's regex, includes leading empty
strings produced by zero-width matches at the start of the input. The Rust engine's `split` follows different
rules, so split results can differ in edge cases involving empty matches even when the pattern itself is
identical on both sides.
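
The Java defaults described in the first three bullets can be observed directly. The following sketch uses only `java.util.regex`; the class name `JavaRegexDefaults` and the sample characters are illustrative assumptions. In each pair, the first call shows Java's default, and the second uses an inline flag to switch Java's behavior: to Unicode semantics for the first two pairs, and to `\n`-only line terminators for the third, which is the rule this guide attributes to the Rust engine.

```java
import java.util.regex.Pattern;

// Illustrative only: shows java.util.regex defaults and the inline flags that change them.
class JavaRegexDefaults {
    // True if the whole input matches the pattern.
    static boolean matches(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    // True if the pattern matches anywhere in the input.
    static boolean find(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        String arabicThree = "\u0663"; // ARABIC-INDIC DIGIT THREE, a non-ASCII decimal digit

        // Digit class: ASCII-only by default, Unicode-aware with (?U).
        System.out.println(matches("\\d", arabicThree));      // false
        System.out.println(matches("(?U)\\d", arabicThree));  // true

        // Case-insensitive matching: ASCII folding by default, Unicode folding with (?iu).
        System.out.println(matches("(?i)\u00e9", "\u00c9"));  // false
        System.out.println(matches("(?iu)\u00e9", "\u00c9")); // true

        // Multiline '$': a bare carriage return ends a line by default,
        // but not in UNIX_LINES mode (?d), where only a newline does.
        System.out.println(find("(?m)a$", "a\rb"));           // true
        System.out.println(find("(?md)a$", "a\rb"));          // false
    }
}
```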

## When the Rust engine is safe

For most ASCII-only, non-anchored patterns that use only literal characters, simple character classes, and
ordinary quantifiers, the two engines produce the same results. If you are confident your patterns fit this
shape and want to avoid the JNI overhead of the Java engine, switching to the Rust engine with
`spark.comet.exec.regexp.engine=rust` is generally safe.

For anything that uses backreferences, lookaround, or relies on Java's specific Unicode or line-handling
defaults, use the Java engine.

[`java.util.regex`]: https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html
[`regex`]: https://docs.rs/regex/latest/regex/
[RE2]: https://github.com/google/re2/wiki/Syntax
1 change: 1 addition & 0 deletions pom.xml
@@ -1170,6 +1170,7 @@ under the License.
<exclude>native/proto/src/generated/**</exclude>
<exclude>benchmarks/tpc/queries/**</exclude>
<exclude>.claude/**</exclude>
<exclude>docs/superpowers/**</exclude>
</excludes>
</configuration>
</plugin>
26 changes: 25 additions & 1 deletion spark/src/main/scala/org/apache/comet/CometConf.scala
@@ -369,10 +369,34 @@ object CometConf extends ShimCometConf {
"Arrow-direct codegen dispatcher. When enabled, a supported ScalaUDF is compiled into " +
"a per-batch kernel that reads and writes Arrow vectors directly from native " +
"execution. When disabled, plans containing a ScalaUDF fall back to Spark for the " +
"enclosing operator.")
"enclosing operator. The same dispatcher backs `spark.comet.exec.regexp.engine=java` " +
"so the regex family routes through it as well.")
.booleanConf
.createWithDefault(false)

val REGEXP_ENGINE_RUST = "rust"
val REGEXP_ENGINE_JAVA = "java"

val COMET_REGEXP_ENGINE: ConfigEntry[String] =
conf("spark.comet.exec.regexp.engine")
.category(CATEGORY_EXEC)
.doc(
"Selects the engine used to evaluate Spark regular-expression expressions. " +
s"`$REGEXP_ENGINE_JAVA` (default) routes through the Arrow-direct codegen dispatcher " +
"so Spark's own `doGenCode` (backed by `java.util.regex.Pattern`) runs inside the " +
s"Comet pipeline; this requires ${COMET_SCALA_UDF_CODEGEN_ENABLED.key}=true and " +
s"falls back to Spark otherwise. `$REGEXP_ENGINE_RUST` runs the " +
"native DataFusion regexp engine when an implementation exists; setting this is " +
"itself the opt-in for the semantic differences between Java and Rust regex. " +
"Expressions without a native Rust implementation (`regexp_extract`, " +
"`regexp_extract_all`, `regexp_instr`) fall through to the JVM codegen dispatcher " +
s"under `$REGEXP_ENGINE_RUST` so users still get Comet acceleration with full " +
"Spark semantics.")
.stringConf
.transform(_.toLowerCase(Locale.ROOT))
.checkValues(Set(REGEXP_ENGINE_RUST, REGEXP_ENGINE_JAVA))
.createWithDefault(REGEXP_ENGINE_JAVA)

val COMET_EXEC_SHUFFLE_WITH_HASH_PARTITIONING_ENABLED: ConfigEntry[Boolean] =
conf("spark.comet.native.shuffle.partitioning.hash.enabled")
.category(CATEGORY_SHUFFLE)
2 changes: 1 addition & 1 deletion spark/src/main/scala/org/apache/comet/GenerateDocs.scala
@@ -313,7 +313,7 @@ object GenerateDocs {
annotations += ((fromTypeName, toTypeName, note.trim.replace("(10,2)", "")))
}
"C"
case Incompatible(notes) =>
case Incompatible(notes, _) =>
notes.filter(_.trim.nonEmpty).foreach { note =>
annotations += ((fromTypeName, toTypeName, note.trim.replace("(10,2)", "")))
}
32 changes: 0 additions & 32 deletions spark/src/main/scala/org/apache/comet/expressions/RegExp.scala

This file was deleted.

@@ -723,7 +723,7 @@ case class CometExecRule(session: SparkSession)
case Unsupported(notes) =>
withInfo(op, notes.getOrElse(""))
false
case Incompatible(notes) =>
case Incompatible(notes, _) =>
val allowIncompat = CometConf.isOperatorAllowIncompat(opName)
val incompatConf = CometConf.getOperatorAllowIncompatConfigKey(opName)
if (allowIncompat) {
62 changes: 46 additions & 16 deletions spark/src/main/scala/org/apache/comet/serde/QueryPlanSerde.scala
@@ -186,6 +186,9 @@ object QueryPlanSerde extends Logging with CometExprShim with CometTypeShim {
classOf[Like] -> CometLike,
classOf[Lower] -> CometLower,
classOf[OctetLength] -> CometScalarFunction("octet_length"),
classOf[RegExpExtract] -> CometRegExpExtract,
classOf[RegExpExtractAll] -> CometRegExpExtractAll,
classOf[RegExpInStr] -> CometRegExpInStr,
classOf[RegExpReplace] -> CometRegExpReplace,
classOf[Reverse] -> CometReverse,
classOf[RLike] -> CometRLike,
@@ -580,23 +583,29 @@ object QueryPlanSerde extends Logging with CometExprShim with CometTypeShim {
case Unsupported(notes) =>
withInfo(fn, notes.getOrElse(""))
None
case Incompatible(notes) =>
case Incompatible(notes, optedInBy) =>
val exprAllowIncompat = CometConf.isExprAllowIncompat(exprConfName)
if (exprAllowIncompat) {
val namedConfOptIn = optedInBy.exists(isOptedInVia)
if (exprAllowIncompat || namedConfOptIn) {
if (notes.isDefined) {
logWarning(
s"Comet supports $fn when " +
s"${CometConf.getExprAllowIncompatConfigKey(exprConfName)}=true " +
s"but has notes: ${notes.get}")
val optInDesc = if (namedConfOptIn) {
optedInBy.get
} else {
s"${CometConf.getExprAllowIncompatConfigKey(exprConfName)}=true"
}
logWarning(s"Comet supports $fn when $optInDesc but has notes: ${notes.get}")
}
aggHandler.convert(aggExpr, fn, inputs, binding, conf)
} else {
val optionalNotes = notes.map(str => s" ($str)").getOrElse("")
val extraOptIn = optedInBy
.map(kv => s" or by setting $kv")
.getOrElse("")
withInfo(
fn,
s"$fn is not fully compatible with Spark$optionalNotes. To enable it anyway, " +
s"set ${CometConf.getExprAllowIncompatConfigKey(exprConfName)}=true. " +
s"${CometConf.COMPAT_GUIDE}.")
s"set ${CometConf.getExprAllowIncompatConfigKey(exprConfName)}=true" +
s"$extraOptIn. ${CometConf.COMPAT_GUIDE}.")
None
}
case Compatible(notes) =>
@@ -672,6 +681,21 @@ object QueryPlanSerde extends Logging with CometExprShim with CometTypeShim {
exprToProtoInternal(newExpr, inputs, binding)
}

/**
* True when the current SQLConf has the named config set to the given value. The argument is a
* `key=value` string used by `Incompatible.optedInBy` to declare which config opts the user
* into running an otherwise-incompatible expression. The configured value is compared
* case-insensitively after splitting on the first `=`.
*/
private def isOptedInVia(keyEqualsValue: String): Boolean = {
keyEqualsValue.split("=", 2) match {
case Array(key, expected) =>
Option(SQLConf.get.getConfString(key, null))
.exists(_.equalsIgnoreCase(expected))
case _ => false
}
}

/**
* Convert a Spark expression to a protocol-buffer representation of a native Comet/DataFusion
* expression.
@@ -705,23 +729,29 @@ object QueryPlanSerde extends Logging with CometExprShim with CometTypeShim {
case Unsupported(notes) =>
withInfo(expr, notes.getOrElse(""))
None
case Incompatible(notes) =>
case Incompatible(notes, optedInBy) =>
val exprAllowIncompat = CometConf.isExprAllowIncompat(exprConfName)
if (exprAllowIncompat) {
val namedConfOptIn = optedInBy.exists(isOptedInVia)
if (exprAllowIncompat || namedConfOptIn) {
if (notes.isDefined) {
logWarning(
s"Comet supports $expr when " +
s"${CometConf.getExprAllowIncompatConfigKey(exprConfName)}=true " +
s"but has notes: ${notes.get}")
val optInDesc = if (namedConfOptIn) {
optedInBy.get
} else {
s"${CometConf.getExprAllowIncompatConfigKey(exprConfName)}=true"
}
logWarning(s"Comet supports $expr when $optInDesc but has notes: ${notes.get}")
}
handler.convert(expr, inputs, binding)
} else {
val optionalNotes = notes.map(str => s" ($str)").getOrElse("")
val extraOptIn = optedInBy
.map(kv => s" or by setting $kv")
.getOrElse("")
withInfo(
expr,
s"$expr is not fully compatible with Spark$optionalNotes. To enable it anyway, " +
s"set ${CometConf.getExprAllowIncompatConfigKey(exprConfName)}=true. " +
s"${CometConf.COMPAT_GUIDE}.")
s"set ${CometConf.getExprAllowIncompatConfigKey(exprConfName)}=true" +
s"$extraOptIn. ${CometConf.COMPAT_GUIDE}.")
None
}
case Compatible(notes) =>