[SPARK-57436][SQL] Support start and occurrence parameters in instr function#56498
[SPARK-57436][SQL] Support start and occurrence parameters in instr function#56498loftiest wants to merge 3 commits into
Conversation
|
|
||
| override def nullSafeEval(string: Any, sub: Any, start: Any, occurrence: Any): Any = { | ||
| val occ = occurrence.asInstanceOf[Int] | ||
| if (occ <= 0) { |
There was a problem hiding this comment.
ANSI-gated error for occurrence <= 0 is inconsistent with convention. INVALID_PARAMETER_VALUE.* errors are usage errors and are conventionally thrown unconditionally, not gated on spark.sql.ansi.enabled (see e.g. BIT_POSITION_RANGE, PATTERN, etc.). Returning null in non-ANSI mode is surprising and differs from how a clearly-invalid argument is normally treated.
I'd recommend either always throwing. cc @cloud-fan @MaxGekk @srielau for this design decision
| }, | ||
| "OCCURRENCE" : { | ||
| "message" : [ | ||
| "expects a positive value for `occurrence`, but got <actual>." |
There was a problem hiding this comment.
Error message restates the parameter name and hardcodes backticks. The parent INVALID_PARAMETER_VALUE message already prepends The value of parameter(s) occurrence in instr is invalid: using the / params you pass. Restating for occurrence is redundant with the convention used by every other subclass (they say e.g. "expects a binary value..." with no parameter name). Suggestion to use instead: "expects a positive integer, but got ."
| } | ||
| } | ||
|
|
||
| public static class StringInstr4 { |
There was a problem hiding this comment.
Nit: StringInstr4 is a poor class name. Naming the expression after its arity (4) is unusual for the codebase and will read oddly in stack traces / future maintenance. Something like StringInstrWithOccurrence (or folding the logic so there's one expression) would be more clear.
There was a problem hiding this comment.
I've renamed StringInstr4 to StringInstrWithOccurrence, both in the expression class and in CollationSupport. This makes the purpose clearer.
| since = "1.5.0", | ||
| group = "string_funcs") | ||
| // scalastyle:on line.size.limit | ||
| object StringInstrExpressionBuilder extends ExpressionBuilder { |
There was a problem hiding this comment.
There are some unused auxiliary constructors. StringInstr4 defines three def this(...) overloads (2/3/4-arg), but StringInstrExpressionBuilder constructs the case class directly with explicit Literal(1) defaults, so these constructors appear to be dead code. Either use them from the builder or drop them.
There was a problem hiding this comment.
Thanks for catching that. I've removed the unused def this(...) overloads.
| IntegerType | ||
| ) | ||
|
|
||
| override def contextIndependentFoldable: Boolean = super.contextIndependentFoldable |
There was a problem hiding this comment.
This looks like a no-op override (it just calls super). It was copied from StringInstr (which has the same pointless line), but there's no need to propagate it.
There was a problem hiding this comment.
Removed the no-op override. Thanks!
| public int indexOf(UTF8String pattern, int start, int occurrence) { | ||
| assert occurrence > 0; | ||
| if (pattern.numBytes() == 0) { | ||
| return indexOfEmpty(start); |
There was a problem hiding this comment.
There is one thing that I don't particularly like here, and it's reinforcing an old issue (noted in https://issues.apache.org/jira/browse/SPARK-48284).
The empty-pattern path delegates to indexOfEmpty(start), which unconditionally returns 0 (→ result 1) regardless of sign. So instr('hello', '', -1) returns 1, which is a bit arbitrary for a backward search.
I guess we're consistent with the old semantics (2-arg behavior) in the current PR, but I would suggest to keep this in mind for future improvements.
In any case, at least add some tests for this @loftiest.
There was a problem hiding this comment.
I've added tests for empty substring with negative start.
Co-authored-by: Uros Bojanic <221401595+uros-b@users.noreply.github.com>
…e, clean up, add tests - Rename StringInstr4 to StringInstrWithOccurrence for clarity. - Remove unused auxiliary constructors (def this(...) overloads). - Remove no-op contextIndependentFoldable override. - Add unit tests for instr with empty substring and negative start. - Ensure collation-aware code references the updated class name.
What changes were proposed in this pull request?
This PR extends the
instrfunction to accept optionalstartandoccurrenceparameters, supporting forward and backward search with full collation awareness. The changes include:StringInstr4that handles 3‑ and 4‑argument invocations, while the existing two‑argumentinstrcontinues to useStringInstr.StringInstrExpressionBuilderthat routes calls based on the number of arguments: 2 arguments →StringInstr, 3 or 4 arguments →StringInstr4.UTF8String.indexOf(pattern, start, occurrence)with both forward (positivestart) and backward (negativestart) search, using efficient single‑pass UTF‑8 byte scanning.CollationAwareUTF8String.lowercaseIndexOfand ICUindexOfto support multi‑occurrence search forUTF8_BINARY_LCASEand ICU collations.FunctionRegistryto use the newStringInstrExpressionBuilder.functions.scalaand corresponding Python APIs (pyspark.sql.functions.builtin) with proper default‑parameter handling.OCCURRENCEfor invalid (≤ 0) occurrence values.The behavior when
substris empty remains unchanged (returnsstartposition), preserving backward compatibility with the existing two‑argumentinstr.Why are the changes needed?
Currently, Spark's
instronly supports two arguments, forcing users to write complex workarounds when they need to start searching from a specific position or find a particular occurrence. Oracle, Impala, and Db2 all provide a four‑argumentINSTRwith these capabilities. This PR aligns Spark SQL with those databases, simplifies user queries, and reduces migration friction.Does this PR introduce any user-facing change?
Yes, but only additive.
instr(str, substr [, start [, occurrence]])wherestartandoccurrenceare optional integers (default 1). Negativestartperforms backward search.occurrence <= 0.Documentation for the new parameters is included in the expression description and will be surfaced in SQL function docs.
How was this patch tested?
StringFunctionsSuite,CollationAwareUTF8StringSuite, andUTF8StringSuitecovering:'aa'in'aaa')UTF8_BINARY,UTF8_BINARY_LCASE,UNICODE_CI(including sigma variants, Germanß, and Turkishİexpansion)start = 0,occurrence <= 0,startout of range, empty substring)SparkThrowableSuite.Was this patch authored or co-authored using generative AI tooling?
DeepSeek