Skip to content

[SPARK-57436][SQL] Support start and occurrence parameters in instr function#56498

Open
loftiest wants to merge 3 commits into
apache:masterfrom
loftiest:instr-4parameters
Open

[SPARK-57436][SQL] Support start and occurrence parameters in instr function#56498
loftiest wants to merge 3 commits into
apache:masterfrom
loftiest:instr-4parameters

Conversation

@loftiest

@loftiest loftiest commented Jun 14, 2026

Copy link
Copy Markdown

What changes were proposed in this pull request?

This PR extends the instr function to accept optional start and occurrence parameters, supporting forward and backward search with full collation awareness. The changes include:

  • A new expression StringInstr4 that handles 3‑ and 4‑argument invocations, while the existing two‑argument instr continues to use StringInstr.
  • A new StringInstrExpressionBuilder that routes calls based on the number of arguments: 2 arguments → StringInstr, 3 or 4 arguments → StringInstr4.
  • Extended UTF8String.indexOf(pattern, start, occurrence) with both forward (positive start) and backward (negative start) search, using efficient single‑pass UTF‑8 byte scanning.
  • Extended CollationAwareUTF8String.lowercaseIndexOf and ICU indexOf to support multi‑occurrence search for UTF8_BINARY_LCASE and ICU collations.
  • Updated FunctionRegistry to use the new StringInstrExpressionBuilder.
  • New overloads in functions.scala and corresponding Python APIs (pyspark.sql.functions.builtin) with proper default‑parameter handling.
  • New error class OCCURRENCE for invalid (≤ 0) occurrence values.
  • Comprehensive unit tests covering binary, lowercase, and ICU collations, forward/backward/overlapping searches, multi‑byte characters, and boundary conditions.

The behavior when substr is empty remains unchanged (returns start position), preserving backward compatibility with the existing two‑argument instr.

Why are the changes needed?

Currently, Spark's instr only supports two arguments, forcing users to write complex workarounds when they need to start searching from a specific position or find a particular occurrence. Oracle, Impala, and Db2 all provide a four‑argument INSTR with these capabilities. This PR aligns Spark SQL with those databases, simplifies user queries, and reduces migration friction.

Does this PR introduce any user-facing change?

Yes, but only additive.

  • New SQL syntax: instr(str, substr [, start [, occurrence]]) where start and occurrence are optional integers (default 1). Negative start performs backward search.
  • New Scala/Python API overloads with corresponding optional parameters.
  • Error raised when occurrence <= 0.
  • All existing two‑argument usage remains completely unaffected.

Documentation for the new parameters is included in the expression description and will be surfaced in SQL function docs.

How was this patch tested?

  • Added extensive unit tests to StringFunctionsSuite, CollationAwareUTF8StringSuite, and UTF8StringSuite covering:
    • Forward and backward search with multiple occurrences
    • Overlapping matches (e.g., 'aa' in 'aaa')
    • Multi‑byte and supplementary characters
    • Collation scenarios: UTF8_BINARY, UTF8_BINARY_LCASE, UNICODE_CI (including sigma variants, German ß, and Turkish İ expansion)
    • Boundary conditions (start = 0, occurrence <= 0, start out of range, empty substring)
  • Verified error condition formatting in SparkThrowableSuite.
  • Verified Python API behavior matches SQL execution through existing PySpark doctests.

Was this patch authored or co-authored using generative AI tooling?

DeepSeek

@loftiest

Copy link
Copy Markdown
Author

@uros-b @sarutak Could you please take a look at this PR when you have a chance? Thanks!


override def nullSafeEval(string: Any, sub: Any, start: Any, occurrence: Any): Any = {
val occ = occurrence.asInstanceOf[Int]
if (occ <= 0) {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ANSI-gated error for occurrence <= 0 is inconsistent with convention. INVALID_PARAMETER_VALUE.* errors are usage errors and are conventionally thrown unconditionally, not gated on spark.sql.ansi.enabled (see e.g. BIT_POSITION_RANGE, PATTERN, etc.). Returning null in non-ANSI mode is surprising and differs from how a clearly-invalid argument is normally treated.

I'd recommend either always throwing. cc @cloud-fan @MaxGekk @srielau for this design decision

},
"OCCURRENCE" : {
"message" : [
"expects a positive value for `occurrence`, but got <actual>."

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Error message restates the parameter name and hardcodes backticks. The parent INVALID_PARAMETER_VALUE message already prepends The value of parameter(s) occurrence in instr is invalid: using the / params you pass. Restating for occurrence is redundant with the convention used by every other subclass (they say e.g. "expects a binary value..." with no parameter name). Suggestion to use instead: "expects a positive integer, but got ."

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also cc @srielau for error handling ^^

}
}

public static class StringInstr4 {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: StringInstr4 is a poor class name. Naming the expression after its arity (4) is unusual for the codebase and will read oddly in stack traces / future maintenance. Something like StringInstrWithOccurrence (or folding the logic so there's one expression) would be more clear.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've renamed StringInstr4 to StringInstrWithOccurrence, both in the expression class and in CollationSupport. This makes the purpose clearer.

since = "1.5.0",
group = "string_funcs")
// scalastyle:on line.size.limit
object StringInstrExpressionBuilder extends ExpressionBuilder {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are some unused auxiliary constructors. StringInstr4 defines three def this(...) overloads (2/3/4-arg), but StringInstrExpressionBuilder constructs the case class directly with explicit Literal(1) defaults, so these constructors appear to be dead code. Either use them from the builder or drop them.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for catching that. I've removed the unused def this(...) overloads.

IntegerType
)

override def contextIndependentFoldable: Boolean = super.contextIndependentFoldable

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like a no-op override (it just calls super). It was copied from StringInstr (which has the same pointless line), but there's no need to propagate it.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed the no-op override. Thanks!

Comment thread python/pyspark/sql/functions/builtin.py Outdated
public int indexOf(UTF8String pattern, int start, int occurrence) {
assert occurrence > 0;
if (pattern.numBytes() == 0) {
return indexOfEmpty(start);

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is one thing that I don't particularly like here, and it's reinforcing an old issue (noted in https://issues.apache.org/jira/browse/SPARK-48284).

The empty-pattern path delegates to indexOfEmpty(start), which unconditionally returns 0 (→ result 1) regardless of sign. So instr('hello', '', -1) returns 1, which is a bit arbitrary for a backward search.

I guess we're consistent with the old semantics (2-arg behavior) in the current PR, but I would suggest to keep this in mind for future improvements.

In any case, at least add some tests for this @loftiest.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added tests for empty substring with negative start.

loftiest and others added 2 commits June 16, 2026 22:56
Co-authored-by: Uros Bojanic <221401595+uros-b@users.noreply.github.com>
…e, clean up, add tests

- Rename StringInstr4 to StringInstrWithOccurrence for clarity.
- Remove unused auxiliary constructors (def this(...) overloads).
- Remove no-op contextIndependentFoldable override.
- Add unit tests for instr with empty substring and negative start.
- Ensure collation-aware code references the updated class name.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants