[Enhancement] Enable mixed partial/final execution for approx_count_distinct (HyperLogLogPlusPlus) #4820

Description

@andygrove

Background

The initial approx_count_distinct support (PR #4819) stores its HyperLogLog++ registers in Spark's exact packed-Long buffer layout (numWords Long columns, 10 six-bit registers per word), so Comet's partial-aggregation state is byte-identical to Spark's HyperLogLogPlusPlus.aggBufferSchema.
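For readers unfamiliar with the layout, here is a minimal Python model of the packed-Long register scheme described above. It is an illustration only, not Comet's or Spark's code. The precision formula and the `numWords = m / 10 + 1` sizing are my reading of Spark's `HyperLogLogPlusPlusHelper` and should be treated as assumptions; the function names are hypothetical.

```python
import math

REGISTER_SIZE = 6                             # bits per HLL++ register
REGISTERS_PER_WORD = 64 // REGISTER_SIZE      # 10 registers fit in one 64-bit Long
REGISTER_MASK = (1 << REGISTER_SIZE) - 1      # 0b111111


def num_words(relative_sd=0.05):
    """Number of Long columns in the buffer (assumed to mirror Spark's sizing)."""
    p = math.ceil(2.0 * math.log(1.106 / relative_sd) / math.log(2.0))
    m = 1 << p                                # number of registers
    return m // REGISTERS_PER_WORD + 1


def get_register(words, idx):
    """Read the 6-bit register `idx` from the packed word list."""
    word_offset = idx // REGISTERS_PER_WORD
    shift = REGISTER_SIZE * (idx - word_offset * REGISTERS_PER_WORD)
    return (words[word_offset] >> shift) & REGISTER_MASK


def set_register(words, idx, value):
    """Overwrite the 6-bit register `idx` in place."""
    word_offset = idx // REGISTERS_PER_WORD
    shift = REGISTER_SIZE * (idx - word_offset * REGISTERS_PER_WORD)
    cleared = words[word_offset] & ~(REGISTER_MASK << shift)
    words[word_offset] = cleared | ((value & REGISTER_MASK) << shift)


buffer = [0] * num_words()                    # all-zero initial aggregation state
set_register(buffer, 13, 7)                   # register 13 lives in word 1, slot 3
```

Only 60 of the 64 bits in each word are used, so every word stays non-negative when viewed as a signed Long.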

CometApproxCountDistinct currently leaves supportsMixedPartialFinal at the default false, which means that whenever a plan has an approx_count_distinct at a Comet/Spark boundary, allAggsSupportMixedExecution forces both the partial and final aggregate onto the same engine.
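To make the gating concrete, here is a toy Python model of the behavior described above. The real check lives in Comet's Scala planner; everything here (`AggSupport`, `plan_engines`, the fallback to Spark when engines cannot be split) is a hypothetical simplification, and only the flag and function names echo the identifiers mentioned in this issue.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AggSupport:
    name: str
    # Mirrors the default of `false` described above.
    supports_mixed_partial_final: bool = False


def all_aggs_support_mixed_execution(aggs):
    """True only if every aggregate in the plan opts in to mixed execution."""
    return all(a.supports_mixed_partial_final for a in aggs)


def plan_engines(aggs, partial_native_ok, final_native_ok):
    """Return (partial_engine, final_engine) under this simplified model."""
    if partial_native_ok and final_native_ok:
        return ("comet", "comet")
    if all_aggs_support_mixed_execution(aggs):
        # Stages may be split across engines.
        return (
            "comet" if partial_native_ok else "spark",
            "comet" if final_native_ok else "spark",
        )
    # Assumption: when splitting is not allowed, both stages fall back to Spark.
    return ("spark", "spark")


today = [AggSupport("min", True), AggSupport("approx_count_distinct")]
proposed = [AggSupport("min", True), AggSupport("approx_count_distinct", True)]
```

Under this model, `plan_engines(today, True, False)` keeps both stages on Spark, while `plan_engines(proposed, True, False)` lets the partial stage stay native.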

Proposal

Because the intermediate buffer format now matches Spark exactly, supportsMixedPartialFinal should be safe to set to true (as CometMin, CometMax, and the bitwise aggregates already do). This would let Comet accelerate the partial aggregate even when the final falls back to Spark (and vice versa), broadening native coverage.

Work required

  • Set supportsMixedPartialFinal = true in CometApproxCountDistinct.
  • Add a partial-merge test (see spark/src/test/resources/sql-tests/expressions/aggregate/partial_merge.sql) that exercises Comet-partial + Spark-final and Spark-partial + Comet-final, confirming the result stays bit-identical to Spark.
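As a rough illustration of what the interop test needs to confirm: the HyperLogLog merge step is a register-wise max, so if both engines emit the same packed words, the merged buffer should not depend on which engine produced which partial. The Python sketch below models that property on the packed layout; the helper names are hypothetical and this is not the actual test, which would run through the SQL test file referenced above.

```python
REGISTER_SIZE = 6
REGISTERS_PER_WORD = 10
REGISTER_MASK = (1 << REGISTER_SIZE) - 1


def unpack(word):
    """Split one packed word into its 10 six-bit registers."""
    return [(word >> (REGISTER_SIZE * i)) & REGISTER_MASK
            for i in range(REGISTERS_PER_WORD)]


def pack(registers):
    """Pack up to 10 register values into one word, lowest slot first."""
    word = 0
    for i, r in enumerate(registers):
        word |= (r & REGISTER_MASK) << (REGISTER_SIZE * i)
    return word


def merge_words(a, b):
    """Register-wise max of two packed words."""
    return pack([max(x, y) for x, y in zip(unpack(a), unpack(b))])


def merge_buffers(left, right):
    """Merge two partial-aggregation buffers of equal length."""
    if len(left) != len(right):
        raise ValueError("buffers must have the same number of words")
    return [merge_words(a, b) for a, b in zip(left, right)]


# Stand-ins for a partial buffer from each engine.
comet_partial = [pack([3, 0, 5]), pack([1])]
spark_partial = [pack([1, 4, 2]), pack([0, 9])]
merged = merge_buffers(comet_partial, spark_partial)
```

Note that a plain `max` over whole words would be wrong, since a larger high-slot register can mask a smaller low-slot one; the merge has to compare register by register.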

This was deferred from PR #4819 because it needs interop verification rather than being a trivial one-line change.
