HIVE-29503: Prevent Join cardinality overestimation of joins with NDV(0) columns #6356

Open
konstantinb wants to merge 8 commits into apache:master from konstantinb:HIVE-29503

@konstantinb konstantinb commented Mar 12, 2026

What changes were proposed in this pull request?

HIVE-29503: Use existing conservative heuristic for estimating join product for queries with unknown (0) NDV(s)

Why are the changes needed?

The result file generated with the original code shows a massive cardinality explosion for a fairly small query. Overestimates of this kind frequently lead to suboptimal query plans.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Nest code, new test files, and a private Hive implementation

@konstantinb konstantinb changed the title from "HIVE-29503: Use the fallback of half the number of rows when estimating the join product row count with an NDV of 0" to "HIVE-29503: Prevent Join cardinality overestimation of joins with NDV(0) columns" Mar 23, 2026
@konstantinb konstantinb marked this pull request as ready for review March 23, 2026 23:19
input vertices:
1 Map 2
Statistics: Num rows: 4 Data size: 1184 Basic stats: COMPLETE Column stats: COMPLETE
Statistics: Num rows: 2 Data size: 325 Basic stats: COMPLETE Column stats: NONE
Contributor Author


At the moment, the NDV of datetime/timestamp columns is not assigned to colstats even when it is available. Changing that would improve this estimate; however, doing so impacts over 100 .out files, so perhaps it belongs in a separate story?

sonarqubecloud Bot commented Apr 9, 2026

konstantinb commented May 11, 2026

Hi @zabetak — this is a companion to #6418, addressing the same NDV=0 "unknown stats" problem but in the join cardinality estimator rather than GROUP BY.

The bug: when join keys have NDV=0 on both sides (common for binary columns, columns without populated NDV, or giant tables with incomplete stats), getDenominator returns 0, which computeRowCountAssumingInnerJoin replaces with 1. The join formula then becomes:

  result = otherSideRows × (maxRows / 1)

For two 100M-row tables, that's 100M × 100M = 10^16 — a full cartesian product estimate for an equi-join. This cascades into downstream operators (aggregations, subsequent joins) and typically forces suboptimal plans by making the join output appear astronomically larger than it actually is.
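
To make the arithmetic concrete, here is a minimal sketch of the formula as described above. It is a simplified model, not the actual Hive code: the class name `NdvZeroJoinEstimate` and the method `estimateInnerJoinRows` are hypothetical, and the only behavior it assumes is what this comment states (a denominator of 0 is replaced with 1, then the largest input is divided by it and multiplied by the remaining inputs).

```java
import java.util.List;

public class NdvZeroJoinEstimate {

    // Hypothetical, simplified model of the inner-join estimate described above:
    // result = product(other inputs) * (maxRows / denominator), with 0 -> 1.
    static long estimateInnerJoinRows(List<Long> parentRows, long denominator) {
        // A denominator of 0 (unknown NDV) is replaced with 1, as in the bug description.
        long safeDenominator = (denominator == 0) ? 1 : denominator;

        // Find the largest input.
        int maxIdx = 0;
        for (int i = 1; i < parentRows.size(); i++) {
            if (parentRows.get(i) > parentRows.get(maxIdx)) {
                maxIdx = i;
            }
        }
        double factor = (double) parentRows.get(maxIdx) / (double) safeDenominator;

        // Multiply the row counts of all other inputs.
        double otherSideRows = 1.0;
        for (int i = 0; i < parentRows.size(); i++) {
            if (i != maxIdx) {
                otherSideRows *= parentRows.get(i);
            }
        }

        // Clamp instead of overflowing if the estimate exceeds the long range.
        double estimate = otherSideRows * factor;
        return estimate >= (double) Long.MAX_VALUE ? Long.MAX_VALUE : (long) estimate;
    }

    public static void main(String[] args) {
        List<Long> rows = List.of(100_000_000L, 100_000_000L);
        // NDV unknown on both sides: denominator 0 becomes 1, giving 10^16 rows.
        System.out.println(estimateInnerJoinRows(rows, 0L));
        // With a usable denominator (NDV = 100M) the same join is estimated at 100M rows.
        System.out.println(estimateInnerJoinRows(rows, 100_000_000L));
    }
}
```

The second call is only there for contrast: with a meaningful denominator the same formula stays in the range of the inputs, so the blow-up comes entirely from the 0 to 1 substitution.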

The fix intercepts after PK-FK inference fails but before the NDV-based denominator path, and applies the existing hive.stats.join.factor heuristic (default 1.1× the largest input). This is the same conservative estimate already used in the "no column statistics at all" branch — just triggered earlier when we can detect that NDV=0 makes the denominator meaningless.
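
A rough sketch of the intended behavior, for illustration only. The class `NdvZeroJoinFallback`, the method `estimate`, and the choice to key the fallback on a zero denominator are assumptions made for this example; the actual patch detects the NDV=0 case earlier in the estimator and may use a different condition. The 1.1 default is the value of hive.stats.join.factor quoted above.

```java
public class NdvZeroJoinFallback {

    // Default of hive.stats.join.factor as quoted in this comment.
    static final double DEFAULT_JOIN_FACTOR = 1.1;

    // Hypothetical two-input estimate: fall back to joinFactor * largest input
    // when the NDV-based denominator is 0 (unknown), otherwise use the NDV formula.
    static long estimate(long leftRows, long rightRows, long denominator, double joinFactor) {
        long maxRows = Math.max(leftRows, rightRows);
        long otherRows = Math.min(leftRows, rightRows);

        if (denominator == 0) {
            // Conservative heuristic: slightly more rows than the largest input.
            return (long) (maxRows * joinFactor);
        }

        // Regular path: otherSideRows * (maxRows / denominator).
        double result = (double) otherRows * ((double) maxRows / (double) denominator);
        return result >= (double) Long.MAX_VALUE ? Long.MAX_VALUE : (long) result;
    }

    public static void main(String[] args) {
        // Unknown NDV: about 110M rows instead of 10^16.
        System.out.println(estimate(100_000_000L, 100_000_000L, 0L, DEFAULT_JOIN_FACTOR));
        // Known NDV: the NDV-based formula is left untouched.
        System.out.println(estimate(100_000_000L, 100_000_000L, 100_000_000L, DEFAULT_JOIN_FACTOR));
    }
}
```

For the two 100M-row tables in the example, the fallback estimate is on the order of the larger input rather than the product of both inputs, which is the point of the change.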

Would you be willing to take a look when you have time? Happy to provide additional context or adjust the approach. Thanks!
