HIVE-29503: Prevent Join cardinality overestimation of joins with NDV(0) columns #6356
konstantinb wants to merge 8 commits into
| input vertices: | ||
| 1 Map 2 | ||
| Statistics: Num rows: 4 Data size: 1184 Basic stats: COMPLETE Column stats: COMPLETE | ||
| Statistics: Num rows: 2 Data size: 325 Basic stats: COMPLETE Column stats: NONE |
At the moment, the NDV of datetime/timestamp columns is not assigned to colstats even when it is available. Changing that would improve this estimate; however, doing so affects over 100 .out files, so perhaps it belongs in a separate story?
These were introduced by #6361.
…ng the join product row count with an NDV of 0
Hi @zabetak, this is a companion to #6418, addressing the same NDV=0 "unknown stats" problem but in the join cardinality estimator rather than GROUP BY.

The bug: when join keys have NDV=0 on both sides (common for binary columns, columns without populated NDV, or giant tables with incomplete stats), the estimate degenerates to the product of the input row counts. For two 100M-row tables, that's 100M × 100M = 10^16, a full cartesian product estimate for an equi-join. This cascades into downstream operators (aggregations, subsequent joins) and typically forces suboptimal plans, because the join output appears astronomically larger than it actually is.

The fix intercepts after PK-FK inference fails but before the NDV-based denominator path, and applies the existing conservative heuristic instead.

Would you be willing to take a look when you have time? Happy to provide additional context or adjust the approach. Thanks!
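To make the arithmetic in the comment above concrete, here is a minimal, self-contained sketch. It is not Hive's code: the class name, method names, and the fallback rule (capping at the larger input's row count) are hypothetical placeholders chosen for illustration, and the actual heuristic the patch reuses may differ. The sketch assumes the textbook equi-join estimate `rows(L) * rows(R) / max(ndv(L), ndv(R))` with the denominator clamped to at least 1.

```java
final class JoinEstimateSketch {

    // Multiply without overflowing: saturate at Long.MAX_VALUE instead.
    static long saturatingMultiply(long a, long b) {
        try {
            return Math.multiplyExact(a, b);
        } catch (ArithmeticException overflow) {
            return Long.MAX_VALUE;
        }
    }

    // Textbook equi-join estimate. An NDV of 0 clamps the denominator to 1,
    // so the result becomes the full cartesian product of the inputs.
    static long ndvBasedEstimate(long leftRows, long rightRows,
                                 long leftNdv, long rightNdv) {
        long denominator = Math.max(1L, Math.max(leftNdv, rightNdv));
        return saturatingMultiply(leftRows, rightRows) / denominator;
    }

    // Guarded variant: when NDV is unknown (0) on both sides, skip the
    // NDV-based path. The fallback used here (the larger input's row count)
    // is a HYPOTHETICAL stand-in, not the heuristic used by the patch.
    static long guardedEstimate(long leftRows, long rightRows,
                                long leftNdv, long rightNdv) {
        if (leftNdv <= 0 && rightNdv <= 0) {
            return Math.max(leftRows, rightRows);
        }
        return ndvBasedEstimate(leftRows, rightRows, leftNdv, rightNdv);
    }

    public static void main(String[] args) {
        long rows = 100_000_000L; // two 100M-row tables, as in the comment
        System.out.println("unguarded: " + ndvBasedEstimate(rows, rows, 0, 0));
        System.out.println("guarded:   " + guardedEstimate(rows, rows, 0, 0));
    }
}
```

With both NDVs at 0, the unguarded path returns 10^16 for two 100M-row inputs, while the guarded path stays within the size of the larger input. When either NDV is known, both paths agree.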



What changes were proposed in this pull request?
HIVE-29503: Use existing conservative heuristic for estimating join product for queries with unknown (0) NDV(s)
Why are the changes needed?
The result file generated with the original code shows a massive cardinality explosion for a fairly small query. Overestimates of this kind frequently lead to suboptimal query plans.
Does this PR introduce any user-facing change?
No
How was this patch tested?
Test code, new test files, and a private Hive implementation.