
[core] Track NaN counts in column statistics #7879

Closed
ArnavBalyan wants to merge 1 commit into apache:master from ArnavBalyan:arnavb/stats-nan-count

Conversation

@ArnavBalyan
Member

@ArnavBalyan ArnavBalyan commented May 17, 2026

Purpose

  • Spark, Flink and Iceberg all support IsNaN as a first-class predicate and use NaN counts in file/partition pruning.
  • Paimon's SimpleColStats tracks only min, max and null count today, so the manifest layer has no signal for skipping files on an IsNaN predicate.
  • Add a nanCount field to SimpleColStats and update the collectors to count NaN values, so that engines can later use the count for predicate pushdown.
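
A rough sketch of the idea above, for illustration only. `ColStatsSketch` and `NanStatsSketch` are hypothetical stand-ins, not Paimon's actual `SimpleColStats` or collector classes, and the skip rule assumes that `IsNaN` evaluates to false for null values.

```java
import java.util.List;

/** Hypothetical per-column stats holder; NOT Paimon's SimpleColStats. */
final class ColStatsSketch {
    final Double min;      // null when no non-null, non-NaN value was seen
    final Double max;      // null when no non-null, non-NaN value was seen
    final long nullCount;
    final long nanCount;   // the proposed extra counter

    ColStatsSketch(Double min, Double max, long nullCount, long nanCount) {
        this.min = min;
        this.max = max;
        this.nullCount = nullCount;
        this.nanCount = nanCount;
    }
}

/** Hypothetical collector and pruning check, assumed names throughout. */
final class NanStatsSketch {

    /** Single pass over a column: counts nulls and NaNs, tracks min/max of the rest. */
    static ColStatsSketch collect(List<Double> values) {
        Double min = null;
        Double max = null;
        long nulls = 0;
        long nans = 0;
        for (Double v : values) {
            if (v == null) {
                nulls++;
            } else if (v.isNaN()) {
                nans++; // NaN is kept out of min/max
            } else {
                if (min == null || v < min) {
                    min = v;
                }
                if (max == null || v > max) {
                    max = v;
                }
            }
        }
        return new ColStatsSketch(min, max, nulls, nans);
    }

    /** A file cannot match IsNaN(col) if it recorded zero NaN values. */
    static boolean canSkipForIsNaN(ColStatsSketch stats) {
        return stats.nanCount == 0;
    }
}
```

With only min, max and null count, a reader has no way to rule a file out for `IsNaN(col)`; the extra counter makes that a constant-time check per file.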

Tests

  • Unit tests (UT)

@ArnavBalyan
Member Author

cc @JingsongLi, could you PTAL? Thanks! :)

Contributor

@JingsongLi JingsongLi left a comment


This has a significant impact. I'd like to understand what changes this makes to the underlying storage. Is it necessary?

@ArnavBalyan
Member Author

I will do a small POC on the IsNaN pushdown improvement and on the performance of stats collection, then revisit this on the mailing list for broader consensus.

2 participants