
Optimizing Variant read path with lazy caching#3481

Open
nssalian wants to merge 2 commits into apache:master from nssalian:variant-read-changes

Conversation

@nssalian

@nssalian nssalian commented Apr 15, 2026

Rationale for this change

Profiling in #3452 identified `Variant.getFieldAtIndex()` and metadata string lookups as hotspots during variant reads. Every call to `getFieldByKey`, `getFieldAtIndex`, and `getElementAtIndex` re-parses headers and re-allocates objects that could be cached.

What changes are included in this PR?

Adds lazy caching to Variant.java for metadata strings, object headers, and array headers. Field lookups in getFieldByKey now defer value construction until a match is found, and child Variants share the parent's metadata cache. Also removes two unused static helper methods.
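The lazy-caching idea can be sketched roughly like this (a minimal illustration assuming an array-backed metadata dictionary; the class and method names here are hypothetical, not the PR's actual code):

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of lazy metadata-string caching; not the PR's code.
public class LazyMetadataStrings {
  private final byte[][] rawEntries; // stand-in for the encoded metadata dictionary
  private String[] cache;            // allocated lazily, filled one entry at a time

  public LazyMetadataStrings(byte[][] rawEntries) {
    this.rawEntries = rawEntries;
  }

  public String get(int id) {
    String[] c = cache;
    if (c == null) {
      c = new String[rawEntries.length]; // only the first lookup pays the allocation
      cache = c;
    }
    if (c[id] == null) {
      c[id] = new String(rawEntries[id], StandardCharsets.UTF_8); // decode at most once per entry
    }
    return c[id];
  }
}
```

Repeated lookups of the same id then return the cached instance, so `getFieldByKey`-style probing stops re-decoding the same dictionary entries on every call.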

Includes @steveloughran's string converter optimization from #3452: VariantBuilder.appendAsString(Binary) and its use in VariantConverters.

Are these changes tested?

Ran the benchmarks from #3452 locally.

Before:

Benchmark                                   (depth)  (fieldCount)  Mode  Cnt      Score      Error  Units
VariantBuilderBenchmark.deserializeVariant     Flat           200    ss    5  11248.133 ±  696.176  us/op
VariantBuilderBenchmark.deserializeVariant   Nested           200    ss    5  15531.391 ± 1025.506  us/op

1st run:

Benchmark                                   (depth)  (fieldCount)  Mode  Cnt     Score      Error  Units
VariantBuilderBenchmark.deserializeVariant     Flat           200    ss    5  4601.967 ± 4434.474  us/op
VariantBuilderBenchmark.deserializeVariant   Nested           200    ss    5  7457.942 ± 3645.281  us/op

2nd run, after the initial thread-safety changes:

Benchmark                                   (depth)  (fieldCount)  Mode  Cnt     Score      Error  Units
VariantBuilderBenchmark.deserializeVariant     Flat           200    ss    5  6142.534 ± 2839.243  us/op
VariantBuilderBenchmark.deserializeVariant   Nested           200    ss    5  8013.900 ± 2725.291  us/op

3rd run, after trimming the changes down to a volatile on metadataCache:

Benchmark                                   (depth)  (fieldCount)  Mode  Cnt     Score      Error  Units
VariantBuilderBenchmark.deserializeVariant     Flat           200    ss    5  5103.283 ± 3785.282  us/op
VariantBuilderBenchmark.deserializeVariant   Nested           200    ss    5  7316.450 ± 3462.591  us/op

Are there any user-facing changes?

No.

@nssalian nssalian marked this pull request as ready for review April 15, 2026 15:00
Contributor

@steveloughran steveloughran left a comment


Code looks good; made some minor changes.

This should make a very big difference when selectively retrieving multiple fields within a single variant, or within a variant and nested children.

I do worry about concurrency now. The existing Variant didn't have issues here precisely because it recalculated everything.

We have to be confident that even if concurrent access triggers a duplicate cache operation, there's no harm in this. Otherwise cache access will have to be synchronized.

It all looks good to me.
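The "no harm in a duplicate cache operation" property can be illustrated with a small sketch (hypothetical names, not the PR's code): because the decode is deterministic and `String` is immutable (its final fields make even racy publication safe), racing threads may each decode an entry, but every value written is equal, so any interleaving leaves the cache correct.

```java
import java.nio.charset.StandardCharsets;

// Sketch of a benign race on an unsynchronized cache slot; illustrative only.
public class BenignRaceDemo {
  static final byte[] RAW = "field_name".getBytes(StandardCharsets.UTF_8);
  static final String[] CACHE = new String[1]; // deliberately unsynchronized

  static String lookup() {
    String s = CACHE[0];
    if (s == null) {
      s = new String(RAW, StandardCharsets.UTF_8); // deterministic decode
      CACHE[0] = s; // racing writes may clobber each other; all written values are equal
    }
    return s;
  }

  public static void main(String[] args) throws Exception {
    Thread[] threads = new Thread[4];
    final String[] seen = new String[4];
    for (int i = 0; i < 4; i++) {
      final int idx = i;
      threads[i] = new Thread(() -> seen[idx] = lookup());
      threads[i].start();
    }
    for (Thread t : threads) {
      t.join();
    }
    for (String s : seen) {
      if (!"field_name".equals(s)) {
        throw new AssertionError("a thread observed a different value");
      }
    }
    System.out.println("every thread saw an equal decoded value");
  }
}
```

The worst case is a wasted decode, exactly the tradeoff discussed above; correctness never depends on which thread's write wins.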

Comment thread parquet-variant/src/main/java/org/apache/parquet/variant/Variant.java Outdated
@alamb
Contributor

alamb commented Apr 15, 2026

I started the workflows

Co-authored-by: Steve Loughran <stevel@cloudera.com>
@nssalian nssalian force-pushed the variant-read-changes branch from 6c6db2e to 6f540f4 Compare April 15, 2026 18:05
@nssalian nssalian requested a review from steveloughran April 15, 2026 18:08
@nssalian nssalian changed the title WIP: Optimizing Variant read path with lazy caching Optimizing Variant read path with lazy caching Apr 15, 2026
Comment thread parquet-variant/src/main/java/org/apache/parquet/variant/Variant.java Outdated
Contributor

@steveloughran steveloughran left a comment


Reviewing the code again to see if it's possible to get back some of the performance lost with the move to volatile.

We're only reading and caching data, so there are no real write conflicts. Is the use of volatile everywhere being over-cautious? It forces a main-memory read on every access.

And I don't know how common cross-thread reading will actually be in production systems; in Spark each worker is its own thread, after all.

Maybe the goal should just be that all reviewers are confident that, if there are dual writers, the output will always be consistent.

cache = new String[dictSize];
metadataCache = cache;
}
if (cache[id] == null) {
Contributor


Set cache to metadataCache here, so that if there is any race condition the same cache array is updated. If two threads write to the same [id], well, one lookup is wasted. The joint cache is still (probably) updated with each value.

Author


Fixed this. It now re-reads metadataCache after assignment so concurrent threads converge on the same array.
Also removed the volatile fields. I ran some local concurrent tests with multiple iterations and confirmed the behavior. I don't see any concurrency tests in the codebase, but I'm happy to add one to the test suite if it helps build confidence in the approach here.
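The re-read pattern being described can be sketched as follows (a minimal, hypothetical rendering; the real code lives in Variant.java):

```java
// Illustrative sketch of the "re-read after assignment" fix; names are hypothetical.
public class ConvergingCache {
  private String[] metadataCache; // plain field; volatile was removed per the discussion
  private final int dictSize;

  public ConvergingCache(int dictSize) {
    this.dictSize = dictSize;
  }

  String[] cacheArray() {
    String[] cache = metadataCache;
    if (cache == null) {
      metadataCache = new String[dictSize];
      // Re-read the shared field rather than keeping a private reference:
      // if another thread published its array in the meantime, subsequent
      // writes land in whichever array the field currently holds.
      cache = metadataCache;
    }
    return cache;
  }
}
```

On a single thread the re-read is a no-op; under a race it makes the threads converge on one array instead of each populating a private copy that the other never sees.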

@steveloughran
Contributor

I would propose a javadoc on concurrency here.

Till now we've had an immutable Variant; the caching makes it mutable. But even in a race condition, since the only change which ever takes place is a decode and an update of the dictionary, if the dictionary is safe then the worst outcome is that one of the lazy evals gets lost.

Let's:
- make sure that the thread doing the lazy eval retains the values it needs for the duration of get()
- look at the dictionary impl.

Java HashMap is not thread safe; the very old Hashtable is, but it may carry its own penalty when used.

The Rust variant implementation puts a lot of effort into memory efficiency too. Maybe we should make sure that these changes don't completely explode memory consumption. I know, it's a tradeoff (speed, space, code complexity), and queries like speed. I think the focus should be "single-thread speed and no inconsistency on multithreaded use", since single-threaded workers are what the query engines use. After all, they shouldn't expect the input streams to be thread-safe, should they? Two threads doing parallel reads of a stream is already making some big assumptions about the underlying layers (*)

(*) Hadoop input streams are thread-safe precisely because code makes those assumptions, FWIW. Going thread-unsafe broke HBase.
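On the Hashtable point: if a map-backed dictionary were ever chosen over the plain array, the modern thread-safe option is ConcurrentHashMap, whose computeIfAbsent runs the decode at most once per key without Hashtable's method-wide locking. A hypothetical sketch (names are illustrative, not from the PR):

```java
import java.util.concurrent.ConcurrentHashMap;

// Illustrative map-backed alternative to the array cache; not the PR's code.
public class DictCache {
  private final ConcurrentHashMap<Integer, String> cache = new ConcurrentHashMap<>();

  public String get(int id) {
    // computeIfAbsent is atomic per key: concurrent callers for the same id
    // wait briefly rather than duplicating the decode.
    return cache.computeIfAbsent(id, DictCache::decode);
  }

  private static String decode(int id) {
    return "entry-" + id; // stand-in for the real dictionary decode
  }
}
```

The tradeoff versus the array is exactly the memory/complexity concern raised above: per-entry map nodes and boxing cost more than a flat `String[]`, which is presumably why the PR keeps the array.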

