
Optimizing Variant read path with lazy caching#3481

Open
nssalian wants to merge 2 commits into apache:master from nssalian:variant-read-changes

Conversation

@nssalian

@nssalian nssalian commented Apr 15, 2026

Rationale for this change

Profiling in #3452 identified `Variant.getFieldAtIndex()` and metadata string lookups as hotspots during variant reads. Every call to `getFieldByKey`, `getFieldAtIndex`, and `getElementAtIndex` re-parses headers and re-allocates objects that could be cached.

What changes are included in this PR?

Adds lazy caching to Variant.java for metadata strings, object headers, and array headers. Field lookups in getFieldByKey now defer value construction until a match is found, and child Variants share the parent's metadata cache. Also removes two unused static helper methods.
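The lazy-caching idea can be sketched roughly like this (a minimal illustration assuming an array-backed metadata dictionary; the class and method names here are hypothetical, not the PR's actual code):

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of lazy metadata-string caching; not the PR's code.
public class LazyMetadataStrings {
  private final byte[][] rawEntries; // stand-in for the encoded metadata dictionary
  private String[] cache;            // allocated lazily, filled one entry at a time

  public LazyMetadataStrings(byte[][] rawEntries) {
    this.rawEntries = rawEntries;
  }

  public String get(int id) {
    String[] c = cache;
    if (c == null) {
      c = new String[rawEntries.length]; // only the first lookup pays the allocation
      cache = c;
    }
    if (c[id] == null) {
      c[id] = new String(rawEntries[id], StandardCharsets.UTF_8); // decode at most once per entry
    }
    return c[id];
  }
}
```

Repeated lookups of the same id then return the cached instance, so `getFieldByKey`-style probing stops re-decoding the same dictionary entries on every call.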

Includes @steveloughran's string converter optimization from #3452: VariantBuilder.appendAsString(Binary) and its use in VariantConverters.

Are these changes tested?

Ran the benchmarks from #3452 locally.

Before:

Benchmark                                   (depth)  (fieldCount)  Mode  Cnt      Score      Error  Units
VariantBuilderBenchmark.deserializeVariant     Flat           200    ss    5  11248.133 ±  696.176  us/op
VariantBuilderBenchmark.deserializeVariant   Nested           200    ss    5  15531.391 ± 1025.506  us/op

1st run:

Benchmark                                   (depth)  (fieldCount)  Mode  Cnt     Score      Error  Units
VariantBuilderBenchmark.deserializeVariant     Flat           200    ss    5  4601.967 ± 4434.474  us/op
VariantBuilderBenchmark.deserializeVariant   Nested           200    ss    5  7457.942 ± 3645.281  us/op

2nd run, after the initial thread-safety changes:

Benchmark                                   (depth)  (fieldCount)  Mode  Cnt     Score      Error  Units
VariantBuilderBenchmark.deserializeVariant     Flat           200    ss    5  6142.534 ± 2839.243  us/op
VariantBuilderBenchmark.deserializeVariant   Nested           200    ss    5  8013.900 ± 2725.291  us/op

3rd run, after trimming the changes down to a volatile on metadataCache:

Benchmark                                   (depth)  (fieldCount)  Mode  Cnt     Score      Error  Units
VariantBuilderBenchmark.deserializeVariant     Flat           200    ss    5  5103.283 ± 3785.282  us/op
VariantBuilderBenchmark.deserializeVariant   Nested           200    ss    5  7316.450 ± 3462.591  us/op

Are there any user-facing changes?

No.

@nssalian nssalian marked this pull request as ready for review April 15, 2026 15:00
Contributor

@steveloughran steveloughran left a comment


Code looks good; made some minor changes.

This should make a very big difference when selectively retrieving multiple fields within a single variant, or within a variant and nested children.

I do worry about concurrency now. The existing Variant didn't have issues here precisely because it recalculated everything.

We have to be confident that even if concurrent access triggers a duplicate cache operation, there's no harm in this. Otherwise cache access will have to be synchronized.

It all looks good to me.
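The "no harm in a duplicate cache operation" property can be illustrated with a small sketch (hypothetical names, not the PR's code): because the decode is deterministic and `String` is immutable (its final fields make even racy publication safe), racing threads may each decode an entry, but every value written is equal, so any interleaving leaves the cache correct.

```java
import java.nio.charset.StandardCharsets;

// Sketch of a benign race on an unsynchronized cache slot; illustrative only.
public class BenignRaceDemo {
  static final byte[] RAW = "field_name".getBytes(StandardCharsets.UTF_8);
  static final String[] CACHE = new String[1]; // deliberately unsynchronized

  static String lookup() {
    String s = CACHE[0];
    if (s == null) {
      s = new String(RAW, StandardCharsets.UTF_8); // deterministic decode
      CACHE[0] = s; // racing writes may clobber each other; all written values are equal
    }
    return s;
  }

  public static void main(String[] args) throws Exception {
    Thread[] threads = new Thread[4];
    final String[] seen = new String[4];
    for (int i = 0; i < 4; i++) {
      final int idx = i;
      threads[i] = new Thread(() -> seen[idx] = lookup());
      threads[i].start();
    }
    for (Thread t : threads) {
      t.join();
    }
    for (String s : seen) {
      if (!"field_name".equals(s)) {
        throw new AssertionError("a thread observed a different value");
      }
    }
    System.out.println("every thread saw an equal decoded value");
  }
}
```

The worst case is a wasted decode, exactly the tradeoff discussed above; correctness never depends on which thread's write wins.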

Comment thread parquet-variant/src/main/java/org/apache/parquet/variant/Variant.java Outdated
@alamb
Contributor

alamb commented Apr 15, 2026

I started the workflows

Co-authored-by: Steve Loughran <stevel@cloudera.com>
@nssalian nssalian force-pushed the variant-read-changes branch from 6c6db2e to 6f540f4 Compare April 15, 2026 18:05
@nssalian nssalian requested a review from steveloughran April 15, 2026 18:08
@nssalian nssalian changed the title WIP: Optimizing Variant read path with lazy caching Optimizing Variant read path with lazy caching Apr 15, 2026
Comment thread parquet-variant/src/main/java/org/apache/parquet/variant/Variant.java Outdated
Contributor

@steveloughran steveloughran left a comment


Reviewing the code again to see if it's possible to get back some of the performance lost with the move to volatile.

We're only reading and caching data, so there are no real write conflicts. Is the use of volatile everywhere being over-cautious? It forces a main-memory read on every access.

And I don't know how common cross-thread reading will actually be in production systems; in Spark each worker is its own thread, after all.

Maybe the goal should just be that all reviewers are confident that, if there are dual writers, the output will always be consistent.

cache = new String[dictSize];
metadataCache = cache;
}
if (cache[id] == null) {
Contributor


Set cache to metadataCache here, so that if there is any race condition the same cache array is updated. If two threads write to the same [id], well, one lookup is wasted. The joint cache is still (probably) updated with each value.

Author


Fixed this. It now re-reads metadataCache after assignment so concurrent threads converge on the same array.
Also removed the volatile fields. I ran some local concurrent tests with multiple iterations and confirmed the behavior. I don't see any concurrency tests in the codebase, but I'm happy to add one to the test suite if it helps build confidence in the approach here.
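The re-read pattern being described can be sketched as follows (a minimal, hypothetical rendering; the real code lives in Variant.java):

```java
// Illustrative sketch of the "re-read after assignment" fix; names are hypothetical.
public class ConvergingCache {
  private String[] metadataCache; // plain field; volatile was removed per the discussion
  private final int dictSize;

  public ConvergingCache(int dictSize) {
    this.dictSize = dictSize;
  }

  String[] cacheArray() {
    String[] cache = metadataCache;
    if (cache == null) {
      metadataCache = new String[dictSize];
      // Re-read the shared field rather than keeping a private reference:
      // if another thread published its array in the meantime, subsequent
      // writes land in whichever array the field currently holds.
      cache = metadataCache;
    }
    return cache;
  }
}
```

On a single thread the re-read is a no-op; under a race it makes the threads converge on one array instead of each populating a private copy that the other never sees.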

@steveloughran
Contributor

I would propose a javadoc on concurrency here.

Till now we've had an immutable Variant; the caching makes it mutable. But even in a race condition, since the only change which ever takes place is a decode and an update of the dictionary, if the dictionary is safe then the worst outcome is that one of the lazy evals gets lost.

Let's:
- make sure that the thread doing the lazy eval retains the values it needs for the duration of get()
- look at the dictionary impl.

Java HashMap is not thread safe; the very old Hashtable is, but it may carry its own penalty when used.

The Rust variant implementation puts a lot of effort into memory efficiency too. Maybe we should make sure that these changes don't completely explode memory consumption. I know, it's a tradeoff (speed, space, code complexity), and queries like speed. I think the focus should be "single-thread speed and no inconsistency on multithreaded use", since single-threaded workers are what the query engines use. After all, they shouldn't expect the input streams to be thread-safe, should they? Two threads doing parallel reads of a stream is already making some big assumptions about the underlying layers (*)

(*) Hadoop input streams are thread-safe precisely because code makes those assumptions, FWIW. Going thread-unsafe broke HBase.
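On the Hashtable point: if a map-backed dictionary were ever chosen over the plain array, the modern thread-safe option is ConcurrentHashMap, whose computeIfAbsent runs the decode at most once per key without Hashtable's method-wide locking. A hypothetical sketch (names are illustrative, not from the PR):

```java
import java.util.concurrent.ConcurrentHashMap;

// Illustrative map-backed alternative to the array cache; not the PR's code.
public class DictCache {
  private final ConcurrentHashMap<Integer, String> cache = new ConcurrentHashMap<>();

  public String get(int id) {
    // computeIfAbsent is atomic per key: concurrent callers for the same id
    // wait briefly rather than duplicating the decode.
    return cache.computeIfAbsent(id, DictCache::decode);
  }

  private static String decode(int id) {
    return "entry-" + id; // stand-in for the real dictionary decode
  }
}
```

The tradeoff versus the array is exactly the memory/complexity concern raised above: per-entry map nodes and boxing cost more than a flat `String[]`, which is presumably why the PR keeps the array.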

