GH-49720: [C++] Optimize base64_decode validation using lookup table#49748
GH-49720: [C++] Optimize base64_decode validation using lookup table#49748Reranko05 wants to merge 1 commit intoapache:mainfrom
Conversation
|
|
9c55ff0 to
e48fd5c
Compare
|
Hi @kou @dmitry-chirkov-dremio, this implements the lookup-table optimization we discussed in PR #49660 . |
There was a problem hiding this comment.
Pull request overview
Optimizes C++ base64_decode character validation by replacing per-character std::string::find lookups with a precomputed 256-entry lookup table.
Changes:
- Added a static 256-entry lookup table (
kBase64Lookup) mapping bytes to Base64 sextet values (or invalid marker). - Updated
base64_decodeto validate and decode using constant-time table lookup instead ofstd::string::find.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
e48fd5c to
9735d72
Compare
|
Hi @kou, I've addressed the copilot review feedback:
Please take another look when convenient. Thanks! |
kou
left a comment
There was a problem hiding this comment.
+1
Have you checked whether this optimization improves performance or not?
|
Hi @kou, I ran a quick benchmark comparing the previous implementation and this change. Results (200 iterations): Small (~100B):
Medium (~10KB):
Large (~1MB):
The improvement is negligible for very small inputs but becomes significant as input size grows. This aligns with expectations, since the change replaces repeated linear |
|
Great! @dmitry-chirkov-dremio do you want to review this before we merge this? |
|
@Reranko05 do you have before before numbers? |
|
Hi @dmitry-chirkov-dremio, just to clarify, by "before before" do you mean comparing against the original implementation prior to #49660, or are you asking for something else? |
|
Hi @dmitry-chirkov-dremio, if by “before before” you meant the implementation prior to #49660, here are benchmarks across all three stages: Before-before (original, pre-#49660): Before (after #49660): After (this PR): The validation changes introduced some overhead (expected due to additional checks), and this optimization not only removes that regression but also improves performance beyond the original implementation by replacing repeated linear lookups with constant-time table access. |
Rationale for this change
The current implementation of
base64_decodevalidates characters usingstd::string::findfor each byte, which introduces unnecessary overhead due to repeated linear searches.This change replaces those lookups with a precomputed 256-entry lookup table, enabling constant-time validation and value lookup per character.
What changes are included in this PR?
kBase64Lookup) to map base64 characters to their corresponding valuesstd::string::findwith constant-time table lookup for character validationAre these changes tested?
Yes. Existing base64 decoding behavior remains unchanged and continues to pass all current tests. This change is a performance optimization and does not alter functional output.
Are there any user-facing changes?
No.
This change is internal and does not affect public APIs.