fix: prefer embedded PDF text over OCR for hi_res table tokens #4347
claytonlin1110 wants to merge 1 commit into
            "line_num": 0,
            "block_num": 0,
        }
    )
Non-contiguous span_num when tokens are skipped
Medium Severity
In _get_table_tokens_from_extracted_regions, span_num is assigned from enumerate(sorted_indices), but when a token is skipped by the continue on the degenerate-bbox check, the counter still increments, producing gaps in span_num values (e.g., [0, 2, 4] instead of [0, 1, 2]). The OCR path always produces contiguous span numbers. Downstream consumers (table transformer) receive tokens whose span_num field is inconsistent with the contract established by the OCR path, which could affect cell-to-token assignment or reading-order logic.
Reviewed by Cursor Bugbot for commit 1fda8ae.
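One possible way to address the gap in span_num is to derive the counter from the number of tokens actually emitted rather than from enumerate(). The sketch below is illustrative only: the function name build_tokens, the (x1, y1, x2, y2, text) input shape, and the "bbox" and "text" keys are assumptions, not the PR's actual code. Only span_num, line_num, and block_num come from the diff above.

```python
def build_tokens(boxes):
    """Build table tokens with contiguous span_num values.

    `boxes` is a hypothetical list of (x1, y1, x2, y2, text) tuples,
    already sorted into reading order.
    """
    tokens = []
    for x1, y1, x2, y2, text in boxes:
        # Skip degenerate boxes (zero or negative width/height).
        if x2 <= x1 or y2 <= y1:
            continue
        tokens.append(
            {
                "bbox": [x1, y1, x2, y2],
                "text": text,
                # Counter advances only when a token is kept,
                # so skipped boxes never leave a gap.
                "span_num": len(tokens),
                "line_num": 0,
                "block_num": 0,
            }
        )
    return tokens
```

With this shape, skipping the second and fourth of five boxes yields span_num values 0, 1, 2 rather than 0, 2, 4.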
@SudSampath Would you please review?

@cragwolfe Would you please review?
Cursor Bugbot has reviewed your changes and found 2 potential issues.
There are 3 total unresolved issues (including 1 from previous review).
Reviewed by Cursor Bugbot for commit 321bfaa.
    return (
        extracted_count >= token_ratio_threshold * ocr_count
        and extracted_chars >= text_ratio_threshold * ocr_chars
    )
Embedded tokens lose preference
Medium Severity
The _prefer_extracted_table_tokens function compares token counts across different granularities (line-level for embedded text vs. word-level for OCR). This can cause complete embedded table text to be incorrectly rejected, leading to a fallback to OCR and reintroducing OCR-related substitutions.
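A minimal sketch of one way to make the comparison granularity-independent: count whitespace-separated words and non-whitespace characters on both sides, so a single line-level embedded token such as "Name Qty Price" weighs the same as three word-level OCR tokens. The function names, the "text" key, and the 0.8 default thresholds are assumptions for illustration and are not taken from the PR.

```python
def count_words(tokens):
    """Total whitespace-separated words across all token texts."""
    return sum(len(str(t.get("text", "")).split()) for t in tokens)


def count_chars(tokens):
    """Total non-whitespace characters across all token texts."""
    return sum(len("".join(str(t.get("text", "")).split())) for t in tokens)


def prefer_extracted(extracted, ocr, word_ratio_threshold=0.8, text_ratio_threshold=0.8):
    """Prefer embedded text when it covers OCR at word and character level.

    Thresholds are hypothetical defaults, not the PR's values.
    """
    if not extracted:
        return False
    return (
        count_words(extracted) >= word_ratio_threshold * count_words(ocr)
        and count_chars(extracted) >= text_ratio_threshold * count_chars(ocr)
    )
```

Under these assumptions, two line-level embedded tokens holding six words are accepted against six word-level OCR tokens, whereas a raw token-count comparison (2 versus 6) would reject them.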
    valid = [
        (idx, text) for idx, text in enumerate(selected_regions.texts) if text and str(text).strip()
    ]
Low-fidelity text can replace OCR
Medium Severity
_get_table_tokens_from_extracted_regions() accepts any non-empty PDFMiner text and ignores extraction quality. Invisible or low-fidelity OCR layers in scanned PDFs can be preferred over fresh OCR, so the intended OCR fallback for scanned documents is bypassed.
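As a rough illustration of a quality gate, the sketch below extends the non-empty filter with a check for obviously broken text: (cid:N) markers (which PDFMiner is generally understood to emit for glyphs it cannot map to Unicode), U+FFFD replacement characters, and control or unassigned code points. The helper names and the 0.2 threshold are assumptions. This only catches visibly corrupt text; it would not detect a plausible-looking but inaccurate invisible OCR layer, which would need a separate signal.

```python
import re
import unicodedata

# Marker pattern for unmapped glyphs, e.g. "(cid:12)".
_CID_MARKER = re.compile(r"\(cid:\d+\)")


def suspicious_ratio(text):
    """Fraction of non-whitespace characters that look like extraction debris."""
    stripped = "".join(text.split())
    if not stripped:
        return 1.0
    cid_chars = sum(len(m) for m in _CID_MARKER.findall(stripped))
    remainder = _CID_MARKER.sub("", stripped)
    bad = sum(
        1
        for ch in remainder
        if ch == "\ufffd" or unicodedata.category(ch) in {"Cc", "Co", "Cn"}
    )
    return (cid_chars + bad) / len(stripped)


def filter_trustworthy(texts, max_suspicious_ratio=0.2):
    """Keep (index, text) pairs that are non-empty and mostly clean.

    The threshold is a hypothetical default.
    """
    return [
        (idx, text)
        for idx, text in enumerate(texts)
        if text
        and str(text).strip()
        and suspicious_ratio(str(text)) <= max_suspicious_ratio
    ]
```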


Summary
Closes #4092
Root cause
What changed
Note
Medium Risk
Changes table token generation in the hi_res PDF table-extraction path, which can affect table-structure inference outputs; fallback logic helps, but coverage heuristics could shift behavior on edge-case PDFs.
Overview
Updates hi_res PDF table extraction to generate tatr table OCR tokens from embedded PDF text (extracted_regions) when it provides comparable coverage, falling back to OCR tokens when embedded text is missing or sparse.

This threads table context (table_bbox + crop padding) into get_table_tokens(), adds selection/coordinate-mapping helpers and a coverage heuristic, and adds regression tests for both the preferred-embedded and OCR-fallback paths. Also bumps __version__ to 0.22.28 and adds a 0.22.28 changelog entry.

Reviewed by Cursor Bugbot for commit 321bfaa. Bugbot is set up for automated code reviews on this repo.
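To make the coordinate-mapping step concrete, here is a minimal sketch of translating a page-space bounding box into the coordinate frame of a padded table crop. It assumes the crop's origin is the table's top-left corner shifted outward by the padding, that both boxes share the same units, and that no scaling or page-edge clamping is involved. The function name and signature are hypothetical and may differ from the helpers the PR adds.

```python
def map_bbox_to_table_crop(bbox, table_bbox, padding=0):
    """Translate a page-space bbox into padded-table-crop coordinates.

    bbox and table_bbox are (x1, y1, x2, y2) in the same page coordinate
    system; padding is the margin added around the table when cropping.
    """
    x1, y1, x2, y2 = bbox
    table_x1, table_y1, _, _ = table_bbox
    # Origin of the crop in page space (assumed: table corner minus padding).
    origin_x = table_x1 - padding
    origin_y = table_y1 - padding
    return (x1 - origin_x, y1 - origin_y, x2 - origin_x, y2 - origin_y)
```

For example, with a table at (100, 200, 400, 500) and a padding of 12, a token at (110, 220, 150, 240) maps to (22, 32, 62, 52) in the crop.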