fix: prefer embedded PDF text over OCR for hi_res table tokens by claytonlin1110 · Pull Request #4347 · Unstructured-IO/unstructured

claytonlin1110 · 2026-04-28T20:26:44Z

Summary

Fixes table-token generation in the hi_res pipeline so that, when infer_table_structure=True, embedded PDF text (extracted_regions) takes priority over OCR output.
Preserves OCR as the fallback for scanned/image-only documents.
Adds a regression test to ensure embedded text is used when both sources are available.

Closes #4092

Root cause

supplement_element_with_table_extraction() passed extracted_regions through the pipeline, but get_table_tokens() ignored it and always generated tokens from OCR output.
This made digital PDFs susceptible to OCR hallucinations and character substitutions in table cells.

What changed

Updated get_table_tokens() signature and logic to accept:
- extracted_regions
- table_bbox
- padding
Added _get_table_tokens_from_extracted_regions() to:
- select text regions inside table bbox
- convert coordinates into cropped-table space
- keep stable reading order
Updated caller in supplement_element_with_table_extraction() to pass required context.
Added test_get_table_tokens_prefers_extracted_regions_over_ocr().
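For illustration, the three steps listed for _get_table_tokens_from_extracted_regions() could look roughly like the sketch below. This is not the PR's code: the Region class, the tokens_from_regions name, the center-point containment rule, and the assumption that the crop origin sits at the table bbox's top-left corner minus the padding are all hypothetical. Only the span_num, line_num, and block_num fields are taken from the diff and review comments.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """Hypothetical stand-in for one embedded-text region in page coordinates."""
    text: str
    x1: float
    y1: float
    x2: float
    y2: float


def tokens_from_regions(regions, table_bbox, padding=0.0):
    """Build table tokens from embedded-text regions that fall inside table_bbox."""
    tx1, ty1, tx2, ty2 = table_bbox
    # Assumed origin of the cropped table image, in page coordinates.
    origin_x, origin_y = tx1 - padding, ty1 - padding

    selected = []
    for region in regions:
        if not region.text.strip():
            continue  # ignore empty or whitespace-only text
        # Step 1: keep a region when its center lies inside the table bbox.
        cx = (region.x1 + region.x2) / 2
        cy = (region.y1 + region.y2) / 2
        if tx1 <= cx <= tx2 and ty1 <= cy <= ty2:
            selected.append(region)

    # Step 3: sorted() is stable, so regions with equal (top, left) keep input order.
    selected = sorted(selected, key=lambda r: (r.y1, r.x1))

    # Step 2: shift page coordinates into cropped-table space.
    return [
        {
            "text": r.text.strip(),
            "bbox": [r.x1 - origin_x, r.y1 - origin_y, r.x2 - origin_x, r.y2 - origin_y],
            "span_num": i,
            "line_num": 0,
            "block_num": 0,
        }
        for i, r in enumerate(selected)
    ]
```

Filtering before sorting and numbering means span_num is assigned only to tokens that are actually emitted.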

Note

Medium Risk
Changes table token generation in the hi_res PDF table-extraction path, which can affect table-structure inference outputs; fallback logic helps, but coverage heuristics could shift behavior on edge-case PDFs.

Overview
Updates hi_res PDF table extraction to generate tatr table OCR tokens from embedded PDF text (extracted_regions) when it provides comparable coverage, falling back to OCR tokens when embedded text is missing or sparse.

This threads table context (table_bbox + crop padding) into get_table_tokens(), adds selection/coordinate-mapping helpers and a coverage heuristic, and adds regression tests for both the preferred-embedded and OCR-fallback paths. Also bumps __version__ to 0.22.28 and adds a 0.22.28 changelog entry.
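As a rough sketch of the prefer-embedded, fall-back-to-OCR decision described here: the function names and the 0.5 default thresholds are assumptions for illustration, and the token dicts are reduced to a "text" key. The shape of the ratio comparison follows the return expression quoted in the review comments.

```python
def has_comparable_coverage(extracted, ocr, token_ratio=0.5, text_ratio=0.5):
    """True when embedded text reaches the given share of OCR tokens and characters."""
    extracted_chars = sum(len(t["text"]) for t in extracted)
    ocr_chars = sum(len(t["text"]) for t in ocr)
    return (
        len(extracted) >= token_ratio * len(ocr)
        and extracted_chars >= text_ratio * ocr_chars
    )


def select_table_tokens(extracted, ocr):
    """Prefer embedded-text tokens; fall back to OCR when they are missing or sparse."""
    if extracted and has_comparable_coverage(extracted, ocr):
        return extracted
    return ocr
```

With these assumed thresholds, a digital PDF whose embedded text covers the table keeps its exact cell text, while a scanned page with no embedded text returns the OCR tokens unchanged.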

Reviewed by Cursor Bugbot for commit 321bfaa. Bugbot is set up for automated code reviews on this repo.

cursor · 2026-04-28T21:01:27Z

+                "line_num": 0,
+                "block_num": 0,
+            }
+        )


Non-contiguous span_num when tokens are skipped

Medium Severity

In _get_table_tokens_from_extracted_regions, span_num is assigned from enumerate(sorted_indices), but when a token is skipped by the continue on the degenerate-bbox check, the counter still increments, producing gaps in span_num values (e.g., [0, 2, 4] instead of [0, 1, 2]). The OCR path always produces contiguous span numbers. Downstream consumers (table transformer) receive tokens whose span_num field is inconsistent with the contract established by the OCR path, which could affect cell-to-token assignment or reading-order logic.
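One way to keep span_num contiguous is to number tokens by their position in the output list rather than by the loop index. The sketch below is a hypothetical illustration (the assign_span_nums name and the (text, bbox) input shape are assumptions), not the function from the PR.

```python
def assign_span_nums(candidates):
    """Turn (text, bbox) pairs in reading order into tokens with contiguous span_num."""
    tokens = []
    for text, (x1, y1, x2, y2) in candidates:
        if x2 <= x1 or y2 <= y1:
            continue  # degenerate bbox: skipped without consuming a span number
        tokens.append(
            {
                "text": text,
                "bbox": [x1, y1, x2, y2],
                "span_num": len(tokens),  # index among emitted tokens only
                "line_num": 0,
                "block_num": 0,
            }
        )
    return tokens
```

Because the counter is len(tokens), a skipped candidate cannot leave a gap in the numbering.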

Reviewed by Cursor Bugbot for commit 1fda8ae.

claytonlin1110 · 2026-05-05T03:48:17Z

@SudSampath Would you please review?

claytonlin1110 · 2026-05-06T06:15:55Z

@cragwolfe Would you please review?

cursor

Cursor Bugbot has reviewed your changes and found 2 potential issues.

There are 3 total unresolved issues (including 1 from previous review).

Reviewed by Cursor Bugbot for commit 321bfaa.

cursor · 2026-05-06T10:54:54Z

+    return (
+        extracted_count >= token_ratio_threshold * ocr_count
+        and extracted_chars >= text_ratio_threshold * ocr_chars
+    )


Embedded tokens lose preference

Medium Severity

The _prefer_extracted_table_tokens function compares token counts across different granularities (line-level for embedded text vs. word-level for OCR). This can cause complete embedded table text to be incorrectly rejected, leading to a fallback to OCR and reintroducing OCR-related substitutions.
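To make the granularity mismatch concrete, the sketch below normalizes both token sources to word level before applying the ratio test. The function names and the 0.5 thresholds are assumptions for illustration and are not taken from the PR.

```python
def word_level_stats(tokens):
    """Return (word_count, char_count) after splitting each token's text on whitespace."""
    words = [w for t in tokens for w in str(t["text"]).split()]
    return len(words), sum(len(w) for w in words)


def prefer_extracted(extracted, ocr, token_ratio=0.5, text_ratio=0.5):
    """Compare embedded and OCR coverage at the same (word) granularity."""
    e_count, e_chars = word_level_stats(extracted)
    o_count, o_chars = word_level_stats(ocr)
    return e_count >= token_ratio * o_count and e_chars >= text_ratio * o_chars


# One line-level embedded token versus six word-level OCR tokens for the same cell.
embedded_line = [{"text": "Total revenue for fiscal year 2024"}]
ocr_words = [{"text": w} for w in "Total revenue for fiscal year 2024".split()]
```

A raw count comparison sees 1 embedded token against 6 OCR tokens and rejects the embedded text, even though both carry the same six words. Counting words on both sides removes that bias.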

Reviewed by Cursor Bugbot for commit 321bfaa.

cursor · 2026-05-06T10:54:54Z

+
+    valid = [
+        (idx, text) for idx, text in enumerate(selected_regions.texts) if text and str(text).strip()
+    ]


Low-fidelity text can replace OCR

Medium Severity

_get_table_tokens_from_extracted_regions() accepts any non-empty PDFMiner text and ignores extraction quality. Invisible or low-fidelity OCR layers in scanned PDFs can be preferred over fresh OCR, so the intended OCR fallback for scanned documents is bypassed.
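A quality gate could sit in front of the embedded-text path. The sketch below is a hypothetical heuristic, not something the PR implements: it rejects text dominated by "(cid:N)" placeholders (which PDFMiner emits for glyphs it cannot map to Unicode), U+FFFD replacement characters, or non-printable characters. The looks_reliable name and the 10% threshold are assumptions. It would not detect a hidden OCR layer whose text is well-formed but wrong; that case would need a different signal, such as agreement with fresh OCR.

```python
import re

# PDFMiner renders unmapped glyphs as "(cid:123)"-style placeholders.
_CID_PATTERN = re.compile(r"\(cid:\d+\)")


def looks_reliable(texts, max_bad_ratio=0.1):
    """Heuristic: False when too much of the text is placeholder or garbage characters."""
    joined = "".join(texts)
    if not joined.strip():
        return False  # nothing usable to judge
    # Count every character that belongs to a "(cid:N)" placeholder as bad.
    bad = sum(len(m) for m in _CID_PATTERN.findall(joined))
    remainder = _CID_PATTERN.sub("", joined)
    # Also count replacement characters and non-printable, non-space characters.
    bad += sum(
        1
        for ch in remainder
        if ch == "\ufffd" or not (ch.isprintable() or ch.isspace())
    )
    return bad / len(joined) <= max_bad_ratio
```

If such a gate returned False for the regions inside a table, the caller could skip the embedded path and use OCR tokens as before.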

Reviewed by Cursor Bugbot for commit 321bfaa.

claytonlin1110 force-pushed the fix/pdf-hires-table-use-embedded-text-first branch from c708857 to 78e0906 on April 28, 2026 20:29

cursor Bot reviewed Apr 28, 2026

claytonlin1110 force-pushed the fix/pdf-hires-table-use-embedded-text-first branch 2 times, most recently from 6c44e97 to 945c186, on April 28, 2026 21:57

fix: prefer embedded PDF text over OCR for hi_res table tokens

321bfaa

claytonlin1110 force-pushed the fix/pdf-hires-table-use-embedded-text-first branch from 945c186 to 321bfaa on May 6, 2026 10:47

cursor Bot reviewed May 6, 2026

claytonlin1110 wants to merge 1 commit into Unstructured-IO:main from claytonlin1110:fix/pdf-hires-table-use-embedded-text-first
