fix: prefer embedded PDF text over OCR for hi_res table tokens #4347

Open
claytonlin1110 wants to merge 1 commit into Unstructured-IO:main from claytonlin1110:fix/pdf-hires-table-use-embedded-text-first

claytonlin1110 (Contributor) commented Apr 28, 2026

Summary

  • Fixes table-token generation in the hi_res pipeline so that it prioritizes embedded text (extracted_regions) when infer_table_structure=True.
  • Preserves OCR as a fallback for scanned/image-only documents.
  • Adds a regression test to ensure embedded text is used when both sources are available.

Closes #4092

Root cause

  • supplement_element_with_table_extraction() passed extracted_regions through the pipeline, but get_table_tokens() ignored it and always generated tokens from OCR output.
  • This made digital PDFs susceptible to OCR hallucinations/substitutions in table cells.

What changed

  • Updated get_table_tokens() signature and logic to accept:
    • extracted_regions
    • table_bbox
    • padding
  • Added _get_table_tokens_from_extracted_regions() to:
    • select text regions inside table bbox
    • convert coordinates into cropped-table space
    • keep stable reading order
  • Updated the caller in supplement_element_with_table_extraction() to pass the required context.
  • Added test_get_table_tokens_prefers_extracted_regions_over_ocr().
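
The three steps listed for the new helper (select, convert, order) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the PR's code: the function name `tokens_from_regions`, the plain-tuple region format, the `bbox`/`text` keys, and the center-point containment rule are all hypothetical. Only the `span_num`, `line_num`, and `block_num` keys are taken from the fragments quoted in this conversation.

```python
from typing import Dict, List, Tuple

BBox = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in page coordinates


def tokens_from_regions(
    regions: List[Tuple[BBox, str]],
    table_bbox: BBox,
    padding: float = 0.0,
) -> List[Dict]:
    """Hypothetical sketch: build table tokens from embedded text regions."""
    tx1, ty1, tx2, ty2 = table_bbox
    # Origin of the cropped table image, assuming the crop is the
    # table bbox grown by `padding` on every side.
    ox, oy = tx1 - padding, ty1 - padding

    selected = []
    for (x1, y1, x2, y2), text in regions:
        if not text or not text.strip():
            continue  # ignore empty or whitespace-only regions
        cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
        # Assumed containment rule: keep regions whose center is inside the table.
        if tx1 <= cx <= tx2 and ty1 <= cy <= ty2:
            # Shift page coordinates into cropped-table space.
            selected.append(((x1 - ox, y1 - oy, x2 - ox, y2 - oy), text.strip()))

    # Reading order: top-to-bottom, then left-to-right. Python's sort is
    # stable, so regions with identical keys keep their input order.
    selected.sort(key=lambda item: (item[0][1], item[0][0]))

    return [
        {"bbox": list(bbox), "text": text, "span_num": i, "line_num": 0, "block_num": 0}
        for i, (bbox, text) in enumerate(selected)
    ]
```

Numbering happens only after filtering and sorting, so `span_num` stays contiguous regardless of how many regions were dropped.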

Note

Medium Risk
Changes table token generation in the hi_res PDF table-extraction path, which can affect table-structure inference outputs; fallback logic helps, but coverage heuristics could shift behavior on edge-case PDFs.

Overview
Updates hi_res PDF table extraction to generate tatr table OCR tokens from embedded PDF text (extracted_regions) when it provides comparable coverage, falling back to OCR tokens when embedded text is missing or sparse.

This threads table context (table_bbox + crop padding) into get_table_tokens(), adds selection/coordinate-mapping helpers and a coverage heuristic, and adds regression tests for both the preferred-embedded and OCR-fallback paths. Also bumps __version__ to 0.22.28 and adds a 0.22.28 changelog entry.

Reviewed by Cursor Bugbot for commit 321bfaa. Bugbot is set up for automated code reviews on this repo.

claytonlin1110 force-pushed the fix/pdf-hires-table-use-embedded-text-first branch from c708857 to 78e0906 on April 28, 2026 20:29
"line_num": 0,
"block_num": 0,
}
)

Non-contiguous span_num when tokens are skipped

Medium Severity

In _get_table_tokens_from_extracted_regions, span_num is assigned from enumerate(sorted_indices), but when a token is skipped by the continue on the degenerate-bbox check, the counter still increments, producing gaps in span_num values (e.g., [0, 2, 4] instead of [0, 1, 2]). The OCR path always produces contiguous span numbers. Downstream consumers (table transformer) receive tokens whose span_num field is inconsistent with the contract established by the OCR path, which could affect cell-to-token assignment or reading-order logic.
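
One way to keep span numbers contiguous is to number tokens as they are kept rather than as they are visited. The sketch below assumes a simplified candidate format and a hypothetical helper name (`number_spans`); it illustrates the idea only and is not the PR's implementation.

```python
from typing import Dict, List, Tuple

BBox = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def number_spans(candidates: List[Tuple[BBox, str]]) -> List[Dict]:
    """Hypothetical sketch: assign span_num only to tokens that are kept."""
    tokens: List[Dict] = []
    for (x1, y1, x2, y2), text in candidates:
        if x2 <= x1 or y2 <= y1:
            # Degenerate bbox: skip it without consuming a span number.
            continue
        # len(tokens) counts kept tokens only, so numbering has no gaps.
        tokens.append({"bbox": [x1, y1, x2, y2], "text": text, "span_num": len(tokens)})
    return tokens
```

Using `enumerate()` over the unfiltered sequence would instead advance the counter on every iteration, including the skipped ones, which is the source of the gaps described above.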

Reviewed by Cursor Bugbot for commit 1fda8ae.

claytonlin1110 force-pushed the fix/pdf-hires-table-use-embedded-text-first branch 2 times, most recently from 6c44e97 to 945c186 on April 28, 2026 21:57
@claytonlin1110
Contributor Author

@SudSampath Would you please review?

@claytonlin1110
Contributor Author

@cragwolfe Would you please review?

claytonlin1110 force-pushed the fix/pdf-hires-table-use-embedded-text-first branch from 945c186 to 321bfaa on May 6, 2026 10:47
cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 2 potential issues.

There are 3 total unresolved issues (including 1 from previous review).

Reviewed by Cursor Bugbot for commit 321bfaa.

return (
    extracted_count >= token_ratio_threshold * ocr_count
    and extracted_chars >= text_ratio_threshold * ocr_chars
)

Embedded tokens lose preference

Medium Severity

The _prefer_extracted_table_tokens function compares token counts across different granularities (line-level for embedded text vs. word-level for OCR). This can cause complete embedded table text to be incorrectly rejected, leading to a fallback to OCR and reintroducing OCR-related substitutions.

Reviewed by Cursor Bugbot for commit 321bfaa.


valid = [
    (idx, text) for idx, text in enumerate(selected_regions.texts) if text and str(text).strip()
]

Low-fidelity text can replace OCR

Medium Severity

_get_table_tokens_from_extracted_regions() accepts any non-empty PDFMiner text and ignores extraction quality. Invisible or low-fidelity OCR layers in scanned PDFs can be preferred over fresh OCR, so the intended OCR fallback for scanned documents is bypassed.
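
A quality gate of some kind could address this. The sketch below is purely illustrative: neither the PR nor this review proposes it, and the function name, the similarity measure (Python's standard `difflib.SequenceMatcher`), and the 0.6 threshold are all assumptions. The idea is to trust embedded text only when it roughly agrees with fresh OCR, so that a garbage hidden text layer falls back to OCR while a digital PDF with a few OCR substitutions still prefers the embedded text.

```python
from difflib import SequenceMatcher
from typing import List


def embedded_text_agrees_with_ocr(
    embedded_texts: List[str],
    ocr_texts: List[str],
    min_similarity: float = 0.6,  # hypothetical threshold, would need tuning
) -> bool:
    """Hypothetical gate: accept embedded text only if it roughly matches OCR."""
    # Normalize whitespace and case so token granularity does not matter.
    embedded = " ".join(" ".join(embedded_texts).split()).lower()
    ocr = " ".join(" ".join(ocr_texts).split()).lower()
    if not embedded or not ocr:
        return False  # nothing to compare: do not prefer embedded text
    return SequenceMatcher(None, embedded, ocr).ratio() >= min_similarity
```

Because the comparison runs on the joined text, it is insensitive to the line-level versus word-level tokenization difference raised in the previous comment.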

Reviewed by Cursor Bugbot for commit 321bfaa.

Development

Successfully merging this pull request may close these issues.

bug/incorrect text extraction by partition_pdf with hi_res strategy