
Add pdf_oxide to benchmark suite #19

Open
yfedoseev wants to merge 2 commits into py-pdf:main from yfedoseev:add-pdf-oxide


@yfedoseev


Adds pdf_oxide to the benchmark suite with text extraction and image extraction support.

Changes

  • pdf_benchmark/library_code.py: Added pdf_oxide_get_text() and pdf_oxide_image_extraction() functions (both use the same tempfile approach as pdftotext, since pdf_oxide accepts file paths)
  • benchmark.py: Registered pdf_oxide as a library with imports
  • requirements/main.in: Added pdf-oxide dependency
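
The tempfile approach mentioned above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `extract_text_via_tempfile`, `extract_from_path`, and `stand_in_extractor` are hypothetical names, and the stand-in extractor simply reads the file back in place of a real pdf_oxide call, whose API is not shown here.

```python
import os
import tempfile
from typing import Callable


def extract_text_via_tempfile(data: bytes, extract_from_path: Callable[[str], str]) -> str:
    """Write in-memory PDF bytes to a temp file, then run a path-based extractor on it."""
    # delete=False lets the extractor reopen the file by name on every platform.
    tmp = tempfile.NamedTemporaryFile(suffix=".pdf", delete=False)
    try:
        tmp.write(data)
        tmp.close()  # flush and release the handle before the extractor opens the path
        return extract_from_path(tmp.name)
    finally:
        tmp.close()  # harmless if already closed
        os.unlink(tmp.name)  # always clean up the temporary file


def stand_in_extractor(path: str) -> str:
    """Hypothetical placeholder for a library call that only accepts file paths."""
    with open(path, "rb") as handle:
        return handle.read().decode("ascii")


text = extract_text_via_tempfile(b"hello benchmark", stand_in_extractor)
```

Passing the extractor in as a callable keeps the temporary-file handling separate from whichever library performs the extraction.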

About pdf_oxide

  • Rust core with Python bindings via PyO3
  • MIT / Apache-2.0 licensed
  • Text extraction, image extraction, markdown conversion
  • v0.3.6 released: PyPI | GitHub

Adds text extraction and image extraction benchmarks for pdf_oxide,
a Rust-powered PDF library with Python bindings.

- Text extraction via tempfile (pdf_oxide accepts file paths)
- Image extraction with format-aware naming
- MIT/Apache-2.0 licensed

https://github.com/yfedoseev/pdf_oxide
https://pypi.org/project/pdf-oxide/
@bosd

bosd commented May 23, 2026


@yfedoseev This tool looks very promising.
Love the speed. But it does seem to struggle with text directions.
Hope the accuracy can go up from 80% to something in the 90s.

This was referenced May 24, 2026
Bump reported version 0.3.6 -> 0.3.57 and release date to 2026-05-30.
Text-extraction accuracy on the benchmark corpus improves substantially
in this release (column/reading-order and parser fixes).
@yfedoseev

Author

@bosd Thanks for the feedback! Accuracy has been a major focus, and the text-direction / reading-order handling has improved significantly over the last several releases.

I've bumped this PR to pdf_oxide 0.3.57. Re-running the benchmark's own scoring (Levenshtein ratio vs the ground-truth files) across all 14 documents:

| Metric | 0.3.6 (original) | 0.3.57 |
| --- | --- | --- |
| Median | 64.4% | 92.7% |
| Mean | 67.0% | 89.7% |

So the typical document is now solidly in the 90s, up from the ~80% you saw earlier. The mean is held back by a couple of harder layouts that we're still improving, and more accuracy gains are landing over the coming weeks. A re-run on your side would be very welcome to confirm.
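
For anyone re-running the numbers, here is a rough sketch of how a per-document score and the two summary statistics could be computed. It assumes the score is a normalized Levenshtein similarity, `1 - distance / max(len)`; the benchmark's own ratio may be defined differently, and all function names and sample scores here are illustrative, not taken from the benchmark code.

```python
from statistics import mean, median


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insert, delete, and substitute."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + (char_a != char_b)  # substitution or match
            ))
        previous = current
    return previous[-1]


def similarity(extracted: str, truth: str) -> float:
    """Assumed normalization: 1.0 means identical, 0.0 means nothing in common."""
    longest = max(len(extracted), len(truth))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(extracted, truth) / longest


def summarise(scores: list[float]) -> dict[str, float]:
    """Median and mean over per-document scores."""
    return {"median": median(scores), "mean": mean(scores)}


# Made-up scores: one hard layout pulls the mean down far more than the median.
summary = summarise([0.95, 0.93, 0.92, 0.40])
```

With a few low-scoring documents in the set, the mean drops well below the median, which is the pattern described in the comment above.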
