Add pdf_oxide to benchmark suite #19
Adds text extraction and image extraction benchmarks for pdf_oxide, a Rust-powered PDF library with Python bindings.
- Text extraction via tempfile (pdf_oxide accepts file paths)
- Image extraction with format-aware naming
- MIT/Apache-2.0 licensed

https://github.com/yfedoseev/pdf_oxide
https://pypi.org/project/pdf-oxide/
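For readers unfamiliar with the tempfile approach mentioned in the description, here is a minimal sketch of the pattern: the benchmark holds each PDF as bytes, so a path-only extractor needs the bytes written to disk first. The helper name `extract_text_via_tempfile` and the `extract_from_path` callable are hypothetical stand-ins for illustration; this is not the code in this PR and it makes no assumptions about pdf_oxide's actual API.

```python
import os
import tempfile
from typing import Callable


def extract_text_via_tempfile(
    data: bytes, extract_from_path: Callable[[str], str]
) -> str:
    """Write PDF bytes to a temporary file and run a path-based extractor on it.

    `extract_from_path` is a hypothetical placeholder for whatever function
    of a path-only library turns a file path into text.
    """
    # delete=False lets the extractor reopen the file by path after we close
    # it, which also works on platforms that lock open temporary files.
    with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as tmp:
        tmp.write(data)
        path = tmp.name
    try:
        return extract_from_path(path)
    finally:
        # Always clean up, even if the extractor raises.
        os.remove(path)
```

The file is closed before the extractor runs and removed afterwards, so repeated benchmark runs should not leave temporary files behind.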
@yfedoseev This tool looks very promising.
Bump reported version 0.3.6 -> 0.3.57 and release date to 2026-05-30. Text-extraction accuracy on the benchmark corpus improves substantially in this release (column/reading-order and parser fixes).
@bosd Thanks for the feedback! Accuracy has been a major focus, and the text-direction / reading-order handling has improved significantly over the last several releases. I've bumped this PR to pdf_oxide 0.3.57. Re-running the benchmark's own scoring (Levenshtein ratio vs the ground-truth files) across all 14 documents:
So the typical document is now solidly in the 90s, up from the ~80% you saw earlier. The mean is held back by a couple of harder layouts that we're still improving, and more accuracy gains are landing over the coming weeks. A re-run on your side would be very welcome to confirm.
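As a rough illustration of the kind of score discussed in the comment above, here is a sketch of a Levenshtein-based similarity between extracted text and a ground-truth string. It assumes the normalization 1 - distance / max(length). The benchmark's own scoring may rely on a library that defines the ratio differently, so numbers from this sketch are not guaranteed to match the benchmark's. The function names are hypothetical.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Edit distance with unit cost for insert, delete, and substitute."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete ch_a
                    current[j - 1] + 1,      # insert ch_b
                    previous[j - 1] + cost,  # substitute or match
                )
            )
        previous = current
    return previous[-1]


def similarity_ratio(extracted: str, ground_truth: str) -> float:
    """Assumed normalization: 1.0 means identical, 0.0 means nothing shared."""
    longest = max(len(extracted), len(ground_truth))
    if longest == 0:
        return 1.0  # two empty strings count as identical
    return 1.0 - levenshtein_distance(extracted, ground_truth) / longest
```

Under this normalization, a per-document score in the 90s means the edit distance is under a tenth of the longer text's length.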
Adds pdf_oxide to the benchmark suite with text extraction and image extraction support.
Changes
- pdf_benchmark/library_code.py: Added pdf_oxide_get_text() and pdf_oxide_image_extraction() functions (uses tempfile approach, same as pdftotext, since pdf_oxide accepts file paths)
- benchmark.py: Registered pdf_oxide as a library with imports
- requirements/main.in: Added pdf-oxide dependency

About pdf_oxide