Add pdf_oxide to benchmark suite by yfedoseev · Pull Request #19 · py-pdf/benchmarks

yfedoseev · 2026-02-16T22:00:12Z

Adds pdf_oxide to the benchmark suite with text extraction and image extraction support.

Changes

- pdf_benchmark/library_code.py: Added pdf_oxide_get_text() and pdf_oxide_image_extraction() functions (these use a tempfile approach, same as pdftotext, since pdf_oxide accepts file paths)
- benchmark.py: Registered pdf_oxide as a library with imports
- requirements/main.in: Added pdf-oxide dependency
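The tempfile approach mentioned above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `extract_via_tempfile` and the `extract_from_path` callback are hypothetical names, and the callback stands in for whatever path-based call the library actually exposes.

```python
import os
import tempfile
from typing import Callable


def extract_via_tempfile(data: bytes, extract_from_path: Callable[[str], str]) -> str:
    """Write in-memory PDF bytes to a temp file and run a path-based extractor on it.

    `extract_from_path` is a hypothetical stand-in for a library call that
    only accepts a file path (no assumption is made about the real API).
    """
    # delete=False lets the extractor reopen the file by name on every
    # platform; the file is removed explicitly in the finally block.
    with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as tmp:
        tmp.write(data)
        path = tmp.name
    try:
        return extract_from_path(path)
    finally:
        os.remove(path)
```

Closing the file before handing its path to the extractor ensures all bytes are flushed to disk, and the explicit cleanup keeps repeated benchmark runs from leaving temp files behind.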

About pdf_oxide

- Rust core with Python bindings via PyO3
- MIT / Apache-2.0 licensed
- Text extraction, image extraction, markdown conversion
- v0.3.6 released: PyPI | GitHub

Adds text extraction and image extraction benchmarks for pdf_oxide, a Rust-powered PDF library with Python bindings.

- Text extraction via tempfile (pdf_oxide accepts file paths)
- Image extraction with format-aware naming
- MIT/Apache-2.0 licensed

https://github.com/yfedoseev/pdf_oxide
https://pypi.org/project/pdf-oxide/
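The commit message does not show how "format-aware naming" is implemented. One plausible reading is that each extracted image gets a file extension matching its actual format. The sketch below illustrates that idea by sniffing well-known magic bytes; the function name, the `image_NNNN` naming scheme, and the `.bin` fallback are assumptions made for illustration.

```python
# Well-known file signatures (magic bytes) mapped to file extensions.
_SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", ".png"),
    (b"\xff\xd8\xff", ".jpg"),
    (b"GIF87a", ".gif"),
    (b"GIF89a", ".gif"),
    (b"II*\x00", ".tiff"),  # little-endian TIFF
    (b"MM\x00*", ".tiff"),  # big-endian TIFF
]


def image_filename(index: int, data: bytes) -> str:
    """Build a file name whose extension reflects the image's detected format.

    Hypothetical helper: falls back to ".bin" when no signature matches.
    """
    for signature, extension in _SIGNATURES:
        if data.startswith(signature):
            return f"image_{index:04d}{extension}"
    return f"image_{index:04d}.bin"
```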

bosd · 2026-05-23T20:46:38Z

@yfedoseev This tool looks very promising.
Love the speed, but it does seem to struggle with text direction.
I hope the accuracy can go up from 80% to somewhere in the 90s.

yfedoseev · 2026-05-30T22:50:26Z

@bosd Thanks for the feedback! Accuracy has been a major focus, and the text-direction / reading-order handling has improved significantly over the last several releases.

I've bumped this PR to pdf_oxide 0.3.57. Re-running the benchmark's own scoring (Levenshtein ratio vs the ground-truth files) across all 14 documents:

Metric	0.3.6 (original)	0.3.57
Median	64.4%	92.7%
Mean	67.0%	89.7%

So the typical document is now solidly in the 90s, up from the ~80% you saw earlier. The mean is held back by a couple of harder layouts that we're still improving, and more accuracy gains are landing over the coming weeks. A re-run on your side would be very welcome to confirm.
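For readers who want to reproduce this kind of summary, here is a minimal sketch of a Levenshtein-based similarity score and the median/mean aggregation. The normalisation used (1 minus distance divided by the longer length) is an assumption; the benchmark's own scoring code may define the ratio differently. The scores in the example are made-up values, not the results reported above.

```python
from statistics import mean, median


def levenshtein_distance(a: str, b: str) -> int:
    """Edit distance where insertion, deletion and substitution each cost 1."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def similarity_ratio(extracted: str, truth: str) -> float:
    """Assumed normalisation: 1 - distance / max(len); 1.0 means identical."""
    longest = max(len(extracted), len(truth))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein_distance(extracted, truth) / longest


# Hypothetical per-document scores (NOT the PR's data): two weak documents
# pull the mean down while the median stays with the typical document.
scores = [0.95, 0.94, 0.93, 0.92, 0.60, 0.55]
print(f"median={median(scores):.3f} mean={mean(scores):.3f}")
```

Reporting both statistics is useful here because the median describes the typical document while the mean is sensitive to a few hard layouts, which matches the gap described in the comment.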

Update pdf_oxide to 0.3.57

89b6fea

Bump reported version 0.3.6 -> 0.3.57 and release date to 2026-05-30. Text-extraction accuracy on the benchmark corpus improves substantially in this release (column/reading-order and parser fixes).

yfedoseev wants to merge 2 commits into py-pdf:main from yfedoseev:add-pdf-oxide