fix(deepdoc): silence ORT CUDA EP load failure on CUDA-13 hosts (#15687) #15701
Rene0422 wants to merge 3 commits into
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
deepdoc/vision/ocr.py (2)
86-114: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
The CUDA fallback warning is still emitted once per model load, not once per worker.
`cuda_is_available()` is recreated on each `load_model()` miss, and `det.onnx` / `rec.onnx` use different cache keys. On a CUDA-13 host this will log the same warning twice in one worker, which misses the PR goal of a single actionable warning per worker. Cache the probe result/warning at module scope or per `device_id`.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@deepdoc/vision/ocr.py` around lines 86 - 114, The CUDA probe in cuda_is_available() is run every time load_model() misses (and det.onnx/rec.onnx use different cache keys), causing duplicate warnings; change it to cache the boolean probe result and warning state at module scope keyed by device_id (or a single module-level sentinel if device_id is unused) so cuda_is_available() returns the cached value and emits the warning only the first time per worker/device_id; update any callers (e.g., load_model(), det.onnx, rec.onnx paths) to call cuda_is_available() unchanged but rely on the cached result.
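The caching suggested in this comment could look roughly like the sketch below. It is not the PR's code: `_run_probe` and `cuda_usable` are hypothetical names, and the probe body is a stand-in that always reports failure. The sketch relies on `functools.lru_cache` so that the probe and its warning run once per `device_id` within a worker process.

```python
import functools
import logging


def _run_probe(device_id: int) -> bool:
    """Hypothetical stand-in for the real CUDA/cuDNN probe.

    It always reports failure here so the warning path is exercised.
    """
    return False


@functools.lru_cache(maxsize=None)
def cuda_usable(device_id: int = 0) -> bool:
    """Run the probe once per device_id and cache the boolean result.

    Because the warning lives inside the cached function, it is logged
    only on the first call for a given device_id in this process.
    """
    ok = _run_probe(device_id)
    if not ok:
        logging.warning(
            "CUDA EP unavailable on device %d; falling back to CPU", device_id
        )
    return ok
```

With this shape, loading `det.onnx` and `rec.onnx` in the same worker would both call `cuda_usable(0)`, and only the first call would run the probe and log. Note that `lru_cache` keys on the arguments as passed, so callers should pass `device_id` consistently (always positional or always by keyword).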
390-447: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift
`close()` / `__del__()` still do not release the cached ONNX sessions.
Deleting `self.predictor` only drops the instance reference. `load_model()` keeps the same `(sess, run_options)` tuple alive in module-global `loaded_models`, so these new finalizers will not actually reclaim the session or GPU memory until process exit. If cleanup is part of this change, the cache needs ref-counting or explicit eviction too.
Also applies to: 540-579
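One way to picture the ref-counting this comment asks for is the sketch below. It is an illustration under assumptions, not the project's `loaded_models` cache: `RefCountedCache`, `acquire`, and `release` are hypothetical names, and the cached value is an arbitrary object rather than a real ONNX session.

```python
import threading


class RefCountedCache:
    """Hypothetical cache that evicts an entry when its last user releases it."""

    def __init__(self):
        self._lock = threading.Lock()
        self._entries = {}  # key -> [value, reference count]

    def acquire(self, key, factory):
        """Return the cached value for key, creating it with factory() if absent."""
        with self._lock:
            entry = self._entries.get(key)
            if entry is None:
                entry = [factory(), 0]
                self._entries[key] = entry
            entry[1] += 1
            return entry[0]

    def release(self, key):
        """Drop one reference; return True if the entry was evicted."""
        with self._lock:
            entry = self._entries.get(key)
            if entry is None:
                return False
            entry[1] -= 1
            if entry[1] <= 0:
                # The cache no longer holds the value, so it can be
                # garbage-collected once callers drop their references too.
                del self._entries[key]
                return True
            return False

    def __len__(self):
        with self._lock:
            return len(self._entries)
```

In such a design, a `close()` method would call `release()` with the same key it used to acquire the session, so the last instance to close is the one that lets the session be reclaimed.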
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@deepdoc/vision/ocr.py`:
- Around line 753-754: The docstring for OCR.__call__ claims it returns
(filtered_boxes, [(text, score), ...], time_dict) but the implementation returns
a single list via list(zip(...)); update the docstring to describe the actual
return value (a list of (box, (text, score)) or whatever the list(zip(...))
contains) or alternatively change the implementation to return the documented
tuple (filtered_boxes, texts_scores_list, time_dict). Locate OCR.__call__ and
either edit its docstring to match the output of list(zip(...)) or modify the
return statement to assemble and return (filtered_boxes, list(zip(...)),
time_dict) so the API and docs are consistent.
Summary
Fixes #15687.
After bumping `onnxruntime-gpu` from `1.19.2` → `1.23.2` (commit `f128a1fa9`, shipped in v0.24.0 and v0.25.x), users running RAGFlow on hosts with CUDA 13 see the following errors on every OCR model load:
Root cause
`onnxruntime-gpu==1.23.2` (the only pre-built x86_64 Linux wheel on PyPI) is built against the CUDA 12 ABI and `dlopen`s `libcublasLt.so.12` / `libcudnn.so.9` at provider-registration time. The Docker image (`ubuntu:24.04` base) bundles no CUDA user-mode libs; it relies on `nvidia-container-toolkit` to inject them from the host at `/usr/lib/x86_64-linux-gnu/`. On a CUDA-13 host the toolkit injects `libcublasLt.so.13` / `libcudnn.so.10`, so the cu12 SONAMEs the ORT wheel needs are nowhere on `LD_LIBRARY_PATH` and provider registration fails.
`cuda_is_available()` in `deepdoc/vision/ocr.py` decided whether to request `CUDAExecutionProvider` based solely on `torch.cuda.is_available()`. Torch only needs `libcuda.so.1` (the driver lib, backwards-compatible), so it is happy on a CUDA-13 host, but ORT then fails the actual CUDA EP load, prints two warnings per model, and silently falls back to CPU.
Fix
Before reporting CUDA as available, probe with `ctypes.CDLL` for the exact cu12 SONAMEs ORT will need. If either is missing, log one actionable warning and return `False` so the existing CPU code path is taken explicitly. GPU inference is unchanged when the cu12 libs are present (CUDA-12 host or future bundled wheels).
This is a targeted, dependency-free fix. It avoids:
- a bump of `onnxruntime-gpu` (no cu13 stable wheel on default PyPI yet);
- new `nvidia-*-cu12` wheels (would add ~1 GB to the image);
- `LD_LIBRARY_PATH` setup in `entrypoint.sh`.
Users who want GPU inference on a CUDA-13 host now get a clear single-line hint and can either install the cu12 user-mode libs in the container or switch to a CUDA-12 host. Users on CUDA-12 hosts see no change. Users with no GPU at all see one cleaner warning instead of two ORT errors per model.
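The probe described in this section can be sketched as follows. This is a minimal illustration, not the code in the PR: the helper names `missing_shared_libs` and `ort_cuda_usable` are hypothetical, the warning text is invented for the example, and the torch check is passed in as a plain boolean so the sketch has no third-party dependencies. Only the two SONAMEs and the use of `ctypes.CDLL` come from the description above.

```python
import ctypes
import logging

# SONAMEs named in the PR description as required by the cu12 ORT wheel.
REQUIRED_CU12_LIBS = ("libcublasLt.so.12", "libcudnn.so.9")


def missing_shared_libs(sonames=REQUIRED_CU12_LIBS):
    """Return the SONAMEs that the dynamic loader cannot resolve."""
    missing = []
    for name in sonames:
        try:
            ctypes.CDLL(name)  # raises OSError if the library cannot be loaded
        except OSError:
            missing.append(name)
    return missing


def ort_cuda_usable(torch_cuda_ok, sonames=REQUIRED_CU12_LIBS):
    """Hypothetical helper: True only if torch sees CUDA and all libs load.

    torch_cuda_ok stands in for the result of torch.cuda.is_available().
    """
    if not torch_cuda_ok:
        # No GPU visible to torch: skip the probe entirely.
        return False
    missing = missing_shared_libs(sonames)
    if missing:
        # Illustrative message only; the real warning text may differ.
        logging.warning(
            "%s not found; falling back to CPUExecutionProvider",
            ", ".join(missing),
        )
        return False
    return True
```

Passing the torch result in as an argument keeps the library probe separate from the GPU check, which also makes the CPU-only case trivially cheap because no `dlopen` is attempted.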
Test plan
- CUDA-12 host: OCR models load with `CUDAExecutionProvider` (look for `load_model ... uses GPU` log lines).
- CUDA-13 host: OCR models load with `CPUExecutionProvider` (look for `load_model ... uses CPU`) and the `libcublasLt.so.12 not found` warning appears once per worker instead of the two ORT errors per model.
- CPU-only host: OCR models load with `CPUExecutionProvider` with no extra warnings (the probe runs only when `torch.cuda.is_available()` is true, so CPU hosts are unaffected).
Files changed
- `deepdoc/vision/ocr.py`: `cuda_is_available()` now probes for `libcublasLt.so.12` / `libcudnn.so.9` after the torch check.
Type of change