fix: update Wikipedia dataset config from 20220301.en to 20231101.en #252
kukudan wants to merge 3 commits into
No actionable comments were generated in the recent review. 🎉
✅ Files skipped from review due to trivial changes (1)
📝 Walkthrough
Documentation example for streaming large datasets was updated: the Wikimedia Wikipedia snapshot was bumped to
Changes
Data Management Documentation Update
Estimated code review effort: 🎯 1 (Trivial) | ⏱️ ~2 minutes
🚥 Pre-merge checks: ✅ 5 passed
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
phases/00-setup-and-tooling/09-data-management/docs/en.md (1)
143-151: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win
Add language tag to code block.
As per coding guidelines, every fenced code block needs a language tag. This .gitignore example should have a language tag such as `text` or `gitignore`.
📝 Proposed fix
    -```
    +```text
     *.bin
     *.safetensors
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@phases/00-setup-and-tooling/09-data-management/docs/en.md` around lines 143 - 151, The fenced code block showing ignore patterns (the block starting with ``` and containing lines like *.bin, *.safetensors, *.pt, *.onnx, data/*.parquet, data/*.csv, models/) needs a language tag; change the opening fence from ``` to ```text or ```gitignore so the block becomes a tagged code block (e.g., ```text) and preserve the existing lines inside unchanged.
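To find other untagged blocks of the kind flagged above, a small script can scan a Markdown file for opening fences that carry no language tag. The following is a minimal sketch, not part of this PR or of CodeRabbit: the function name `untagged_fences` is hypothetical, and the fence handling is a simplified reading of the CommonMark rules (it ignores fences nested in lists or block quotes).

```python
import re

# An opening or closing fence: up to 3 spaces of indent, then 3+ backticks or tildes.
FENCE = re.compile(r"^ {0,3}(`{3,}|~{3,})(.*)$")


def untagged_fences(markdown: str) -> list[int]:
    """Return 1-based line numbers of opening fences that have no language tag."""
    missing = []
    open_fence = None  # (fence character, fence length) while inside a block
    for lineno, line in enumerate(markdown.splitlines(), start=1):
        match = FENCE.match(line)
        if not match:
            continue
        marker, info = match.group(1), match.group(2).strip()
        if open_fence is None:
            # This line opens a block; an empty info string means no tag.
            open_fence = (marker[0], len(marker))
            if not info:
                missing.append(lineno)
        elif marker[0] == open_fence[0] and len(marker) >= open_fence[1] and not info:
            # Same fence character, at least as long, no info string: closes the block.
            open_fence = None
    return missing


# Build the sample without literal fences so this example stays easy to embed.
fence = "`" * 3
sample = "\n".join(
    ["intro", fence, "*.bin", "*.safetensors", fence, "", fence + "python", "print('hi')", fence]
)
flagged = untagged_fences(sample)
```

Here `flagged` holds the line number of the untagged `.gitignore`-style block only; the block tagged `python` is not reported.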
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 7ff6738b-278b-4fe7-99cb-dd32df562eb3
📒 Files selected for processing (1)
phases/00-setup-and-tooling/09-data-management/docs/en.md
Problem
The Wikipedia dataset config `20220301.en` referenced in Lesson 09 (Data Management) Step 3 no longer exists. Running the code as written produces:
    ValueError: BuilderConfig '20220301.en' not found. Available: ['20231101.ab', '20231101.ace', ..., '20231101.en', ...]
The dataset has been re-exported with the `20231101` snapshot, making the `20220301` config unavailable.
Fix
Updated the config from `20220301.en` to `20231101.en` in:
- `phases/00-setup-and-tooling/09-data-management/docs/en.md` (line 60)
Verification
Tested with `datasets==4.8.5` and the updated config loads successfully:
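The verification snippet is not reproduced in this text. As a rough sketch of what such a check might look like, the code below assumes the dataset lives at `wikimedia/wikipedia` on the Hugging Face Hub (the repository ID is an assumption, not stated in this PR) and that the `datasets` library is installed. The helper names `pick_config` and `stream_wikipedia` are hypothetical. `pick_config` shows one way to avoid hardcoding a snapshot that may later be retired: choose the newest `<YYYYMMDD>.<lang>` name from the list of available configs, such as the one printed in the `ValueError` above.

```python
def pick_config(available, lang="en"):
    """Return the newest '<YYYYMMDD>.<lang>' config name for the given language."""
    matches = []
    for name in available:
        date, sep, code = name.partition(".")
        if sep and code == lang and date.isdigit():
            matches.append(name)
    if not matches:
        raise ValueError(f"no config found for language {lang!r}")
    # YYYYMMDD prefixes sort chronologically when compared as strings.
    return max(matches)


def stream_wikipedia(config="20231101.en", repo="wikimedia/wikipedia"):
    """Open the dataset in streaming mode (needs network access and `datasets`).

    The repo ID is an assumption; the config name is the one from this PR.
    """
    from datasets import load_dataset  # imported lazily so pick_config works offline

    return load_dataset(repo, config, split="train", streaming=True)


# Offline demonstration using config names of the shape shown in the error message.
available = ["20231101.ab", "20231101.ace", "20231101.en"]
chosen = pick_config(available, "en")
```

With network access, `next(iter(stream_wikipedia(chosen)))` should return the first record without downloading the full dataset. The live list of config names can be fetched with `datasets.get_dataset_config_names(repo)` and passed to `pick_config` in place of the hardcoded list.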