Zir is a lightweight command-line utility that identifies and cleans unwanted Persian diacritics (such as Kasra and Tanvin) and other junk characters in your text files. It is particularly useful for cleaning up text generated by Large Language Models (LLMs).
As AI models often introduce hidden characters or incorrect diacritics into Persian text, Zir helps you maintain the quality and consistency of your documentation and source code files by identifying these issues quickly.
You can use the following commands to scan your project.
This command lists all affected files and shows the number of matching lines in each:
grep -rIE "$(printf '[\u0650\u064D]')" . --exclude-dir={_build,build,logs,venv,.venv,.git} --files-with-matches | xargs grep -cE "$(printf '[\u0650\u064D]')"
This command prints every matching line with its file name and line number, followed by the total number of matching lines:
grep --color=always -rnIE "$(printf '[\u0650\u064D]')" . --exclude-dir={_build,build,logs,venv,.venv,.git} | tee /dev/tty | wc -l
Both commands skip common directories such as venv, build, and .git through --exclude-dir. You can add more directories or customize the character list directly within the commands. They assume a shell such as bash 4.2 or later, since they rely on brace expansion and on printf's \u escapes.
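The customization described above can be sketched as a small POSIX sh wrapper. This is an illustrative assumption, not part of Zir: the names zir_scan, ZIR_PATTERN, and ZIR_EXCLUDES are hypothetical, and the zero-width space (U+200B) and node_modules entries are assumed extras. The characters are written as octal UTF-8 bytes joined with | instead of \u escapes inside brackets, so the sketch does not depend on the shell's printf or on the current locale.

```shell
#!/bin/sh
# Hypothetical wrapper around the scan command; all names are illustrative.

# Characters to flag, as octal UTF-8 bytes separated by "|":
#   \331\220      U+0650 Kasra
#   \331\215      U+064D Kasratan (Tanvin)
#   \342\200\213  U+200B zero-width space (assumed extra, not in the original commands)
ZIR_PATTERN='\331\220|\331\215|\342\200\213'

# Space-separated directory names to skip; node_modules is an assumed extra.
ZIR_EXCLUDES='_build build logs venv .venv .git node_modules'

zir_scan() {
  root="${1:-.}"
  # printf turns the octal escapes in the format string into raw bytes.
  pattern="$(printf "$ZIR_PATTERN")"
  # Reuse the positional parameters as a portable argument list.
  set --
  for dir in $ZIR_EXCLUDES; do
    set -- "$@" "--exclude-dir=$dir"
  done
  # -r recurse, -n show line numbers, -I skip binary files, -E extended regex.
  grep -rnIE "$@" -e "$pattern" -- "$root"
}
```

Calling zir_scan path/to/project prints one file:line:content entry per matching line; piping the result through wc -l gives the total, as in the second command above.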
Feel free to open an issue or submit a pull request if you want to add support for more character types or improve the scanning logic.