DataistOS/zir

Zir (a diacritic cleaner)

Zir is a lightweight command-line utility for identifying and cleaning unwanted Persian diacritics, such as Kasra (U+0650) and Tanvin (here the Kasratan form, U+064D), and other junk characters in your text files. It is particularly useful for cleaning up text generated by Large Language Models (LLMs).

🚀 Why Zir?

AI models often introduce hidden characters or incorrect diacritics into Persian text. Zir helps you keep documentation and source code files consistent by finding these issues quickly.

🛠 Usage

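Run the following commands from your project root to scan it. They rely on bash (brace expansion, and printf \u escapes, which need bash 4.2 or later) and on a UTF-8 locale.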

1. Identify files with diacritics and count occurrences

This command lists all affected files and shows the number of matching lines in each (grep -c counts lines, not individual characters):

grep -rIE "$(printf '[\u0650\u064D]')" . --exclude-dir={_build,build,logs,venv,.venv,.git} --files-with-matches --null | xargs -0 -r grep -cHE "$(printf '[\u0650\u064D]')"
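
Because grep -c counts matching lines, a line that contains three Kasras is counted once. The following is a minimal sketch of counting every occurrence instead, using grep -o. The function name count_diacritics is hypothetical and not part of Zir, and the patterns are written as the raw UTF-8 bytes of U+0650 and U+064D so that the match does not depend on printf supporting \u escapes.

```shell
# Hypothetical helper, not part of Zir: count every Kasra/Kasratan in one file.
KASRA=$(printf '\331\220')      # UTF-8 bytes of U+0650 ARABIC KASRA
KASRATAN=$(printf '\331\215')   # UTF-8 bytes of U+064D ARABIC KASRATAN

count_diacritics() {
  # -a: treat the input as text
  # -o: print each match on its own line, so wc -l counts occurrences
  # -F: fixed-string patterns, one per -e
  LC_ALL=C grep -aoF -e "$KASRA" -e "$KASRATAN" -- "$1" | wc -l | tr -d ' '
}
```

Calling count_diacritics README.md prints a single number. Matching on bytes under LC_ALL=C is safe here because a UTF-8 byte sequence for one character never appears inside the encoding of another.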

2. Detailed scan with line numbers

This command prints every matching line with its file name and line number, then the total number of matching lines:

grep --color=always -rnIE "$(printf '[\u0650\u064D]')" . --exclude-dir={_build,build,logs,venv,.venv,.git} | tee /dev/tty | wc -l
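
The pipeline above writes to /dev/tty, which is not available in non-interactive environments such as CI jobs. As a rough sketch of a script-friendly variant, the function below prints the same file:line:content listing plus a total, and returns a non-zero status when anything is found. The name scan_tree and the summary wording are assumptions made for illustration.

```shell
# Hypothetical helper, not part of Zir: scan a directory tree without /dev/tty.
KASRA=$(printf '\331\220')      # UTF-8 bytes of U+0650 ARABIC KASRA
KASRATAN=$(printf '\331\215')   # UTF-8 bytes of U+064D ARABIC KASRATAN

scan_tree() {
  scan_dir=${1:-.}
  # -r: recurse, -n: line numbers, -I: skip binary files, -F: fixed strings.
  # grep exits with 1 when nothing matches, hence the "|| true".
  scan_hits=$(LC_ALL=C grep -rnIF -e "$KASRA" -e "$KASRATAN" \
    --exclude-dir=_build --exclude-dir=build --exclude-dir=logs \
    --exclude-dir=venv --exclude-dir=.venv --exclude-dir=.git \
    -- "$scan_dir") || true
  if [ -n "$scan_hits" ]; then
    printf '%s\n' "$scan_hits"
    printf 'Total matching lines: %s\n' \
      "$(printf '%s\n' "$scan_hits" | wc -l | tr -d ' ')"
    return 1   # signal that unwanted characters were found
  fi
  return 0
}
```

In a CI step, scan_tree . would then fail the job whenever a Kasra or Kasratan is present outside the excluded directories.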

⚙️ Configuration

The commands skip the _build, build, logs, venv, .venv, and .git directories through the --exclude-dir option. To skip more directories, add their names to that list. To match other characters, add their code points to the bracket expression passed to printf.
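
One way to make both lists easier to edit is to keep them in one place and build the grep arguments from them. This is only a sketch under stated assumptions: the function name zir_scan, the extra node_modules entry, and the added Fatha (U+064E) are illustrative choices, not part of Zir.

```shell
# Hypothetical wrapper, not part of Zir. Edit both lists to suit your project.
EXCLUDE_DIRS="_build build logs venv .venv .git node_modules"

zir_scan() {
  zir_dir=${1:-.}
  set --   # reuse the positional parameters as the grep argument list
  for d in $EXCLUDE_DIRS; do
    set -- "$@" "--exclude-dir=$d"
  done
  # One -e per character, written as raw UTF-8 bytes (octal escapes).
  set -- "$@" -e "$(printf '\331\220')"   # U+0650 ARABIC KASRA
  set -- "$@" -e "$(printf '\331\215')"   # U+064D ARABIC KASRATAN
  set -- "$@" -e "$(printf '\331\216')"   # U+064E ARABIC FATHA (example addition)
  # -r: recurse, -l: list file names only, -I: skip binary files, -F: fixed strings
  LC_ALL=C grep -rlIF "$@" -- "$zir_dir"
}
```

Running zir_scan . lists the affected files. Adding a directory means appending one word to EXCLUDE_DIRS, and adding a character means adding one more set line.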

🤝 Contributing

Feel free to open an issue or submit a pull request if you want to add support for more character types or improve the scanning logic.
