- ✏️ [May 22, 2026] Released a corrected paper. Check out mdm-prime for perplexity evaluation on OWT.
- 📓 [May 1, 2026] Released an errata note. The current NLL evaluation has bugs. (old preprint)
This repository contains the code implementation of the experiments presented in the paper MDM-Prime-v2: Binary Encoding and Index Shuffling Enable Scaling of Diffusion Language Models.
- 🐳 Docker environments for easy installation
- 🤗 Pretrained weights for inference and evaluation
- 📉 Weights and Biases logs for enhanced reproducibility
- 🔬 Code for all experiments in our paper:
  - Scaling Analysis
    - Folder: mdm-prime-v2/megatron
    - Dataset: allenai/c4
    - Weights & Biases Logs: lance_chao/megatron-all-runs
    - Best for: (1) Studying the loss behavior; (2) Pretraining with advanced parallelism
  - Larger-scale Pretraining
    - Folder: mdm-prime-v2/lit_gpt
    - Dataset: cerebras/SlimPajama-627B (or gmongaras/SlimPajama-627B_Reupload)
    - Best for: (1) Pretraining 1.1B models; (2) Running inference and downstream applications
- Download our docker image and launch gradio_demo.py:
# Pull and launch the docker image
docker pull chenhaochao/mdm-prime-v2-litgpt:latest
docker run -v $(pwd):/workspace --rm -it --gpus all --ipc=host -p 3000:3000 chenhaochao/mdm-prime-v2-litgpt:latest
# Install gradio and run gradio_demo.py
uv pip install gradio
/venv/mdm-prime-v2-litgpt/bin/python gradio_demo.py
- Loading the model's weights takes a few minutes. After running the commands, the demo website will be available at http://localhost:3000/.
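Because the weights take a few minutes to load, the page may not respond right away. The following is a minimal sketch, not part of this repository, of a helper that polls the demo URL until it answers. The function name `wait_for_demo` and its timeout and interval defaults are assumptions chosen for illustration; only the URL comes from the instructions above.

```python
import http.client
import time
import urllib.request


def wait_for_demo(url="http://localhost:3000/", timeout_s=600.0, interval_s=5.0):
    """Poll `url` until it returns HTTP 200 or `timeout_s` seconds elapse.

    Hypothetical helper: the name and default values are illustrative
    assumptions, not something shipped with the demo.
    Returns True if the page answered in time, False otherwise.
    """
    # Ignore any system-wide proxy settings, since the demo runs on localhost.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            with opener.open(url, timeout=5) as response:
                if response.status == 200:
                    return True
        except (OSError, http.client.HTTPException):
            # Connection refused, timeout, or HTTP error: server not ready yet.
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

Calling `wait_for_demo()` from a second terminal on the host, after starting `gradio_demo.py` in the container, blocks until the page is reachable or ten minutes have passed, whichever comes first.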
This code implementation is developed based on the following repositories.
- ML-GSAI/SMDM (at commit 1df2e12), licensed under the Apache-2.0 license.
- jzhang38/TinyLlama (at commit bf12224), licensed under the Apache-2.0 license.
- NVIDIA/Megatron-LM (at commit 636179d), licensed under the Apache-2.0 license.
- wmn-231314/diffusion-data-constraint (at commit 61002b2), licensed under the Apache-2.0 license.
Further changes based on the code in this folder are licensed under the Apache-2.0 license.
If you find this code implementation useful, please consider citing our papers.
@article{chao2026mdmprimev2,
title = {{MDM-Prime-v2: Binary Encoding and Index Shuffling Enable Scaling of Diffusion Language Models}},
author = {Chen-Hao Chao and Wei-Fang Sun and Junwei Quan and Chun-Yi Lee and Rahul G. Krishnan},
year = {2026},
}
@article{chao2026dependency,
title = {{Dependency Breaks Validity of Loss Functions in Masked Diffusion Models}},
author = {Chao, Chen-Hao and Xu, Minkai and Geffner, Tomas and Vahdat, Arash and Krishnan, Rahul G.},
journal = {chen-hao-chao.github.io},
year = {2026}
}
@inproceedings{chao2025mdmprime,
title = {{Beyond Masked and Unmasked: Discrete Diffusion Models via Partial Masking}},
author = {Chen-Hao Chao and Wei-Fang Sun and Hanwen Liang and Chun-Yi Lee and Rahul G. Krishnan},
booktitle = {Proceedings of the Conference on Neural Information Processing Systems (NeurIPS)},
year = {2025},
}
