Marco-LLM: Towards Multilingual and Multiculture Large Language Models

⭐Alibaba International Digital Commerce⭐

GitHub 🤗 Model 📝 Marco-MoE Paper 📝 Marco-Bench-MIF Paper

Marco-LLM is a research initiative from Alibaba International Digital Commerce dedicated to building multilingual and multicultural large language models. Our work spans efficient multilingual model architectures and rigorous evaluation benchmarks, with the goal of delivering strong performance across diverse languages and cultures — especially for underserved and low-resource communities.

🔥 News

[2026.4] 🔥 We released DetectRL-X, a comprehensive multilingual benchmark for LLM-generated text detection, covering 8 languages, 6 domains, and 4 commercial LLMs across 8 evaluation dimensions. The paper has been accepted by ACL 2026.
[2026.4] 🔥 We released CulturAll — a comprehensive benchmark for evaluating LLMs' multilingual and multicultural competence on grounded tasks, covering 14 languages, 51 regions, and 16 topics with 2,610 samples.
[2026.4] 🔥 We released Marco-MoE — a family of compact, highly sparse multilingual Mixture-of-Expert language models. Marco-MoE achieves state-of-the-art performance-to-compute ratios across both English and multilingual benchmarks, covering 29 to 64 languages while activating only 5-7.5% of total parameters per token. Models, data, and training recipes are fully open-sourced.
[2025.5] 🔥 The paper Marco-Bench-MIF has been accepted by ACL 2025 — the first deeply localized multilingual instruction-following benchmark across 30 languages, revealing that machine translation underestimates model performance by 7-22%.

Marco-MoE: Open Multilingual MoE LLMs with Efficient Upcycling

📄 Full Details

Marco-MoE addresses the "curse of multilinguality" — the challenge that expanding language coverage in fixed-parameter models degrades per-language performance. By upcycling a dense Qwen3-0.6B-Base into fine-grained sparse MoE architectures via a novel Drop-Upcycling method, Marco-MoE achieves superior multilingual performance at a fraction of the training cost. Marco-Instruct variants further surpass models with 3-14x more activated parameters through cascaded on-policy distillation.

Key Highlights

First Sparse Multilingual Upcycling: The first work to leverage MoE upcycling specifically for multilingual performance in compact model sizes.
Fine-Grained Expert Specialization: Sub-matrix splitting initializes hundreds of fine-grained experts, combined with Drop-Upcycling to promote expert diversification — unlike conventional coarse-grained FFN replication.
Full Transparency: Complete pre-training datasets, data synthesis pipelines, and the four-stage training curriculum (5.1T tokens) are fully disclosed and open-sourced.
Superior Efficiency: Marco-Mini-Base (0.86B activated / 17.3B total) matches or outperforms Qwen3-4B-Base (4B activated) while using 5.5x fewer training FLOPs.
Strong Instruct Models: Marco-Mini-Instruct achieves 75.5 avg (English) and 71.0 avg (cultural/regional), surpassing Qwen3-4B-Instruct and models with 3-14x more activated parameters.

Model Release

Base Models:

Model	Total Params	Active Params	Active Ratio	Languages	HuggingFace
Marco-Nano-Base	8B	0.6B	7.5%	29	🤗 AIDC-AI/Marco-Nano-Base
Marco-Mini-Base	17.3B	0.86B	5%	29	🤗 AIDC-AI/Marco-Mini-Base
Marco-Mini-Global-Base	17.3B	0.86B	5%	64	🤗 AIDC-AI/Marco-Mini-Global-Base

All models are upcycled from Qwen3-0.6B-Base.

Instruct Models:

Model	Total Params	Active Params	Languages	HuggingFace
Marco-Nano-Instruct	8B	0.6B	29	🤗 AIDC-AI/Marco-Nano-Instruct
Marco-Mini-Instruct	17.3B	0.86B	29	🤗 AIDC-AI/Marco-Mini-Instruct

Performance

Left: Marco-MoE establishes a new Pareto frontier for multilingual performance vs. training compute. Right: Marco-MoE excels in both English and multilingual capabilities simultaneously.

Marco-Instruct models achieve strong performance that surpasses models with significantly more activated parameters.

Base Models (Marco-Mini-Base vs. Qwen3-4B-Base):

Category	Benchmarks	Marco-Mini-Base	Qwen3-4B-Base	Delta
English	15 tasks (MMLU, BBH, GSM8K, ...)	63.7	63.3	+0.4
Multilingual General	11 tasks (GlobalMMLU, MGSM, FLORES, ...)	50.9	48.3	+2.6
Cultural & Regional	11 tasks (INCLUDE, TurkishMMLU, ...)	65.0	65.6	-0.6

Marco-Mini-Base uses 5.5x fewer FLOPs than Qwen3-4B-Base (1.56 vs 8.64 x 10²³) and activates only 0.86B of 17.3B total parameters.

Instruct Models (Marco-Mini-Instruct vs. Qwen3-4B-Instruct):

Category	Benchmarks	Marco-Mini-Instruct	Qwen3-4B-Instruct	Delta
English	7 tasks (MMLU, MATH, GSM8K, ...)	75.5	73.3	+2.2
Multilingual General	10 tasks (GlobalMMLU, MGSM, ...)	50.8	47.9	+2.9
Cultural & Regional	11 tasks (INCLUDE, TurkishMMLU, ...)	71.0	69.1	+1.9

Marco-Mini-Instruct surpasses models with 3-14x more activated parameters, including LFM2-24B-A2B-Instruct and Gemma3-12B-Instruct.

Marco-MoE demonstrates the largest gains in West Asia and South Asia, and in low-resource languages where capacity bottlenecks are most acute.

Scaling to 64 Languages: Marco-Mini-Global extends to 64 languages (adding 35 new languages) while preserving English proficiency (63.6 avg) and increasing the multilingual advantage over Qwen3-4B from 2.6% to 3.6%.

Marco-Bench-MIF: Multilingual Instruction-Following Benchmark

📄 Full Details | 📝 Paper | 🤗 Dataset | ACL 2025

Marco-Bench-MIF is the first deeply localized multilingual benchmark for evaluating instruction-following capabilities across 30 languages spanning 6 language families. Unlike benchmarks relying on machine translation, it implements fine-grained cultural adaptations — revealing that machine-translated evaluations underestimate model performance by 7-22%.

Key Features:

30 languages across 6 families, from high-resource (English, Chinese, German) to low-resource (Yoruba, Nepali)
Deep cultural localization: lexical replacement, theme transformation, and pragmatic reconstruction
541 instruction-response pairs covering diverse constraint types
Evaluated 20+ LLMs: 70B+ models outperform 8B by 45-60%; 25-35% gap between high/low-resource languages

# Access the dataset
https://huggingface.co/datasets/AIDC-AI/Marco-Bench-MIF

CulturAll: Benchmarking Multilingual and Multicultural Competence of LLMs

📄 Full Details

CulturAll is a comprehensive and challenging benchmark to assess LLMs' multilingual and multicultural competence on grounded tasks. It contains 2,610 samples in 14 languages across 51 regions and 16 topics. The best LLM achieves only 44.48% accuracy, underscoring substantial room for improvement.

DetectRL-X: Towards Reliable Multilingual and Real-World LLM-Generated Text Detection

📄 Full Details | 📝 Paper | 🤗 Data (Coming Soon) | ACL 2026

DetectRL-X is the most large-scale and challenging multilingual benchmark for LLM-generated text (LGT) detection, containing 3.46 million samples spanning 8 languages across 5 language families, 6 domains, 4 commercial LLMs, 8 attack strategies, 4 text-length granularities, and 3 refinement operations. It extends the traditional Binary classification task (HWT vs. LGT) to a Ternary setting that also identifies human-written & LLM-refined text (HLT), better reflecting real-world human-LLM collaboration. The benchmark provides 8 evaluation dimensions comparing 12 representative detectors, revealing the strengths and limitations of current state-of-the-art methods in multilingual, real-world scenarios.

Key Features:

8 languages across 5 language families: English, German, Spanish, French, Portuguese, Russian, Arabic, Chinese
6 domains manually curated: Academic, News, Novel, SEO, Wiki, WebText
4 commercial LLMs: Deepseek-V3, Gemini-2.5-flash, GPT-4o, Qwen-Max
Ternary classification: Distinguishes HWT, HLT (Human-written & LLM-refined), and LGT
3 AI-assisted operations: polishing, expanding, and condensing
8 attack strategies: 4 paraphrase + 4 perturbation attacks across all languages
4 text-length granularities: 64, 128, 256, and 512 tokens
12 detectors benchmarked: 9 statistical + 3 neural-based methods
8 evaluation dimensions: In-Distribution, Cross-Domain, Cross-Generator, Cross-Language, Cross-Paraphrase, Cross-Perturbation, Cross-Length, Cross-Operation

👨🏻‍💻 Acknowledgement

Special thanks to all contributors, annotators, and translators. This project is supported by Alibaba International Digital Commerce Group.

Citation

If you find our work useful, please cite the relevant papers:

Marco-MoE:

@article{marco-moe,
  title={Marco-MoE: Open Multilingual Mixture-of-Expert Language Models with Efficient Upcycling},
  author={Fan Jiang, Yu Zhao, Chenyang Lyu, Tianqi Shi, YiChao Du, Feihu Jiang, Longyue Wang and Weihua Luo},
  year={2026}
}

Marco-Bench-MIF:

@inproceedings{zeng-etal-2025-marco,
  title     = "Marco-Bench-{MIF}: On Multilingual Instruction-Following Capability of Large Language",
  author    = "Zeng, Bo and Lyu, Chenyang and Liu, Sinuo and Zeng, Mingyan and Wu, Minghao and Ni, Xuanfan and Shi, Tianqi and Zhao, Yu and Liu, Yefeng and Zhu, Chenyu and Li, Ruizhe and Geng, Jiahui and Li, Qing and Tong, Yu and Wang, Longyue and Luo, Weihua and Zhang, Kaifu",
  editor    = "Che, Wanxiang and Nabende, Joyce and Shutova, Ekaterina and Pilehvar, Mohammad Taher",
  booktitle = "Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
  month     = jul,
  year      = "2025",
  address   = "Vienna, Austria",
  publisher = "Association for Computational Linguistics",
  url       = "https://aclanthology.org/2025.acl-long.1172/",
  doi       = "10.18653/v1/2025.acl-long.1172",
  pages     = "24058--24072",
  ISBN      = "979-8-89176-251-0"
}

CulturAll:

@misc{lin2026culturallbenchmarkingmultilingualmulticultural,
      title={CulturALL: Benchmarking Multilingual and Multicultural Competence of LLMs on Grounded Tasks}, 
      author={Peiqin Lin and Chenyang Lyu and Wenjiang Luo and Haotian Ye and Md Mehrab Hossain and Chunlan Ma and Shaoxiong Ji and Younes Samih and Bo Zeng and Fan Jiang and Yuanbin Cao and Dilda Duisenbek and Adrian Neo Sau Xun and Daria Pozdniakova and Liubou Misevich and Nevena Marinković and Ngoc Gia Linh Nguyen and Thi Khanh Linh Do and Sarakmatak Sophy and Baotian Hu and Guanhua Chen and Gongbo Tang and Alham Fikri Aji and Longyue Wang and Weihua Luo},
      year={2026},
      eprint={2604.19262},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2604.19262}, 
}

DetectRL-X:

@inproceedings{detectrl-x,
  title     = {DetectRL-X: Towards Reliable Multilingual and Real-World LLM-Generated Text Detection},
  author    = {Junchao Wu and Yefeng Liu and Chenyu Zhu and Hao Zhang and Zeyu Wu and Tianqi Shi and Yichao Du and Longyue Wang and Weihua Luo and Jinsong Su and Derek F. Wong},
  booktitle = {Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
  year      = {2026},
  url       = {https://arxiv.org/abs/2605.15518},
}

Name		Name	Last commit message	Last commit date
Latest commit History 19 Commits
CulturAll		CulturAll
DetectRL-X		DetectRL-X
Marco-Bench-MIF		Marco-Bench-MIF
Marco-MoE		Marco-MoE
image		image
.DS_Store		.DS_Store
NOTICE		NOTICE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Marco-LLM: Towards Multilingual and Multiculture Large Language Models

🔥 News

Marco-MoE: Open Multilingual MoE LLMs with Efficient Upcycling

Key Highlights

Model Release

Performance

Marco-Bench-MIF: Multilingual Instruction-Following Benchmark

CulturAll: Benchmarking Multilingual and Multicultural Competence of LLMs

DetectRL-X: Towards Reliable Multilingual and Real-World LLM-Generated Text Detection

👨🏻‍💻 Acknowledgement

Citation

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Uh oh!

Folders and files

Latest commit

History

Repository files navigation

Marco-LLM: Towards Multilingual and Multiculture Large Language Models

🔥 News

Marco-MoE: Open Multilingual MoE LLMs with Efficient Upcycling

Key Highlights

Model Release

Performance

Marco-Bench-MIF: Multilingual Instruction-Following Benchmark

CulturAll: Benchmarking Multilingual and Multicultural Competence of LLMs

DetectRL-X: Towards Reliable Multilingual and Real-World LLM-Generated Text Detection

👨🏻‍💻 Acknowledgement

Citation

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages