A benchmark that challenges language models to code solutions for scientific problems
A quick overview of high-performance convolutional neural network (CNN) inference engines on mobile devices.
Agentic AI research papers, benchmarks, frameworks, and tools curated across 24 domains.
AI coding models, agents, CLIs, IDEs, AI app builders, open source tooling, benchmarks
Benchmark evaluating LLMs on their ability to create and resist disinformation. Includes comprehensive testing across major models (Claude, GPT-4, Gemini, Llama, etc.) with standardized evaluation metrics.
[NeurIPS 2025] AGI-Elo: How Far Are We From Mastering A Task?
A curated list of evaluation tools, benchmark datasets, leaderboards, frameworks, and resources for assessing model performance.
AI Benchmark Knowledge Base: a comprehensive collection of the benchmark question sets that major AI companies use to test model performance
Qualitative benchmark suite for evaluating AI coding agents and orchestration paradigms on realistic, complex development tasks
A measurement framework for autonomous AI agency across 7 dimensions
Pluralistic human evaluation infrastructure for AI in production
119 AI models × 55 benchmarks with per-score freshness dates, auto-updated pricing, task routing. Every score has a date and source URL. Daily CI.
Benchmarks AI conversations by estimated biochemical impact. Maps conversational outputs to neurochemical response profiles (oxytocin, dopamine, serotonin, endorphins, cortisol) via LLM analysis for physiological evaluation metrics.
A curated collection of AI model benchmarks and leaderboards — covering general rankings, coding, agents, reasoning, embeddings, and more
LLM spatial reasoning evaluation suite with gradient-based scoring. CLI benchmark runner with leaderboard support, OpenRouter multi-model integration, and standardized input/output formats.
Independent benchmarks of AI capability on legal analysis tasks.
Live index of LLM evaluation tools and benchmarks, refreshed every 15 minutes from GitHub
Live, open-source benchmark for comparing AI coding agents on real GitHub issues
LlamaEval is a rapid prototype developed during a hackathon to provide a user-friendly dashboard for evaluating and comparing Llama models using the TogetherAI API.
Automated AI benchmark & LLM arena tracker. Fetches data from top platforms, normalizes scores using Llama 3 (8B), updates raw JSON via GitHub Actions 8x a day, and serves a live Vercel dashboard.