ai-benchmarks

Here are 28 public repositories matching this topic...

scicode-bench / SciCode

A benchmark that challenges language models to code solutions for scientific problems

benchmark ai ai-benchmarks llm

Updated Jun 1, 2026
Python

CAS-CLab / CNN-Inference-Engine-Quick-View

A quick overview of high-performance convolutional neural network (CNN) inference engines on mobile devices.

cnn inference-engine cnns inference-engines cnn-inference-engine ai-benchmarks speed-benchmarks

Updated Jun 13, 2022

mahmoudrabie / agentic-ai

Agentic AI research papers, benchmarks, frameworks, and tools curated across 24 domains.

Updated Jun 6, 2026

joylarkin / AI-Coding-Landscape

AI coding models, agents, CLIs, IDEs, AI app builders, open source tooling, benchmarks

ai-agents ai-benchmarks ai-apps ai-ide ai-assisted-coding ai-coding-tools ai-coding ai-app-builder vibe-coding vibecoding ai-coding-assistant coding-llm ai-coding-agents ai-coding-landscape ai-coding-2025 ai-coding-models coding-models ai-leaderboards ai-coding-2026

Updated May 25, 2026

lechmazur / deception

Benchmark evaluating LLMs on their ability to create and resist disinformation. Includes comprehensive testing across major models (Claude, GPT-4, Gemini, Llama, etc.) with standardized evaluation metrics.

nlp machine-learning gemini llama language-model model-evaluation ai-safety mistral claude disinformation ai-security ai-benchmarks ai-evaluation llm llm-benchmarking gpt4o

Updated Mar 20, 2025

SS47816 / AGI-Elo

[NeurIPS 2025] AGI-Elo: How Far Are We From Mastering A Task?

benchmark leaderboard agi imagenet coco artificial-general-intelligence datasets evaluation-metrics elo-rating rating-system evaluation-framework sota ai-benchmarks waymo-open-dataset mmlu vision-language-action ai-evaluation-framework livecodebench navsim

Updated Oct 28, 2025
Python

brandonhimpfen / awesome-ai-benchmarks-evaluation

A curated list of evaluation tools, benchmark datasets, leaderboards, frameworks, and resources for assessing model performance.

awesome ai awesome-list awesome-lists ai-benchmarks ai-evaluation ai-benchmark

Updated May 11, 2026
Python

wd041216-bit / ai-benchmark-kb

AI Benchmark knowledge base: a comprehensive collection of the benchmark question sets that major AI companies use to test model performance

benchmark knowledge-base model-evaluation reasoning multimodal ai-benchmarks instruction-following llm long-context safety-evaluation ai-performance math-reasoning coding-benchmark benchmark-collection eval-frameworks

Updated Apr 16, 2026

pyros-projects / agent-comparison

Qualitative benchmark suite for evaluating AI coding agents and orchestration paradigms on realistic, complex development tasks

orchestration ai-agents ai-benchmarks qualitative-evaluation llm-agents coding-agents agentic-workflows agent-evaluation agent-testing ai-coding-assistants agent-comparison development-tasks

Updated Nov 25, 2025
Python

CaptainASIC / autonomous-agency-scale

A measurement framework for autonomous AI agency across 7 dimensions

agi artificial-intelligence autonomous-agents ai-safety evaluation-framework ai-benchmarks agent-evaluation ai-measurement

Updated May 30, 2026
Python

humanjudge / grandjury

Pluralistic human evaluation infrastructure for AI in production

python-sdk model-evaluation ai-benchmarks ai-evaluation human-feedback llm-evaluation mcp-server llm-as-judge-alternative

Updated May 30, 2026
Jupyter Notebook

LARIkoz / ai-model-benchmarks

119 AI models × 55 benchmarks with per-score freshness dates, auto-updated pricing, task routing. Every score has a date and source URL. Daily CI.

embeddings gemini model-selection benchmarks awesome-list gpt ai-agents claude model-comparison ai-pipeline ai-benchmarks ai-models llm openrouter llm-pricing llm-leaderboard llm-routing model-routing

Updated Jun 7, 2026
HTML

waifuai / biochem-framework

Benchmarks AI conversations by estimated biochemical impact. Maps conversational outputs to neurochemical response profiles (oxytocin, dopamine, serotonin, endorphins, cortisol) via LLM analysis for physiological evaluation metrics.

python biochemistry ai-benchmarks bonding-curve llm

Updated Apr 21, 2026
Python

tatn / awesome-ai-benchmarks

A curated collection of AI model benchmarks and leaderboards — covering general rankings, coding, agents, reasoning, embeddings, and more

benchmark machine-learning awesome ai leaderboard embeddings speech-recognition awesome-list ai-benchmarks llm coding-agents llm-leaderboard

Updated May 27, 2026

waifuai / ai-benchmarks

LLM spatial reasoning evaluation suite with gradient-based scoring. CLI benchmark runner with leaderboard support, OpenRouter multi-model integration, and standardized input/output formats.

python reasoning ai-benchmarks llm

Updated Apr 21, 2026
Python

overfit-dicta / legal-ai-benchmarks

Independent benchmarks of AI capability on legal analysis tasks.

legal-ai ai-benchmarks

Updated Jun 4, 2026

linny006 / llm-eval-tracker

Live index of LLM evaluation tools and benchmarks, refreshed every 15 minutes from GitHub

machine-learning awesome-list model-evaluation evaluation-framework live-data ai-research github-actions ai-benchmarks ai-tools auto-updated ai-development ai-evaluation ml-evaluation llm-tools llm-evaluation llm-testing llm-benchmark awesome-eval

Updated Jun 8, 2026
Python

linny006 / agent-eval-harness

Live, open-source benchmark for comparing AI coding agents on real GitHub issues

Updated Jun 8, 2026
Python

Paraskevi-KIvroglou / Hackathon-LlamaEval

LlamaEval is a rapid prototype developed during a hackathon to provide a user-friendly dashboard for evaluating and comparing Llama models using the TogetherAI API.

evaluation-metrics streamlit ai-benchmarks llms togetherai llms-benchmarking llama3

Updated Nov 10, 2024
Python

egecanakincioglu / ai-models-arena-tracker

Automated AI benchmark & LLM arena tracker. Fetches data from top platforms, normalizes scores using Llama 3 (8B), updates raw JSON via GitHub Actions 8x a day, and serves a live Vercel dashboard.

python automation dashboard leaderboard web-scraping language-models raw-data data-pipeline artifical-intelligense github-actions open-source-data ai-benchmarks vercel llm-evaluation ai-insights llama3 llm-arena

Updated May 22, 2026
Python
