HumanJudge

Pluralistic human evaluation infrastructure for AI in production.

What we're building

We're building trust infrastructure for next-generation AI — autonomous agents, production systems, decision-support tools — where ongoing human evaluation isn't a phase but a continuous layer. Diverse human judgment, captured live, in context.

AI evaluation is usually reported as a single number. We capture it instead as a continuous data stream: pluralistic, multi-reviewer, multi-context, and drawn from real production traffic.
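As a rough illustration of that contrast, the sketch below compares one averaged score with a per-context view of the same reviews. The `Review` record, its field names, the context labels, and both helper functions are hypothetical and are not taken from the grandjury SDK.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


# Hypothetical record shape, for illustration only (not the grandjury schema).
@dataclass(frozen=True)
class Review:
    reviewer_id: str
    context: str   # where the AI output was shown, e.g. a product surface
    score: float   # 0.0 (bad) to 1.0 (good)


def single_number(reviews):
    """Collapse every review into one average score."""
    return mean(r.score for r in reviews)


def breakdown(reviews):
    """Keep the per-context averages that a single number hides."""
    by_context = defaultdict(list)
    for r in reviews:
        by_context[r.context].append(r.score)
    return {ctx: mean(scores) for ctx, scores in by_context.items()}


reviews = [
    Review("r1", "support-chat", 1.0),
    Review("r2", "support-chat", 1.0),
    Review("r3", "medical-triage", 0.0),
    Review("r4", "medical-triage", 0.5),
]

print(single_number(reviews))  # 0.625
print(breakdown(reviews))      # {'support-chat': 1.0, 'medical-triage': 0.25}
```

Here the overall average looks acceptable while one context is clearly failing, which is the kind of detail a single number loses.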

Presenting at the Berkeley RDI Agentic AI Summit, Aug 1–2, 2026.

The R&D community

An open community working in public around six research streams. Anyone curious is welcome — researchers, engineers, designers, contributors of any background.

Stream	Topic
A — Pluralistic data for model training	How to represent multi-reviewer, open-vocabulary feedback as training data; aggregation rules that respect safety constraints
B — Resource curation	A public, actively maintained index of AI safety + production-evaluation tools, frameworks, and papers
C — Model routing	Quality-routing systems trained on pluralistic, multi-reviewer, domain-tagged production feedback
D — Real-time guardrails mechanism	Live production signals driving immediate guardrail updates, apologies, human handoff — continuous red-teaming folded into the pipeline
E — Signal representation & visualization	Richer reviewer-submitted signal + live, multi-dimensional, third-party-attested representations of how AI is actually behaving
F — Platform integration	Live pluralistic evaluation surfaced inside the tools where developers and workflows already touch AI output

Detailed descriptions in the main repo →
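To give a feel for the kind of question Stream A asks, here is a minimal sketch of one possible aggregation rule: open-vocabulary tags are ranked by how many reviewers used them, but a safety tag from any single reviewer overrides the majority. The veto rule, the `SAFETY_TAGS` vocabulary, and the `aggregate` function are assumptions made for illustration, not the project's chosen method.

```python
from collections import Counter

# Assumed safety vocabulary, for illustration only.
SAFETY_TAGS = {"unsafe", "harmful"}


def aggregate(tags_by_reviewer):
    """Aggregate free-form tags from several reviewers.

    Returns ("blocked", safety_tags) if any reviewer raised a safety tag,
    otherwise ("ok", tags ranked by how many reviewers used them).
    """
    # Count each tag at most once per reviewer.
    counts = Counter(
        tag for tags in tags_by_reviewer.values() for tag in set(tags)
    )
    raised = sorted(SAFETY_TAGS.intersection(counts))
    if raised:
        return "blocked", raised
    return "ok", [tag for tag, _ in counts.most_common()]


print(aggregate({"r1": ["helpful", "verbose"], "r2": ["helpful"]}))
print(aggregate({"r1": ["helpful"], "r2": ["helpful"], "r3": ["unsafe"]}))
```

A single-reviewer veto is deliberately conservative; a threshold or a weighted rule would be an equally valid starting point.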

Apply by opening a PR

No CV, no application form. Open a small PR on humanjudge/grandjury that includes a challenge result and a note on which streams interest you. We review each PR personally within about 3 business days. When your PR is merged, we'll notify you by email with instructions for joining the Discord.

See the apply walkthrough →

Active repos

Repo	What it is
grandjury	Python SDK + the R&D community front door
grandjury-js	JavaScript / TypeScript SDK
stt-inference-evaluation-pipeline	Speech-to-text benchmark pipeline for archival audio

Stream-specific spin-outs (grandjury-train, grandjury-router, etc.) appear here as contributors ship code.

What contributors come away with

Co-authored papers on arXiv
Open datasets on Hugging Face (CC-BY-SA 4.0)
Open-source code in this org (Apache 2.0)
Conference posters
Direct working relationship with a small, focused team

Licensing (proposed defaults)

Code: Apache 2.0
Datasets: CC-BY-SA 4.0 on Hugging Face
Papers: CC-BY 4.0

Contact

humanjudge.com
