HumanJudge

Pluralistic human evaluation infrastructure for AI in production.

Berkeley RDI humanjudge.com


What we're building

We're building trust infrastructure for next-generation AI — autonomous agents, production systems, decision-support tools — where ongoing human evaluation isn't a phase but a continuous layer. Diverse human judgment, captured live, in context.

AI evaluation is usually a single number. We capture it as a continuous datastream instead — pluralistic, multi-reviewer, multi-context, from real production traffic.
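To make the contrast concrete, here is a minimal sketch in plain Python. It assumes a hypothetical `Verdict` record; the field names, the 0-to-1 score scale, and both helper functions are illustrative assumptions, not the GrandJury SDK's schema or API. It shows how one averaged number can hide a split that a per-context view keeps visible.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Verdict:
    """One reviewer's judgment of one AI output (hypothetical schema)."""
    output_id: str
    reviewer_id: str
    context: str      # assumed domain tag, e.g. "medical" or "casual"
    score: float      # assumed scale: 0.0 (bad) to 1.0 (good)
    note: str = ""    # open-vocabulary feedback


def single_number(verdicts):
    """The conventional view: collapse every verdict into one average."""
    return mean(v.score for v in verdicts)


def by_context(verdicts):
    """A pluralistic view: per-context mean plus distinct reviewer count."""
    groups = defaultdict(list)
    for v in verdicts:
        groups[v.context].append(v)
    return {
        ctx: {
            "mean": mean(v.score for v in vs),
            "reviewers": len({v.reviewer_id for v in vs}),
        }
        for ctx, vs in groups.items()
    }


# Illustrative data only, not real production feedback.
stream = [
    Verdict("out-1", "r1", "medical", 0.0, "unsafe dosage advice"),
    Verdict("out-1", "r2", "medical", 0.0),
    Verdict("out-2", "r1", "casual", 1.0),
    Verdict("out-2", "r3", "casual", 1.0, "friendly and accurate"),
]

print(single_number(stream))
print(by_context(stream))
```

In this toy stream the overall average sits in the middle of the scale, while the per-context summary shows that every medical verdict was negative and every casual one positive.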

Presenting at the Berkeley RDI Agentic AI Summit, Aug 1–2, 2026.

The R&D community

An open community working in public around six research streams. Anyone curious is welcome — researchers, engineers, designers, contributors of any background.

Stream | Topic
A: Pluralistic data for model training | How to represent multi-reviewer, open-vocabulary feedback as training data; aggregation rules that respect safety constraints
B: Resource curation | A public, actively maintained index of AI safety + production-evaluation tools, frameworks, and papers
C: Model routing | Quality-routing systems trained on pluralistic, multi-reviewer, domain-tagged production feedback
D: Real-time guardrails mechanism | Live production signals driving immediate guardrail updates, apologies, human handoff; continuous red-teaming folded into the pipeline
E: Signal representation & visualization | Richer reviewer-submitted signal + live, multi-dimensional, third-party-attested representations of how AI is actually behaving
F: Platform integration | Live pluralistic evaluation surfaced inside the tools where developers and workflows already touch AI output

Detailed descriptions in the main repo →
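As one possible reading of Stream A's "aggregation rules that respect safety constraints", the sketch below takes a majority vote over reviewer labels but lets safety flags override it. The `aggregate` function, the `"unsafe"` label, and the `veto_threshold` parameter are hypothetical illustrations, not an algorithm the project has published.

```python
from collections import Counter


def aggregate(labels, safety_flags, veto_threshold=1):
    """Return the majority label unless enough reviewers flag a safety issue.

    labels: one free-text label per reviewer.
    safety_flags: one bool per reviewer (True = safety concern raised).
    veto_threshold: hypothetical knob; number of flags that overrides the vote.
    """
    if len(labels) != len(safety_flags):
        raise ValueError("labels and safety_flags must align per reviewer")
    if not labels:
        raise ValueError("need at least one review")
    if sum(safety_flags) >= veto_threshold:
        return "unsafe"
    # most_common(1) yields the top label; ties go to the label seen first.
    return Counter(labels).most_common(1)[0][0]


print(aggregate(["helpful", "helpful", "off-topic"], [False, False, False]))
print(aggregate(["helpful", "helpful", "helpful"], [False, True, False]))
```

The design choice illustrated here is that a safety concern is treated as a constraint rather than as one more vote, so a minority of reviewers can block an otherwise popular label.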

Apply by opening a PR

No CV, no application form. Open a small PR on humanjudge/grandjury that includes a challenge result and notes which streams interest you. PRs are reviewed personally within about 3 business days. We'll notify you by email when your PR is merged, with instructions for joining the Discord.

See the apply walkthrough →

Active repos

Repo | What it is
grandjury | Python SDK + the R&D community front door
grandjury-js | JavaScript / TypeScript SDK
stt-inference-evaluation-pipeline | Speech-to-text benchmark pipeline for archival audio

Stream-specific spin-outs (grandjury-train, grandjury-router, etc.) appear here as contributors ship code.

What contributors come away with

  • Co-authored papers on arXiv
  • Open datasets on Hugging Face (CC-BY-SA 4.0)
  • Open-source code in this org (Apache 2.0)
  • Conference posters
  • Direct working relationship with a small, focused team

Licensing (proposed defaults)

  • Code: Apache 2.0
  • Datasets: CC-BY-SA 4.0 on Hugging Face
  • Papers: CC-BY 4.0

Contact

humanjudge.com

Popular repositories

  1. grandjury (Jupyter Notebook)

    Pluralistic human evaluation infrastructure for AI in production

  2. stt-inference-evaluation-pipeline (Python)

    Auto inference + evaluation pipeline for benchmarking commercial and open-source speech-to-text models on archival audio

  3. grandjury-js (TypeScript)

    JavaScript/TypeScript SDK for the GrandJury human evaluation platform

  4. .github

    HumanJudge organization profile
