We're building trust infrastructure for next-generation AI — autonomous agents, production systems, decision-support tools — where ongoing human evaluation isn't a phase but a continuous layer. Diverse human judgment, captured live, in context.
AI evaluation is usually reported as a single number. We capture it instead as a continuous data stream: pluralistic, multi-reviewer, multi-context, and drawn from real production traffic.
Presenting at the Berkeley RDI Agentic AI Summit, Aug 1–2, 2026.
An open community working in public around six research streams. Anyone curious is welcome — researchers, engineers, designers, contributors of any background.
| Stream | Topic |
|---|---|
| A — Pluralistic data for model training | How to represent multi-reviewer, open-vocabulary feedback as training data; aggregation rules that respect safety constraints |
| B — Resource curation | A public, actively maintained index of AI safety + production-evaluation tools, frameworks, and papers |
| C — Model routing | Quality-routing systems trained on pluralistic, multi-reviewer, domain-tagged production feedback |
| D — Real-time guardrails mechanism | Live production signals driving immediate guardrail updates, apologies, human handoff — continuous red-teaming folded into the pipeline |
| E — Signal representation & visualization | Richer reviewer-submitted signal + live, multi-dimensional, third-party-attested representations of how AI is actually behaving |
| F — Platform integration | Live pluralistic evaluation surfaced inside the tools where developers and workflows already touch AI output |
Detailed descriptions are in the main repo.
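To make Stream A's question a little more concrete, here is a minimal sketch of one way multi-reviewer, open-vocabulary feedback could be represented and aggregated. It is an illustration only, not the grandjury SDK or the project's actual schema: the `Review` record, its fields, the `aggregate` function, and the `min_share` threshold are all hypothetical names chosen for this example. The one design assumption it encodes is that a safety flag from any single reviewer is never averaged away by majority agreement.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Review:
    """One reviewer's feedback on a single AI output (hypothetical schema)."""
    reviewer_id: str
    context: str                 # domain tag, e.g. "medical" or "legal"
    labels: tuple[str, ...]      # open-vocabulary labels chosen by the reviewer
    safety_flag: bool = False    # reviewer marked the output as unsafe


def aggregate(reviews: list[Review], min_share: float = 0.5) -> dict:
    """Keep labels used by at least `min_share` of reviewers.

    Assumed safety rule: one safety flag is enough to mark the item unsafe,
    regardless of how many reviewers did not flag it.
    """
    n = len(reviews)
    if n == 0:
        return {"labels": [], "unsafe": False, "n_reviewers": 0}
    # Count each label at most once per reviewer.
    counts = Counter(label for r in reviews for label in set(r.labels))
    kept = sorted(label for label, c in counts.items() if c / n >= min_share)
    return {
        "labels": kept,
        "unsafe": any(r.safety_flag for r in reviews),
        "n_reviewers": n,
    }


reviews = [
    Review("r1", "medical", ("helpful", "too long")),
    Review("r2", "medical", ("helpful",)),
    Review("r3", "legal", ("misleading",), safety_flag=True),
]
result = aggregate(reviews)
print(result)
```

In this example two of three reviewers agree on "helpful", so it survives aggregation, while the single safety flag still marks the item unsafe. A real aggregation rule would likely also need to weigh reviewer context and keep minority labels available for training rather than discarding them.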
No CV, no application form. Open a small PR on humanjudge/grandjury that includes a challenge result and names the streams that interest you. Each PR is reviewed personally within ~3 business days. We'll notify you by email when your PR is merged, with instructions for joining the Discord.
| Repo | What it is |
|---|---|
| grandjury | Python SDK + the R&D community front door |
| grandjury-js | JavaScript / TypeScript SDK |
| stt-inference-evaluation-pipeline | Speech-to-text benchmark pipeline for archival audio |
Stream-specific spin-outs (grandjury-train, grandjury-router, etc.) appear here as contributors ship code.
- Co-authored papers on arXiv
- Open datasets on Hugging Face (CC-BY-SA 4.0)
- Open-source code in this org (Apache 2.0)
- Conference posters
- Direct working relationship with a small, focused team
- Code: Apache 2.0
- Datasets: CC-BY-SA 4.0 on Hugging Face
- Papers: CC-BY 4.0