# LLM-as-a-Judge

> **TL;DR.** LLM-as-a-Judge is an evaluation pattern where a large language model scores or ranks outputs from another AI system, replacing expensive human reviewers with an automated judge that applies a natural-language rubric.

- **Category:** AI / Developer Tools / Evaluation
- **Stage:** established
- **Age:** 1103 days
- **Origin date:** 2023-06-09
- **First detected:** 2026-04-23
- **Canonical URL:** https://earlyterms.com/term/judge
- **Sources:** 8 primary URLs

## Definition

LLM-as-a-Judge is an evaluation pattern in which one large language model (the judge) scores or ranks the outputs of another AI system against a natural-language rubric, standing in for slower and costlier human review. Because the judging is automated, quality assessment scales from the hundreds of annotations a human team can produce per day to millions of judgments.
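The rubric-and-score loop described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the rubric text, the 1 to 5 scale, the JSON reply format, and the names `build_prompt`, `judge`, and `fake_complete` are all assumptions made for the example. The `complete` argument stands in for whatever model client a team actually uses.

```python
import json
from typing import Callable

# Hypothetical rubric; real rubrics are task-specific.
RUBRIC = """Score the ANSWER to the QUESTION from 1 (poor) to 5 (excellent).
Criteria: factual accuracy, relevance, clarity.
Reply with JSON only: {"score": <int>, "reason": "<one sentence>"}"""


def build_prompt(question: str, answer: str, rubric: str = RUBRIC) -> str:
    """Combine the rubric with the item under evaluation."""
    return f"{rubric}\n\nQUESTION:\n{question}\n\nANSWER:\n{answer}\n"


def judge(question: str, answer: str, complete: Callable[[str], str]) -> dict:
    """Ask the judge model for a verdict and validate its reply.

    `complete` is an assumed callable mapping a prompt to the model's text.
    """
    raw = complete(build_prompt(question, answer))
    try:
        verdict = json.loads(raw)
        score = int(verdict["score"])
    except (ValueError, KeyError, TypeError):
        # Judges do not always follow the output format; flag it, don't guess.
        return {"score": None, "reason": "unparseable judge reply"}
    if not 1 <= score <= 5:
        return {"score": None, "reason": "score out of range"}
    return {"score": score, "reason": str(verdict.get("reason", ""))}


def fake_complete(prompt: str) -> str:
    """Canned stand-in for a real model call, so the sketch runs offline."""
    return '{"score": 4, "reason": "Accurate but terse."}'


result = judge("What is HTTP?", "A protocol for the web.", fake_complete)
```

Validating the reply before trusting it matters in practice, since a malformed verdict silently counted as a score would skew aggregate results.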

The pattern was formalized in the June 2023 paper ["Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena"](https://arxiv.org/abs/2306.05685) (Zheng et al., NeurIPS 2023), which showed GPT-4 achieving over 80% agreement with human evaluators, the same level of agreement the authors observed between humans. By 2026 the pattern had migrated from offline benchmarking into production pipelines at Netflix, Brex, DoorDash, and AWS Bedrock.
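An agreement figure of this kind is, at its simplest, the share of comparisons on which the judge and a human pick the same winner. The sketch below shows that plain exact-match calculation; the paper's own protocol (for example, how ties are handled) may differ, and the vote lists are made-up toy data, not results from any study.

```python
from typing import Sequence


def agreement_rate(judge_votes: Sequence[str], human_votes: Sequence[str]) -> float:
    """Fraction of items where the judge and the human chose the same label."""
    if len(judge_votes) != len(human_votes):
        raise ValueError("vote lists must have the same length")
    if not judge_votes:
        raise ValueError("need at least one vote")
    matches = sum(j == h for j, h in zip(judge_votes, human_votes))
    return matches / len(judge_votes)


# Toy pairwise preferences between two candidate answers, "A" and "B".
judge_votes = ["A", "B", "A", "tie", "B"]
human_votes = ["A", "B", "B", "tie", "B"]

rate = agreement_rate(judge_votes, human_votes)  # 4 of 5 match -> 0.8
```

Comparing this number against the agreement rate between two independent human annotators on the same items gives the "as consistent as humans" baseline.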

## Example

Brex's open-source [CrabTrap](https://github.com/brexhq/CrabTrap) (April 2026) deploys an LLM judge as an HTTP proxy: every outbound request an AI agent makes is checked against natural-language security policies before being forwarded or blocked, giving teams an auditable safety layer without per-tool SDK wrappers.
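The gating idea can be illustrated with a small decision function. This sketch is not CrabTrap's code or API: the `OutboundRequest` and `Decision` types, the `gate` function, the prompt wording, the `ALLOW`/`BLOCK` reply format, and the example policy are all hypothetical, and the stub judge replaces a real model call.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class OutboundRequest:
    method: str
    url: str
    body: str = ""


@dataclass
class Decision:
    allowed: bool
    reason: str


def gate(request: OutboundRequest, policies: List[str],
         complete: Callable[[str], str]) -> Decision:
    """Ask a judge model whether the request complies with the policies.

    `complete` is an assumed callable mapping a prompt to the model's text.
    """
    prompt = (
        "You are a security judge. Policies:\n"
        + "\n".join(f"- {p}" for p in policies)
        + f"\n\nRequest: {request.method} {request.url}\nBody: {request.body}\n"
        "Reply ALLOW or BLOCK, then a colon and a short reason."
    )
    reply = complete(prompt).strip()
    verdict, _, reason = reply.partition(":")
    if verdict.strip().upper() == "ALLOW":
        return Decision(True, reason.strip())
    # Fail closed: anything other than an explicit ALLOW is blocked.
    return Decision(False, reason.strip() or "unrecognized judge reply")


def stub_judge(prompt: str) -> str:
    """Canned stand-in for a model: blocks writes, allows everything else."""
    if "Request: POST" in prompt:
        return "BLOCK: policy forbids sending data to external hosts"
    return "ALLOW: read-only request"


POLICIES = ["Agents may read public APIs but must not send data to external hosts."]

read = gate(OutboundRequest("GET", "https://api.example.com/items"), POLICIES, stub_judge)
write = gate(OutboundRequest("POST", "https://files.example.net/upload", "secret"),
             POLICIES, stub_judge)
```

Recording each `Decision` alongside the request is what would make such a layer auditable; failing closed on an unparseable reply is a design choice of this sketch, not something the source specifies.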

## Analogy

Think of it as a code reviewer that reads English rubrics instead of style guides.

## Why it's emerging now

In April 2026 the LLM-judge pattern crossed from benchmarking into production security: Brex open-sourced CrabTrap, an HTTP proxy that gates every outbound agent request using an LLM judge. Around the same time, ICLR 2026 accepted research on "preference leakage", a bias in which a judge favors outputs from models related to it, such as those from the same model family. Together, the two developments put the pattern's promise and its limits in front of AI builders at once.

## Related terms

- *related:* managed-agents
- *related:* agent-harness
- *related:* agent-loop
- *related:* ai-agent-traps
- *related:* deep-research
- *parent:* agentic-ai
- *related:* MT-Bench
- *related:* Chatbot Arena
- *child:* preference leakage
- *related:* reward model
- *parent:* RLHF
- *child:* CrabTrap

## Sources

1. [Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena (NeurIPS 2023)](https://arxiv.org/abs/2306.05685)
2. [CrabTrap: An LLM-as-a-judge HTTP proxy to secure agents in production — Brex Engineering](https://www.brex.com/journal/building-crabtrap-open-source)
3. [CrabTrap GitHub repo (brexhq/CrabTrap)](https://github.com/brexhq/CrabTrap)
4. [Preference Leakage: A Contamination Problem in LLM-as-a-judge (ICLR 2026)](https://arxiv.org/abs/2502.01534)
5. [LLM-as-a-judge on Amazon Bedrock Model Evaluation — AWS Blog](https://aws.amazon.com/blogs/machine-learning/llm-as-a-judge-on-amazon-bedrock-model-evaluation/)
6. [LLM-as-a-Judge: a complete guide — Evidently AI](https://www.evidentlyai.com/llm-guide/llm-as-a-judge)
7. [Creating a LLM-as-a-Judge That Drives Business Results — Hamel Husain](https://hamel.dev/blog/posts/llm-judge/)
8. [A Survey on LLM-as-a-Judge (arXiv 2411.15594)](https://arxiv.org/abs/2411.15594)

---
_Generated by EarlyTerms · https://earlyterms.com/term/judge_