# Value Accuracy

> **TL;DR.** Value Accuracy measures the fraction of JSON leaf values that exactly match ground truth — distinct from JSON pass rate, which only checks schema structure.

- **Category:** AI / Developer Tools / Evaluation
- **Stage:** validating
- **Age:** 49 days
- **Origin date:** 2026-04-28
- **First detected:** 2026-04-30
- **Canonical URL:** https://earlyterms.com/term/value-accuracy
- **Sources:** 5 primary URLs

## Definition

Value Accuracy measures the fraction of JSON leaf values that exactly match ground truth — distinct from JSON pass rate, which only checks schema structure. The gap is large: schema compliance routinely exceeds 84%, while value accuracy peaks at 83% even on frontier models.

Interfaze and JigsawStack formalized the metric in the [SOB paper](https://arxiv.org/abs/2604.25359) on April 28, 2026, benchmarking 21 models across 5,000+ text, image, and audio records. GLM-4.7 leads text at 83.0%; audio accuracy collapses to 23.7% for every model tested — exposing that structured outputs silently hallucinate at production scale.

## Example

On the SOB leaderboard, every model in the 21-model test achieves >84% JSON Pass Rate. Yet Value Accuracy (exact leaf-value match) peaks at 83.0% for text — and model size is not a predictor: open-weight Qwen3.5-35B outscores GPT-5.4 on this metric, exposing that the failure mode is extraction quality, not model scale.

## Analogy

Think of it as the spell-checker that passes "their" when you meant "there" — valid, but wrong.

## Why it's emerging now

LLMs now produce valid JSON at >84% schema compliance — so the bottleneck has shifted to whether values are correct, not whether the schema parses. The SOB paper (April 28, 2026) named this gap 'Value Accuracy' and quantified it across 21 frontier models, surfacing that structured outputs silently hallucinate at rates most teams don't measure.

## Related terms

- *parent:* structured output
- *related:* JSON schema compliance
- *related:* hallucination rate
- *related:* SOB benchmark
- *related:* schema compliance
- *related:* exact match
- *parent:* LLM evaluation
- *alias:* field-level accuracy
- *related:* structured hallucination
- *related:* JSON pass rate

## Sources

1. [SOB paper: The Structured Output Benchmark (arXiv, Apr 2026)](https://arxiv.org/abs/2604.25359)
2. [Interfaze: Introducing SOB — A Multi-Source Structured Output Benchmark](https://interfaze.ai/blog/introducing-structured-output-benchmark)
3. [Hacker News: Show HN — A new benchmark for testing LLMs for deterministic outputs](https://news.ycombinator.com/item?id=47950283)
4. [JigsawStack/sob — GitHub repository](https://github.com/JigsawStack/sob)
5. [Cleanlab: LLM Structured Output Benchmarks are Riddled with Mistakes](https://cleanlab.ai/blog/structured-output-benchmark/)

---
_Generated by EarlyTerms · https://earlyterms.com/term/value-accuracy_
