# ProgramBench

> **TL;DR.** ProgramBench is a software-engineering benchmark that tests whether AI agents can reconstruct a complete, working codebase from only a compiled binary and its documentation — no source code, no decompilation, no internet access allowed during the task.

- **Category:** AI / Developer Tools / Benchmarks
- **Stage:** validating
- **Age:** 42 days
- **Origin date:** 2026-05-05
- **First detected:** 2026-05-07
- **Canonical URL:** https://earlyterms.com/term/programbench
- **Sources:** 7 primary URLs

## Definition

ProgramBench is a software-engineering benchmark in which an AI agent receives only a compiled binary and its documentation and must reconstruct a complete, working codebase that reproduces the binary's behavior. The agent gets no source code, is not allowed to decompile the binary, and has no internet access during the task.

Released on May 5, 2026 by researchers at [Meta FAIR](https://github.com/facebookresearch/ProgramBench), Stanford, and Harvard, the benchmark covers 200 tasks ranging from compact CLI tools to major projects like FFmpeg, SQLite, and the PHP interpreter. Reconstructions are verified against more than 248,000 agent-generated behavioral tests. No model fully solves a single task; Claude Opus 4.7 leads, with 3% of tasks almost resolved.
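As a rough illustration of how per-task test results could roll up into the "fully resolved" and "almost resolved" figures quoted above, here is a minimal sketch. The `TaskResult` type, the `summarize` function, and in particular the `almost_threshold` value of 0.95 are assumptions made for this example, not the benchmark's published definitions, and the sample numbers are invented.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class TaskResult:
    """Hypothetical record of one task: behavioral tests passed out of total."""
    name: str
    passed: int
    total: int

    @property
    def pass_rate(self) -> float:
        return self.passed / self.total if self.total else 0.0


def summarize(results: Sequence[TaskResult],
              almost_threshold: float = 0.95) -> dict[str, float]:
    """Share of tasks that are fully or almost resolved.

    ASSUMPTION: "almost resolved" is modeled as pass_rate >= almost_threshold,
    and fully resolved tasks also count as almost resolved. The real cutoff
    may differ; check the paper or leaderboard before relying on it.
    """
    n = len(results)
    if n == 0:
        return {"fully_resolved": 0.0, "almost_resolved": 0.0}
    fully = sum(r.total > 0 and r.passed == r.total for r in results)
    almost = sum(r.pass_rate >= almost_threshold for r in results)
    return {"fully_resolved": fully / n, "almost_resolved": almost / n}


if __name__ == "__main__":
    # Invented numbers, purely to exercise the function.
    demo = [
        TaskResult("task-a", 100, 100),
        TaskResult("task-b", 96, 100),
        TaskResult("task-c", 50, 100),
        TaskResult("task-d", 0, 100),
    ]
    print(summarize(demo))
```

Keeping the threshold as a parameter makes the assumption explicit and easy to change once the actual scoring rule is confirmed.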

## Example

A ProgramBench agent receives the compiled `jq` binary and its man page. Without seeing a single line of source, it must choose a programming language, design an architecture, and produce a build-ready codebase whose output matches `jq` across thousands of edge-case inputs. A human engineer would need days to complete the same task.
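The "output matches" requirement can be pictured as a black-box differential test: run the reference binary and the reconstruction on the same input and compare what is observable from outside. The sketch below is illustrative only. The helper names (`observe`, `mismatches`) are hypothetical, comparing stdout and exit code is an assumed definition of "behavior", and two small Python one-liners stand in for the real binary and its reimplementation so the example runs anywhere. It is not ProgramBench's actual harness.

```python
import subprocess
import sys
from typing import NamedTuple, Sequence


class Observation(NamedTuple):
    """Externally visible behavior of one program run."""
    stdout: bytes
    exit_code: int


def observe(cmd: Sequence[str], stdin: bytes, timeout: float = 10.0) -> Observation:
    """Run `cmd` with `stdin` and record what a black-box test can see."""
    proc = subprocess.run(list(cmd), input=stdin, capture_output=True, timeout=timeout)
    return Observation(proc.stdout, proc.returncode)


def mismatches(reference: Sequence[str], candidate: Sequence[str],
               inputs: Sequence[bytes]) -> list[bytes]:
    """Return the inputs on which the candidate diverges from the reference."""
    return [data for data in inputs
            if observe(reference, data) != observe(candidate, data)]


# Stand-ins for a real binary and its reimplementation: both upper-case
# stdin, but are deliberately written in different ways.
REFERENCE = [sys.executable, "-c",
             "import sys; sys.stdout.write(sys.stdin.read().upper())"]
CANDIDATE = [sys.executable, "-c",
             "import sys; print(''.join(c.upper() for c in sys.stdin.read()), end='')"]

if __name__ == "__main__":
    cases = [b"", b"hello\n", b"Mixed Case 123\n"]
    # An empty list means the candidate matched the reference on every case.
    print(mismatches(REFERENCE, CANDIDATE, cases))
```

In a real setting the two command lists would point at the original binary and the rebuilt program, and the input list would come from the benchmark's test suite rather than three hand-written cases.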

## Analogy

Think of it as an architecture contest in which you must redraw the blueprints after seeing only the finished building.

## Why it's emerging now

ProgramBench was published on May 5, 2026 by researchers at Meta FAIR, Stanford, and Harvard, the team behind SWE-bench, and it resets the difficulty bar for coding AI. Nine frontier models each score 0% fully resolved, which has sparked debate about the gap between LLM code generation and real software engineering.

## Related terms

- *related:* SWE-bench
- *competitor:* MirrorCode
- *parent:* agentic-coding
- *parent:* coding-agents
- *related:* managed-agents
- *related:* HumanEval
- *related:* BIG-Bench
- *related:* mini-SWE-agent
- *related:* agent-harness
- *related:* deep-research

## Sources

1. [ProgramBench paper — arXiv:2605.03546 (May 5, 2026)](https://arxiv.org/abs/2605.03546)
2. [facebookresearch/ProgramBench — official GitHub repo](https://github.com/facebookresearch/ProgramBench)
3. [ProgramBench.com — live leaderboard](https://programbench.com/)
4. [HN: ProgramBench — 139 points, 72 comments](https://news.ycombinator.com/item?id=48045174)
5. [ProgramBench-Tests dataset — HuggingFace](https://huggingface.co/datasets/programbench/ProgramBench-Tests)
6. [Emergent Mind — ProgramBench: Evaluating LM Software Reconstruction](https://www.emergentmind.com/papers/2605.03546)
7. [ProgramBench paper full text — arXiv HTML](https://arxiv.org/html/2605.03546v1)

---
_Generated by EarlyTerms · https://earlyterms.com/term/programbench_