# GRPO

> **TL;DR.** GRPO (Group Relative Policy Optimization) is a reinforcement-learning algorithm that teaches language models to reason by sampling multiple answers per question and scoring each answer against the group's own average, dropping the separate value network that PPO needs.

- **Category:** AI / Research / Reinforcement Learning
- **Stage:** established
- **Age:** 862 days
- **Origin date:** 2024-02-05
- **First detected:** 2026-04-20
- **Canonical URL:** https://earlyterms.com/term/grpo
- **Sources:** 8 primary URLs

## Definition

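GRPO (Group Relative Policy Optimization) is a reinforcement-learning algorithm for training language models to reason. For each question it samples a group of answers from the current policy, scores each one with a reward function, and uses how far that reward sits above or below the group's own average (typically scaled by the group's standard deviation) as the advantage signal. Because the group average serves as the baseline, GRPO drops the separate value network that PPO needs.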
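The group-relative scoring step can be sketched in a few lines of Python. This is a minimal illustration, not DeepSeek's or TRL's implementation: the function name `group_relative_advantages`, the `eps` constant, the use of the population standard deviation, and the 0/1 reward values are all assumptions chosen for the example.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Score each sampled answer against its own group's average reward.

    `rewards` holds one scalar reward per answer sampled for the SAME question.
    `eps` (an assumed constant) avoids division by zero when every answer in
    the group received the same reward.
    """
    baseline = mean(rewards)   # the group's own average acts as the baseline
    spread = pstdev(rewards)   # population std; an assumption for this sketch
    return [(r - baseline) / (spread + eps) for r in rewards]


# Hypothetical rewards for five sampled answers: 1.0 = correct, 0.0 = wrong.
rewards = [1.0, 0.0, 0.0, 1.0, 1.0]
advantages = group_relative_advantages(rewards)

for reward, advantage in zip(rewards, advantages):
    print(f"reward={reward:.1f}  advantage={advantage:+.3f}")
```

Answers that beat the group average get a positive advantage and are reinforced, while below-average answers get a negative one. No learned critic is involved: the baseline is recomputed from each fresh group of samples. If every answer in a group earns the same reward, all advantages are zero and that question contributes no learning signal.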

It was introduced by DeepSeek in the [DeepSeekMath paper](https://arxiv.org/abs/2402.03300) on February 5, 2024, and gained wide attention a year later when [DeepSeek-R1](https://arxiv.org/abs/2501.12948) used it and reported performance comparable to OpenAI's o1 on math and code benchmarks. Hugging Face TRL shipped a [GRPOTrainer](https://huggingface.co/docs/trl/main/en/grpo_trainer), and Qwen, Kimi, Skywork-R1V and OpenPipe ART now train with GRPO or variants of it.

## Example

OpenPipe trained a 14B model with GRPO on the 'Temporal Clue' puzzle benchmark and reported beating o1, o3-mini and R1 on that task. A [199-point HN thread](https://news.ycombinator.com/item?id=43272089) in March 2025 helped bring GRPO to the attention of independent RL researchers, not just DeepSeek engineers.

## Analogy

PPO hires a tutor to grade each answer; GRPO has the student take five tries and uses their own average as the passing line.

## Why it's emerging now

A technique that first appeared in a February 2024 math-model paper became the default RL recipe for open reasoning models after DeepSeek-R1 reported o1-level results in January 2025. The December 2025 arXiv 'PPO vs GRPO vs DAPO' paper and a flood of Qwen / Kimi / Skywork variants in early 2026 cemented GRPO as the method to reach for when you want chain-of-thought quality without a critic network.

## Related terms

- *parent:* PPO
- *child:* DAPO
- *child:* Dr. GRPO
- *related:* RLHF
- *competitor:* DPO
- *related:* DeepSeek-R1
- *related:* DeepSeekMath
- *related:* reward model
- *related:* chain-of-thought
- *related:* Qwen3.6
- *related:* tokenmaxxing

## Sources

1. [DeepSeekMath paper (GRPO introduction)](https://arxiv.org/abs/2402.03300)
2. [DeepSeek-R1 paper](https://arxiv.org/abs/2501.12948)
3. [Hugging Face TRL — GRPOTrainer docs](https://huggingface.co/docs/trl/main/en/grpo_trainer)
4. [Cameron Wolfe — Group Relative Policy Optimization (GRPO) deep dive](https://cameronrwolfe.substack.com/p/grpo)
5. [OpenPipe — Using GRPO to beat o1, o3-mini, R1 at Temporal Clue](https://openpipe.ai/blog/using-grpo-to-beat-o1-o3-mini-and-r1-on-temporal-clue)
6. [arXiv 2512.07611 — Comparative analysis of PPO, GRPO, DAPO](https://arxiv.org/abs/2512.07611)
7. [Sebastian Raschka — State of RL for LLM Reasoning](https://magazine.sebastianraschka.com/p/the-state-of-llm-reasoning-model-training)
8. [Hacker News — GRPO-Zero from-scratch implementation](https://news.ycombinator.com/item?id=43272089)

---
_Generated by EarlyTerms · https://earlyterms.com/term/grpo_