New Survey Unifies LLM Policy Optimization Methods on First Principles from REINFORCE to GRPO

A new survey on arXiv revisits LLM policy optimization from first principles, modeling all methods as modifications of either the trajectory probability or reward function. It covers the path from REINFORCE to GRPO and beyond, identifying compound failures that require joint design of both sides.

iGEN Editorial

June 16, 2026

Enterprise technology leaders deploying large language models (LLMs) in business applications need to understand how these systems are optimized. A new paper on arXiv provides a first-principles derivation of LLM policy optimization, offering a unified framework that clarifies the design rationale behind methods from REINFORCE to PPO to GRPO and their extensions.

The paper, titled "A First-Principles Derivation of LLM Policy Optimization: From Expected Reward to GRPO and Its Structural Extensions," is authored by Shen, Jianghan, Luo, Siqi, Yue, Liu, Jiyao, Qu, Wanying, Zhang, Huang, Ziyan, Tianbin, Ming, Xiaohong, Chen, Yirong, and He, Junjun. According to the paper, all policy gradient algorithms optimize the same objective: J(θ) = E[R(τ)], which has exactly two factors — the trajectory probability p_θ(τ) and the reward R(τ).
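As a worked illustration of why the objective splits into exactly these two factors, the standard log-derivative identity can be written out as below. The notation is an assumption made for this sketch and is not taken from the paper: trajectories are treated as discrete so the expectation is a sum, and R(τ) is assumed not to depend on θ.

```latex
\begin{aligned}
J(\theta) &= \mathbb{E}_{\tau \sim p_\theta}\left[ R(\tau) \right]
           = \sum_{\tau} p_\theta(\tau)\, R(\tau), \\
\nabla_\theta J(\theta)
          &= \sum_{\tau} \nabla_\theta p_\theta(\tau)\, R(\tau)
           = \sum_{\tau} p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)\, R(\tau) \\
          &= \mathbb{E}_{\tau \sim p_\theta}\left[ R(\tau)\, \nabla_\theta \log p_\theta(\tau) \right].
\end{aligned}
```

Under these assumptions the gradient is an average of a reward term multiplied by a trajectory log-probability term, so any change to the estimator has to act on one of the two.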

The survey organizes methods along two axes: the trajectory side (induced by p_θ(τ)) and the reward side (induced by R(τ)). Every method from REINFORCE to PPO to GRPO modifies one or both factors to address a specific failure in the preceding formulation. Existing surveys organize methods by domain or chronology, which the authors say obscures the rationale behind each design choice and the precise location of intervention within the gradient estimator.
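To make the two sides concrete, here is a minimal Python sketch of the basic score-function (REINFORCE) estimator on a toy problem. Everything in it is an illustrative assumption rather than the paper's setup: the "trajectory" is a single action drawn from a softmax policy, the rewards are fixed numbers, and the function names are hypothetical. The reward side appears as `rewards[action]` and the trajectory side as the gradient of the log-probability (`score`).

```python
import math
import random


def softmax(logits):
    """Convert logits to probabilities in a numerically stable way."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def exact_gradient(logits, rewards):
    """Exact gradient of J = sum_a p(a) R(a) with respect to the logits."""
    probs = softmax(logits)
    expected = sum(p * r for p, r in zip(probs, rewards))
    return [p * (r - expected) for p, r in zip(probs, rewards)]


def reinforce_gradient(logits, rewards, num_samples, rng):
    """Monte Carlo estimate: average of R(a) * grad log p(a) over samples."""
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for _ in range(num_samples):
        # Trajectory side: sample an action from the current policy.
        action = rng.choices(range(len(probs)), weights=probs)[0]
        for k in range(len(logits)):
            # d/d logit_k of log softmax(action) = 1[k == action] - p_k
            score = (1.0 if k == action else 0.0) - probs[k]
            # Reward side: weight the score by the observed reward.
            grad[k] += rewards[action] * score / num_samples
    return grad


if __name__ == "__main__":
    rng = random.Random(0)
    print("exact   :", exact_gradient([0.0, 0.0], [1.0, 0.0]))
    print("estimate:", reinforce_gradient([0.0, 0.0], [1.0, 0.0], 20000, rng))
```

With enough samples the estimate should land close to the exact gradient. For an LLM the single action would be replaced by a token sequence whose log-probability is the sum of per-token log-probabilities.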

Two-Factor Framework

The paper revisits the landscape from J(θ) on first principles, using the trajectory and reward sides as the two axes. It covers the path from REINFORCE and PPO to GRPO, as well as post-GRPO variants including Agentic RL and GRPO-OPD. The framework is unified, diagnostic, and extensible: it analyzes methods from a shared objective, identifies which side each method modifies and why, and applies the same trajectory and reward axes across all settings.

Method	Primary Modification	Axis Modified
REINFORCE	Basic policy gradient	Both (implicitly)
PPO	Clipped surrogate objective	Trajectory side
GRPO	Group reward normalization	Reward side
Agentic RL	Agent-centric reward shaping	Reward side
GRPO-OPD	Online preference distillation	Both
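The PPO and GRPO rows can be sketched in a few lines of Python using the commonly published forms of each technique. This is an illustration under stated assumptions, not the survey's formulation: the function names are hypothetical, the group standard deviation is the population form, `clip_eps=0.2` is just a frequently used default, and real implementations work per token on batched tensors and often add a KL penalty.

```python
import math


def group_normalized_advantages(rewards, eps=1e-8):
    """Reward side (GRPO-style): score each sample relative to its group.

    `rewards` holds the scalar rewards of several responses to one prompt.
    """
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    # eps avoids division by zero when every reward in the group is equal.
    return [(r - mean) / (std + eps) for r in rewards]


def clipped_surrogate(logp_new, logp_old, advantage, clip_eps=0.2):
    """Trajectory side (PPO-style): limit how far the probability ratio
    between the new and old policy can push the objective."""
    ratio = math.exp(logp_new - logp_old)
    clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    # Take the more pessimistic of the unclipped and clipped terms.
    return min(ratio * advantage, clipped * advantage)


if __name__ == "__main__":
    advantages = group_normalized_advantages([1.0, 0.0, 0.0, 1.0])
    print("advantages:", advantages)
    print("surrogate :", clipped_surrogate(math.log(1.5), 0.0, advantages[0]))
```

In this sketch the two functions compose naturally: the group-normalized advantage replaces a learned value baseline, and the clipped ratio is then applied to it.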

Compound Failures and Joint Design

According to the paper, the framework also exposes compound failures that no single-side fix resolves, requiring joint design of both the trajectory and reward sides. The boundary cases and coupled failures the framework identifies mark where existing solutions run out, and the authors present them as a principled starting point for designing the next generation of LLM policy optimization algorithms.

For enterprise buyers, this unified perspective helps demystify how LLMs are fine-tuned for specific tasks. Decomposing optimization into trajectory probability (how likely the model is to produce a given sequence of actions) and reward (how desirable the outcome is) gives technology leaders a common vocabulary for comparing AI vendors' approaches and for asking how a given method is likely to behave in business-critical applications such as supply chain decision-making or customer service automation.

The survey is available on arXiv under the Computer Science > Artificial Intelligence category and is a valuable resource for anyone building or procuring LLM-based solutions.
