Enterprise technology leaders deploying large language models (LLMs) in business applications need to understand how these systems are optimized. A new paper on arXiv provides a first-principles derivation of LLM policy optimization, offering a unified framework that clarifies the design rationale behind methods from REINFORCE to PPO to GRPO and their extensions.
The paper, titled "A First-Principles Derivation of LLM Policy Optimization: From Expected Reward to GRPO and Its Structural Extensions," is authored by Shen, Jianghan, Luo, Siqi, Yue, Liu, Jiyao, Qu, Wanying, Zhang, Huang, Ziyan, Tianbin, Ming, Xiaohong, Chen, Yirong, and He, Junjun. According to the paper, all policy gradient algorithms optimize the same objective, J(θ) = E[R(τ)], whose expectation over trajectories involves exactly two factors: the trajectory probability p_θ(τ) and the reward R(τ).
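The two factors become visible once the expectation is written out. What follows is the standard textbook expansion and log-derivative (REINFORCE) identity, in generic notation that may differ from the paper's own. The discrete sum over trajectories and the assumption that R(τ) does not depend on θ are simplifications made here for readability.

```latex
\begin{align}
J(\theta) &= \mathbb{E}_{\tau \sim p_\theta}\left[ R(\tau) \right]
           = \sum_{\tau} p_\theta(\tau)\, R(\tau), \\
\nabla_\theta J(\theta) &= \sum_{\tau} \nabla_\theta p_\theta(\tau)\, R(\tau)
           = \mathbb{E}_{\tau \sim p_\theta}\left[ R(\tau)\, \nabla_\theta \log p_\theta(\tau) \right].
\end{align}
```

Read this way, a change to how trajectories are sampled or weighted acts on p_θ(τ), while a change to how outcomes are scored acts on R(τ).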
The survey organizes methods along two axes: the trajectory side (induced by p_θ(τ)) and the reward side (induced by R(τ)). Every method from REINFORCE to PPO to GRPO modifies one or both factors to address a specific failure in the preceding formulation. Existing surveys organize methods by domain or chronology, which the authors say obscures the rationale behind each design choice and the precise location of intervention within the gradient estimator.
Two-Factor Framework
The paper revisits the landscape starting from J(θ) and working from first principles, using the trajectory and reward sides as its two axes. It covers the path from REINFORCE and PPO to GRPO, as well as post-GRPO variants including Agentic RL and GRPO-OPD. The authors present the framework as unified, diagnostic, and extensible: it analyzes methods from a shared objective, identifies which side each method modifies and why, and applies the same trajectory and reward axes across all settings.
| Method | Primary Modification | Axis Modified |
|---|---|---|
| REINFORCE | Basic policy gradient | Both (implicitly) |
| PPO | Clipped surrogate objective | Trajectory side |
| GRPO | Group reward normalization | Reward side |
| Agentic RL | Agent-centric reward shaping | Reward side |
| GRPO-OPD | Online preference distillation | Both |
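
To make the PPO and GRPO rows concrete, here is a minimal Python sketch of the two modifications in their commonly published forms: a clipped probability-ratio surrogate (trajectory side) and rewards standardized within a group of responses to the same prompt (reward side). It is an illustration under stated assumptions, not the paper's implementation. The function names, the sample rewards, the `clip_eps=0.2` default, and the use of the population standard deviation are all choices made here. In practice GRPO-style training typically combines both steps, which this sketch keeps separate for clarity.

```python
import math
import statistics


def group_normalized_advantages(rewards, eps=1e-8):
    """Reward-side step: standardize rewards within one group of responses.

    Assumption: population standard deviation; implementations vary.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def clipped_surrogate(logp_new, logp_old, advantage, clip_eps=0.2):
    """Trajectory-side step: PPO-style clipped objective for one sample.

    The ratio p_new / p_old is limited to [1 - clip_eps, 1 + clip_eps],
    and the more pessimistic of the two candidate values is kept.
    """
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    return min(ratio * advantage, clipped_ratio * advantage)


# Hypothetical rewards for four responses sampled for the same prompt.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_normalized_advantages(rewards)

# Hypothetical log-probabilities: the new policy is 1.5x more likely
# than the old one to produce the first response.
objective = clipped_surrogate(math.log(1.5), 0.0, advantages[0])
print(advantages, objective)
```

With a positive advantage, the clip caps how much credit the update can claim for raising a response's probability. With a negative advantage, the unclipped term is kept because it is the more pessimistic one.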
Compound Failures and Joint Design
According to the paper, the framework also exposes compound failures that no single-side fix resolves, requiring joint design of both the trajectory and reward sides. The boundary cases and coupled failures identified by this map mark where existing solutions run out and provide a principled starting point for designing the next generation of LLM policy optimization algorithms.
For enterprise buyers, this unified perspective helps demystify how LLMs are fine-tuned for specific tasks. Understanding that optimization can be decomposed into trajectory probability (how likely a sequence of actions is) and reward (how desirable the outcome is) allows technology leaders to evaluate different AI vendors' approaches and anticipate performance in business-critical applications such as supply chain decision-making or customer service automation.
The survey is available on arXiv under the Computer Science > Artificial Intelligence category and is a valuable resource for anyone building or procuring LLM-based solutions.