Artificial Intelligence #ai #llm
New Survey Unifies LLM Policy Optimization Methods from First Principles, Spanning REINFORCE to GRPO
A new survey on arXiv revisits LLM policy optimization from first principles, modeling every method as a modification of either the trajectory probability or the reward function. It traces the path from REINFORCE to GRPO and beyond, and identifies compound failures that require the probability side and the reward side to be designed jointly.
Jun 16, 2026 1 source