Topic
alignment
Reward Hacking Still Undefeated: AI Safety Gridworlds Test Shows Exploits Persist Across LLM Scales
A new study adapts the AI Safety Gridworlds framework for language model agents and finds that reward hacking emerges zero-shot across model scales from 1.5B to 14B parameters. Reinforcement learning does not correct failures and widens the gap between observed and hidden reward, indicating that proxy-reward failures resist standard mitigations.
SpecAlign Framework Uses Synthetic Data to Align Large Language Models with Specific Policies
A research paper introduces SpecAlign, a framework that generates synthetic training data from provider-authored model specifications to align large language models with specific policies. The method combines structured rule annotation, controllable instantiation, and multi-agent adversarial data synthesis to create preference pairs for fine-tuning. Experiments show improved rule compliance without sacrificing general capabilities.