Artificial Intelligence #auditing #reward hackability
Auditing Reward Hackability in Code RL Training Environments Reveals 28.5% Weak Test Suites
An arXiv paper by Rajan measures reward hackability in training environments for code reinforcement learning (RL). On a 49-task sample of SWE-bench Verified, 28.5% of tasks have test suites weak enough that an incorrect patch, with its incorrectness verified in Docker, still passes them. The study also proposes a hardening procedure that uses an LLM judge and a Docker gate to detect such defects.
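The measurement described above can be illustrated with a minimal sketch. It is not the paper's code: the names `Task`, `audit`, `hackability_rate`, and the stubbed test runner are hypothetical, and the toy data is invented for illustration only. It assumes each task comes with a patch already known to be incorrect, and counts a task as hackable when its test suite accepts that patch.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Task:
    """Hypothetical record: one task plus a patch known to be incorrect."""
    task_id: str
    incorrect_patch: str


def audit(tasks: Iterable[Task], passes_tests: Callable[[Task], bool]) -> List[str]:
    """Return ids of tasks whose test suite accepts the incorrect patch."""
    return [t.task_id for t in tasks if passes_tests(t)]


def hackability_rate(tasks: Iterable[Task], passes_tests: Callable[[Task], bool]) -> float:
    """Fraction of tasks whose tests are weak enough to pass a wrong patch."""
    task_list = list(tasks)
    if not task_list:
        raise ValueError("need at least one task")
    return len(audit(task_list, passes_tests)) / len(task_list)


# Toy data (invented): four tasks, one of which has a weak test suite.
toy_tasks = [Task(f"task-{i}", incorrect_patch="...") for i in range(4)]
weak_suites = {"task-2"}


def toy_runner(task: Task) -> bool:
    """Stand-in for a containerized test run; assumption, not a real runner."""
    return task.task_id in weak_suites


toy_hackable = audit(toy_tasks, toy_runner)
toy_rate = hackability_rate(toy_tasks, toy_runner)
```

In a real audit, `toy_runner` would be replaced by something that applies the patch and runs the task's tests in an isolated container; that part is deliberately left out here because the summary does not describe it.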
Jun 16, 2026 1 source