Reinforcement learning (RL) policies often degrade when deployed in unfamiliar environments because they act reactively, without explicit deliberation. To address this, researchers have introduced PACT (Plan, Align, Commit, Think), a hybrid architecture that pairs a fast, reactive RL policy with a slow, deliberative small language model (SLM) planner.
Architecture: Dual-System Decision-Making
According to the paper, PACT invokes the SLM asynchronously to generate and validate candidate action plans. The SLM operates in a deliberative mode, and each plan it produces is checked in simulation for three properties: it must be safe, feasible, and complete. Once a plan passes verification, it is executed directly, bypassing the RL policy entirely. The design does not require retraining or modifying the existing RL policy, so the planner can be added on top of a policy that has already been trained.
The SLM backbone used in the experiments is a 2-billion-parameter model, which serves as the deliberative planner.
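The verification step described above can be illustrated with a minimal sketch. It is not the paper's implementation: the function name `verify_plan`, the action names, and the deterministic (non-slippery) dynamics are all assumptions made for illustration. The grid matches the default 4x4 map in Gymnasium's FrozenLake, though the article does not say which maps the researchers used. Treating a move off the grid as infeasible is also a choice made here, not something the source specifies.

```python
from typing import List, Tuple

# Layout of Gymnasium's default 4x4 FrozenLake map:
# S = start, F = frozen (safe), H = hole, G = goal.
GRID = [
    "SFFF",
    "FHFH",
    "FFFH",
    "HFFG",
]

# Action names are assumptions chosen for this sketch.
MOVES = {"left": (0, -1), "down": (1, 0), "right": (0, 1), "up": (-1, 0)}


def verify_plan(grid: List[str], plan: List[str]) -> Tuple[bool, str]:
    """Simulate a plan on a deterministic grid and report whether it passes.

    A plan passes only if it is feasible (every action is known and stays
    on the grid), safe (never enters a hole), and complete (reaches the goal).
    """
    rows, cols = len(grid), len(grid[0])
    # Locate the start cell.
    r, c = next(
        (i, j)
        for i, row in enumerate(grid)
        for j, ch in enumerate(row)
        if ch == "S"
    )
    for step, action in enumerate(plan):
        if action not in MOVES:
            return False, f"infeasible: unknown action {action!r} at step {step}"
        dr, dc = MOVES[action]
        nr, nc = r + dr, c + dc
        if not (0 <= nr < rows and 0 <= nc < cols):
            return False, f"infeasible: leaves the grid at step {step}"
        r, c = nr, nc
        if grid[r][c] == "H":
            return False, f"unsafe: enters a hole at step {step}"
        if grid[r][c] == "G":
            # The episode ends at the goal, so later actions are ignored.
            return True, "ok"
    return False, "incomplete: plan ends before reaching the goal"
```

Under these assumptions, a plan such as `["down", "down", "right", "down", "right", "right"]` passes, while `["right", "down"]` is rejected as unsafe because its second move lands on a hole.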
Evaluation on FrozenLake
FrozenLake is a classic grid-world problem in which an agent must navigate from a start cell to a goal cell while avoiding holes. The researchers evaluated PACT on three configurations of this environment, ordered by increasing difficulty. PACT outperformed all baselines across the tested configurations.
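For readers unfamiliar with the environment, the following is a small stand-in for FrozenLake, written for illustration only. The class name `MiniFrozenLake` is hypothetical. The action encoding (0 = left, 1 = down, 2 = right, 3 = up), the reward of 1 on reaching the goal, and the default 4x4 map follow Gymnasium's FrozenLake, but the dynamics here are deterministic, and the article does not state which maps or slipperiness settings the three evaluated configurations used.

```python
from typing import List, Tuple

DEFAULT_MAP = ["SFFF", "FHFH", "FFFH", "HFFG"]

# Action encoding as in Gymnasium's FrozenLake: 0=left, 1=down, 2=right, 3=up.
DELTAS = {0: (0, -1), 1: (1, 0), 2: (0, 1), 3: (-1, 0)}


class MiniFrozenLake:
    """Deterministic (non-slippery) grid world, for illustration only."""

    def __init__(self, grid: List[str] = DEFAULT_MAP) -> None:
        self.grid = grid
        self.rows, self.cols = len(grid), len(grid[0])
        self.pos = (0, 0)

    def _state(self) -> int:
        # States are numbered row by row, starting at 0 in the top-left cell.
        r, c = self.pos
        return r * self.cols + c

    def reset(self) -> int:
        """Place the agent on the start cell and return its state index."""
        self.pos = next(
            (i, j)
            for i, row in enumerate(self.grid)
            for j, ch in enumerate(row)
            if ch == "S"
        )
        return self._state()

    def step(self, action: int) -> Tuple[int, float, bool]:
        """Apply one action and return (state, reward, done)."""
        dr, dc = DELTAS[action]
        r, c = self.pos
        # A move that would leave the grid keeps the agent at the edge.
        nr = min(max(r + dr, 0), self.rows - 1)
        nc = min(max(c + dc, 0), self.cols - 1)
        self.pos = (nr, nc)
        cell = self.grid[nr][nc]
        reward = 1.0 if cell == "G" else 0.0
        done = cell in ("H", "G")  # episode ends in a hole or at the goal
        return self._state(), reward, done
```

An episode ends with reward 1 only if the agent reaches `G`; stepping onto any `H` cell ends it with reward 0, which is what makes unfamiliar layouts punishing for a purely reactive policy.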
"Deliberative planning and reactive execution are more powerful in concert than either is alone in these settings."
The study highlights that the combination of fast reactive responses and slow, deliberative planning enables the system to handle unfamiliar situations where pure RL policies would typically fail.
Implications for Autonomous Systems
Although the research was conducted in a simulated environment, the PACT architecture could apply to autonomous systems that need both immediate reactions and long-term planning. In robotics or automated control, for example, a system could use a reactive policy for routine operation and invoke the SLM planner when it encounters novel or uncertain conditions. Because the SLM is invoked asynchronously and runs in parallel, deliberation does not block the reactive policy's real-time responses.
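The asynchronous pattern can be sketched with Python's standard `concurrent.futures` module. Everything here is a stand-in: `reactive_policy` and `slow_planner` are hypothetical placeholders, the planner's latency is simulated with `time.sleep`, and the verification step is omitted to keep the example short. The point is only that the control loop keeps producing actions on every tick while the planner works in the background.

```python
import time
from concurrent.futures import Future, ThreadPoolExecutor
from typing import List, Optional


def reactive_policy(state: int) -> str:
    """Hypothetical stand-in for the fast RL policy."""
    return "hold"


def slow_planner(state: int) -> List[str]:
    """Hypothetical stand-in for the SLM planner; sleep simulates latency."""
    time.sleep(0.05)
    return ["right", "right", "down"]


def run_control_loop(steps: int, tick: float = 0.01) -> List[str]:
    """Emit one action per tick, switching to the plan once it is ready."""
    executed: List[str] = []
    plan: List[str] = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        # Start deliberation in the background; the loop does not wait for it.
        pending: Optional[Future] = pool.submit(slow_planner, 0)
        for _ in range(steps):
            if pending is not None and pending.done():
                plan = list(pending.result())
                pending = None
            if plan:
                executed.append(plan.pop(0))  # follow the committed plan
            else:
                executed.append(reactive_policy(0))  # fall back to the policy
            time.sleep(tick)
    return executed
```

With the default timings, the first few actions come from the reactive policy, the three planned actions follow once the background call finishes, and control returns to the policy when the plan is exhausted.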
Key Components of PACT
- Plan: The SLM generates candidate action plans based on the current state.
- Align: Plans are aligned with the environment's constraints and goals.
- Commit: A plan is committed only after verification through simulation.
- Think: The system continuously refines its planning through deliberation.
The architecture is designed to be modular, allowing the RL policy and SLM to operate independently while sharing a common interface for plan execution.
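One way to picture this modularity is a thin controller that talks to the policy and the planner only through small interfaces. The sketch below is an assumption-laden illustration, not the paper's design: `Policy`, `Planner`, `HybridController`, `ConstantPolicy`, and `FixedPlanner` are hypothetical names, and the rule for when to call `offer_plan` is left to the caller because the article does not describe PACT's trigger condition.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Protocol


class Policy(Protocol):
    def act(self, state: int) -> int: ...


class Planner(Protocol):
    def propose(self, state: int) -> List[int]: ...


@dataclass
class HybridController:
    """Routes actions from a committed plan if one exists, else from the policy."""

    policy: Policy
    planner: Planner
    verify: Callable[[int, List[int]], bool]
    _plan: List[int] = field(init=False, default_factory=list)

    def offer_plan(self, state: int) -> bool:
        """Ask the planner for a candidate and commit it only if it verifies."""
        candidate = self.planner.propose(state)
        if self.verify(state, candidate):
            self._plan = list(candidate)
            return True
        return False

    def act(self, state: int) -> int:
        if self._plan:
            return self._plan.pop(0)  # committed plan bypasses the policy
        return self.policy.act(state)  # policy is used unchanged otherwise


@dataclass
class ConstantPolicy:
    """Toy policy that always returns the same action."""

    action: int

    def act(self, state: int) -> int:
        return self.action


@dataclass
class FixedPlanner:
    """Toy planner that always proposes the same action sequence."""

    plan: List[int]

    def propose(self, state: int) -> List[int]:
        return list(self.plan)
```

Because the controller only calls `act`, `propose`, and `verify`, either component can be swapped without touching the other, which mirrors the claim that the RL policy needs no retraining or modification.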
Conclusion
The PACT results suggest that hybrid architectures combining a fast reactive policy with a slow deliberative model can outperform the baselines they were compared against, at least on the FrozenLake configurations tested. By using a small language model for planning, the system gains language-model reasoning without the computational overhead of larger models. The work points toward integrating language-model deliberation into reinforcement learning systems for real-world applications where reliability and adaptability are critical, although that remains to be shown outside simulation.