Disruption recovery in industrial assembly lines demands rapid decisions in response to machine faults, worker absences, and emergency orders. Traditional approaches rely either on rigid handcrafted recovery logic or on adaptive policies that learn from data but cannot readily exploit diverse external recovery knowledge at the moment of decision. This gap often prolongs abnormal recovery time (ART) and puts on-time delivery (OTD) at risk.
Researchers from multiple institutions have introduced a phase-aware guidance injection framework that augments a trained recurrent Multi-Agent Proximal Policy Optimization (RMAPPO) scheduling policy by adding a logit-level action bias during evaluation. The framework provides a unified decision-time interface for integrating heterogeneous recovery hints (rule-based, replay-based, and online LLM-based guidance) and activates intervention only during the abnormal and recovery phases of the assembly line.
The Problem of Assembly-Line Disruption
Assembly lines are complex multi-agent systems in which coordinated scheduling is critical. When a disruption occurs, such as a machine fault or a worker absence, recovery actions must be taken quickly to minimize downtime. Existing methods depend either on pre-programmed recovery rules that lack adaptability or on reinforcement learning policies that cannot ingest external knowledge at decision time. The result is suboptimal recovery: longer ART and lower OTD.
Phase-Aware Guidance Injection Framework
The proposed framework builds on recurrent MAPPO, a multi-agent reinforcement learning algorithm that uses recurrent neural networks to handle partially observable environments. During evaluation, the framework biases the action logits of the policy using guidance from external sources. This logit-level injection allows the policy to incorporate domain expertise or historical recovery data without needing to retrain or redesign the actor network.
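As a rough illustration of logit-level injection, the sketch below adds a guidance bias to the logits of a policy before the softmax, leaving the policy itself untouched. The function names, the `strength` parameter, and the example numbers are assumptions made for illustration; the article does not specify how the bias is constructed or scaled.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def inject_guidance(logits, guidance_bias, strength=1.0):
    """Return logits + strength * bias (hypothetical helper).

    Only the logits are shifted; no policy weights are modified.
    """
    if len(logits) != len(guidance_bias):
        raise ValueError("logits and guidance_bias must have the same length")
    return [l + strength * b for l, b in zip(logits, guidance_bias)]

# Hypothetical actor output for three candidate scheduling actions.
policy_logits = [2.0, 1.0, 0.5]
# Hypothetical external hint that favours action 1.
hint_bias = [0.0, 2.0, 0.0]

base_probs = softmax(policy_logits)
guided_probs = softmax(inject_guidance(policy_logits, hint_bias))
```

With these numbers the hint raises the probability of action 1 enough to make it the most likely choice, while setting `strength=0.0` recovers the unguided policy exactly.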
Importantly, the framework is phase-aware: it activates guidance injection only during abnormal and recovery phases, leaving normal operations untouched. This targeted intervention prevents unnecessary bias during steady-state production.
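Phase gating of this kind can be sketched in a few lines. The `Phase` enum, the `GUIDED_PHASES` set, and the function name below are hypothetical; how the framework actually detects the current phase is not described in this article.

```python
from enum import Enum

class Phase(Enum):
    NORMAL = "normal"
    ABNORMAL = "abnormal"
    RECOVERY = "recovery"

# Assumption: guidance is active in exactly these two phases.
GUIDED_PHASES = {Phase.ABNORMAL, Phase.RECOVERY}

def phase_gated_logits(logits, guidance_bias, phase, strength=1.0):
    """Apply the guidance bias only in abnormal or recovery phases."""
    if phase not in GUIDED_PHASES:
        # Steady-state production: return the policy logits unchanged.
        return list(logits)
    return [l + strength * b for l, b in zip(logits, guidance_bias)]

logits = [2.0, 1.0, 0.5]   # hypothetical actor output
bias = [0.0, 2.0, 0.0]     # hypothetical hint favouring action 1

normal_out = phase_gated_logits(logits, bias, Phase.NORMAL)
abnormal_out = phase_gated_logits(logits, bias, Phase.ABNORMAL)
recovery_out = phase_gated_logits(logits, bias, Phase.RECOVERY)
```

In the normal phase the output equals the input logits, so the learned policy behaves exactly as it would without any guidance.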
Three Guidance Modes
The framework supports three distinct types of guidance:
- Rule-based guidance: Uses handcrafted recovery rules drawn from domain expertise. According to the paper, this mode yields the strongest performance gains.
- Replay-based guidance: Leverages past successful recovery episodes. Performance degrades smoothly when the availability of relevant replays is imperfect.
- Online LLM guidance: Employs a large language model to generate recovery suggestions in real time. This mode provides useful intermediate improvements, though smaller than those from high-quality rule guidance.
| Guidance Mode | Source of Hints | Reported Effect |
|---|---|---|
| Rule-based | Handcrafted recovery rules from domain expertise | Strongest gains |
| Replay-based | Past successful recovery episodes | Degrades smoothly when relevant replays are imperfectly available |
| Online LLM | Recovery suggestions generated by an LLM in real time | Intermediate improvements |
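One way to picture a unified decision-time interface for the three modes is a common method that turns any hint into a bias vector over actions. Everything in the sketch below is an assumption for illustration: the `GuidanceProvider` protocol, the class names, the event strings, the action indices, and the weights are hypothetical, and the LLM is replaced by a stub callable rather than a real model call.

```python
from collections import Counter
from typing import Callable, Dict, List, Optional, Protocol

class GuidanceProvider(Protocol):
    """Hypothetical common interface: map a disruption event to a logit bias."""
    def bias(self, event: str, n_actions: int) -> List[float]: ...

def one_hot_bias(action: Optional[int], n_actions: int, weight: float) -> List[float]:
    """Bias vector that favours one action, or all zeros if there is no hint."""
    out = [0.0] * n_actions
    if action is not None and 0 <= action < n_actions:
        out[action] = weight
    return out

class RuleGuidance:
    """Handcrafted lookup table: event -> recommended action index."""
    def __init__(self, rules: Dict[str, int], weight: float = 2.0):
        self.rules, self.weight = rules, weight
    def bias(self, event: str, n_actions: int) -> List[float]:
        return one_hot_bias(self.rules.get(event), n_actions, self.weight)

class ReplayGuidance:
    """Recommends the action taken most often in stored recovery episodes."""
    def __init__(self, replays: Dict[str, List[int]], weight: float = 1.0):
        self.replays, self.weight = replays, weight
    def bias(self, event: str, n_actions: int) -> List[float]:
        past = self.replays.get(event, [])
        if not past:
            # No relevant replay: contribute no bias instead of guessing.
            return [0.0] * n_actions
        best_action, _ = Counter(past).most_common(1)[0]
        return one_hot_bias(best_action, n_actions, self.weight)

class LLMGuidance:
    """Wraps any callable that maps an event description to an action index."""
    def __init__(self, suggest: Callable[[str], Optional[int]], weight: float = 1.0):
        self.suggest, self.weight = suggest, weight
    def bias(self, event: str, n_actions: int) -> List[float]:
        return one_hot_bias(self.suggest(event), n_actions, self.weight)

providers: Dict[str, GuidanceProvider] = {
    "rule": RuleGuidance({"machine_fault": 1}),
    "replay": ReplayGuidance({"worker_absence": [2, 2, 0]}),
    "llm": LLMGuidance(lambda event: 0),  # stub standing in for an LLM call
}

rule_bias = providers["rule"].bias("machine_fault", 3)
replay_bias = providers["replay"].bias("worker_absence", 3)
missing_replay_bias = providers["replay"].bias("emergency_order", 3)
llm_bias = providers["llm"].bias("emergency_order", 3)
```

Because every provider returns a vector of the same shape, the downstream policy code does not need to know which source produced the hint. Returning a zero bias when no replay matches is one simple way to get graceful fallback to the unguided policy.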
Experimental Results on AssemblyLineEnv
The researchers conducted experiments on a custom environment called AssemblyLineEnv, designed to simulate realistic assembly-line disruptions. The results showed that decision-time guidance injection lets the policy exploit heterogeneous recovery hints without altering the learned actor network. High-quality rule guidance produced the strongest improvement in ART and OTD, while replay-based guidance degraded smoothly when relevant replays were only imperfectly available. Online LLM guidance, despite its lack of task-specific training, still delivered useful intermediate gains.
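The article does not give formulas for ART and OTD, so the sketch below assumes simple definitions purely for illustration: ART as the mean length of each contiguous non-normal stretch in a per-step phase trace, and OTD as the fraction of orders completed no later than their due step. The function names and the example data are hypothetical and are not results from the paper.

```python
def abnormal_recovery_time(phases):
    """Mean length of contiguous non-normal runs (assumed ART definition)."""
    runs, current = [], 0
    for phase in phases:
        if phase != "normal":
            current += 1
        elif current:
            runs.append(current)
            current = 0
    if current:
        # The trace ended while the line was still recovering.
        runs.append(current)
    return sum(runs) / len(runs) if runs else 0.0

def on_time_delivery(orders):
    """Fraction of (completion_step, due_step) pairs finished on time (assumed OTD definition)."""
    if not orders:
        raise ValueError("at least one order is required")
    on_time = sum(1 for done, due in orders if done <= due)
    return on_time / len(orders)

# Made-up trace with two disruptions, lasting 3 steps and 1 step.
trace = ["normal", "abnormal", "abnormal", "recovery", "normal", "abnormal", "normal"]
# Made-up orders: two on time, one late.
orders = [(5, 6), (9, 8), (4, 4)]

art = abnormal_recovery_time(trace)
otd = on_time_delivery(orders)
```

Under these assumed definitions, lower ART and higher OTD both indicate better recovery, which matches the direction of improvement described above.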
The authors conclude that this approach can be readily applied to existing RMAPPO systems, offering a practical pathway for manufacturers to incorporate external knowledge into disruption recovery without costly retraining.
For enterprise decision-makers evaluating AI in supply chain and manufacturing, the phase-aware guidance injection framework offers a way to combine the adaptability of reinforcement learning with the reliability of expert knowledge. It addresses the practical challenge of integrating diverse recovery strategies, from manual rules to LLM-generated plans, into a single, coherent scheduling system. While the paper focuses on assembly lines, the underlying technique of logit-level injection could extend to other multi-agent coordination problems in logistics, warehousing, and freight operations.