Deploying machine learning models in production carries the risk of backdoor attacks, where a malicious actor embeds a hidden trigger that causes misclassification. Removing such triggers typically requires retraining or parameter updates, which can be costly or impossible for frozen models. A new research paper introduces InstantForget, an update-free backdoor unlearning method that operates entirely at inference time, resetting anomalous features without altering model weights.
According to the paper by researchers Yu and Zhenyu on arXiv, existing backdoor unlearning often relies on a projection assumption, which the authors audited with oracle access to paired clean and triggered features. They found that projection succeeds mainly on the simple BadNets trigger. For three other triggers (WaNet, Blended, and SIG), projection left attack success rates (ASR) at 0.683, 0.888, and 0.941, respectively, on the CIFAR-10 dataset using a ResNet-18 architecture. The failure is not explained by spectral compactness, spatial locality, or subspace misalignment. Instead, it is predicted by a logit-triplet gap involving the target margin, the target-logit drop, and the non-target logit rise.
| Trigger | ASR After Projection |
|---|---|
| BadNets | (succeeds) |
| WaNet | 0.683 |
| Blended | 0.888 |
| SIG | 0.941 |
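
To make the audited baseline concrete, here is a minimal sketch of one common reading of a projection defense: estimate a trigger subspace from oracle paired feature differences, then project it out. This is an assumption about what "projection" means here, not the paper's code, and the names `trigger_subspace`, `project_out`, and `attack_success_rate` are hypothetical.

```python
import numpy as np

def trigger_subspace(clean_feats, trig_feats, k=1):
    """Top-k directions of the paired differences (triggered minus clean).

    Assumes oracle pairs: row i of both arrays comes from the same input.
    """
    diffs = trig_feats - clean_feats                  # shape (n, d)
    _, _, vt = np.linalg.svd(diffs, full_matrices=False)
    return vt[:k]                                     # (k, d), orthonormal rows

def project_out(feats, basis):
    """Remove the component of each feature vector lying in span(basis)."""
    return feats - (feats @ basis.T) @ basis

def attack_success_rate(preds, target):
    """Fraction of triggered inputs classified as the attacker's target."""
    return float(np.mean(np.asarray(preds) == target))
```

In this sketch the projection removes the trigger completely only when its effect is confined to the estimated subspace. The table above reports that, for three of the four triggers, high ASR survives projection.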
To address these shortcomings, the researchers propose InstantForget, a clean-calibrated gated reset method. It first flags anomalous features using a Mahalanobis score based on a clean reference distribution, then resets only those flagged features toward a neutral non-target representation. The method requires no triggered samples at deployment and leaves model parameters frozen.
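
The two steps described above (flag, then reset) can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not the authors' implementation: the covariance regularizer `eps`, the `strength` parameter, the `neutral` vector, and all function names are hypothetical, and the threshold is passed in rather than derived.

```python
import numpy as np

def fit_clean_reference(clean_feats, eps=1e-3):
    """Mean and precision matrix of clean features (eps is an assumed ridge term)."""
    mu = clean_feats.mean(axis=0)
    cov = np.cov(clean_feats, rowvar=False) + eps * np.eye(clean_feats.shape[1])
    return mu, np.linalg.inv(cov)

def mahalanobis_sq(feats, mu, prec):
    """Squared Mahalanobis distance of each row from the clean reference."""
    centered = feats - mu
    return np.einsum("nd,de,ne->n", centered, prec, centered)

def gated_reset(feats, mu, prec, neutral, threshold, strength=1.0):
    """Move only the flagged rows toward `neutral`; unflagged rows pass through.

    strength=1.0 replaces a flagged feature outright; smaller values
    interpolate (an assumption about what "reset toward" means).
    """
    flagged = mahalanobis_sq(feats, mu, prec) > threshold
    out = feats.copy()
    out[flagged] = (1.0 - strength) * feats[flagged] + strength * neutral
    return out, flagged
```

Because the gate needs only clean statistics and the reset touches only features, the model weights are never modified. How `neutral` and `threshold` are chosen is left open in this sketch.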
With one fixed operating point selected on a held-out triggered validation set, InstantForget reduces the average ASR to 0.071 across four non-adaptive CIFAR-10 triggers. It also achieves a detection AUROC of 0.981 and transfers successfully to six out of eight tested backbone architectures.
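
For readers unfamiliar with the detection metric, AUROC can be computed directly from anomaly scores as the probability that a triggered input scores higher than a clean one. The sketch below is a generic pairwise computation, not taken from the paper, and the function name `auroc` is hypothetical.

```python
import numpy as np

def auroc(scores_clean, scores_triggered):
    """P(triggered score > clean score), counting ties as one half."""
    clean = np.asarray(scores_clean, dtype=float)[None, :]    # (1, n_clean)
    trig = np.asarray(scores_triggered, dtype=float)[:, None]  # (n_trig, 1)
    wins = (trig > clean).sum() + 0.5 * (trig == clean).sum()
    return float(wins / (trig.size * clean.size))
```

A value of 1.0 means the scores separate the two groups perfectly, while 0.5 is chance level.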
Despite its effectiveness, the method has documented limitations. InstantForget fails under the WaNet trigger, on a ModelNet10 point blend, on two backbone geometries, and against adaptive feature-compactness attacks. These failures define the scope of the approach, indicating areas where further research is needed.
The work contributes a new inference-time paradigm for backdoor defense, offering a practical option for models that cannot be retrained, a common constraint in legacy enterprise AI systems. Because it avoids parameter updates, InstantForget can be integrated as a lightweight feature-level step during inference, potentially lowering the cost of maintaining secure ML deployments.