As AI-generated images become increasingly realistic, traditional semantic-level inconsistency checks are no longer sufficient for reliable detection. A new research paper from a team of computer scientists introduces a method called Deep Visual Residual MLLM (Deep-VRM) that enables multimodal large language models (MLLMs) to capture full-spectrum forensic signals—including low-level generator artifacts—while retaining their pre-trained semantic understanding.
The work, posted on arXiv under the title "Deep Residual Injection for Full-Spectrum Forensic Signal Perception in Multimodal Large Language Models," addresses a critical limitation of fine-tuning MLLMs for artifact learning. The authors (Kaiqing Lin, Zhiyuan Yan, Ruoxin Chen, Ke-Yue Zhang, Piao Zhou, Caiyong, Bin, Taiping Yao, Bo Wang, Youchang Xiao, and Shouhong Ding) conducted a layer-wise analysis of forensic signal perception in MLLMs and found that semantic information is primarily formed in the early-to-middle layers, whereas direct fine-tuning for artifact learning disrupts these semantic representations.
The Challenge of Full-Spectrum Perception
MLLMs have been increasingly adopted in forensics due to their robust semantic understanding. However, as AI-generated images become more realistic, relying solely on semantic-level inconsistencies is often insufficient. The researchers therefore ask whether MLLMs can achieve full-spectrum forensic signal perception, that is, capture low-level generator artifacts without sacrificing pre-trained semantic knowledge.
Deep Residual Injection Method
To solve this, the team proposes Deep-VRM. The architecture preserves early semantic processing while injecting artifact-specific visual signals as a residual path into an intermediate layer. These artifact signals are then fused with semantic token representations and propagated through subsequent trainable layers. This design enables later layers to jointly model semantic reasoning and signal-level forensic cues. Surprisingly, the model learns to adaptively leverage different levels of forensic signals depending on the input, achieving robust and generalizable detection performance.
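The residual injection idea can be illustrated with a toy forward pass. This is only a minimal sketch under stated assumptions, not the authors' implementation: the article does not specify the fusion operator, the injection depth, or the artifact encoder, so the code assumes plain gated addition, random linear residual blocks in place of transformer layers, and hypothetical names (`make_layer`, `apply_layer`, `forward`, `inject_at`, `gate`).

```python
import random


def make_layer(dim, rng, scale=0.1):
    """Random dim x dim matrix standing in for one transformer block (toy assumption)."""
    return [[rng.gauss(0.0, scale) for _ in range(dim)] for _ in range(dim)]


def apply_layer(weights, tokens):
    """Toy residual block: each token vector h becomes h + W h."""
    out = []
    for h in tokens:
        update = [sum(w * x for w, x in zip(row, h)) for row in weights]
        out.append([x + u for x, u in zip(h, update)])
    return out


def forward(tokens, artifact_feats, layers, inject_at, gate=1.0):
    """Return the output of every layer, adding artifact features to the
    hidden states just before layer `inject_at` (assumed fusion: gated addition)."""
    hidden, h = [], tokens
    for i, weights in enumerate(layers):
        if i == inject_at:
            h = [[x + gate * a for x, a in zip(tok, art)]
                 for tok, art in zip(h, artifact_feats)]
        h = apply_layer(weights, h)
        hidden.append(h)
    return hidden


rng = random.Random(0)
dim, n_tokens, n_layers, inject_at = 4, 3, 6, 3
layers = [make_layer(dim, rng) for _ in range(n_layers)]
tokens = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_tokens)]
artifacts = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_tokens)]

# In this sketch, layers before the injection point are treated as frozen
# and the remaining layers as trainable.
trainable = [i >= inject_at for i in range(n_layers)]

without_injection = forward(tokens, artifacts, layers, inject_at, gate=0.0)
with_injection = forward(tokens, artifacts, layers, inject_at, gate=1.0)

# Layers before the injection point produce identical outputs either way.
early_unchanged = without_injection[:inject_at] == with_injection[:inject_at]
print("early layers unchanged:", early_unchanged)
print("trainable flags:", trainable)
```

Because the artifact features enter only at `inject_at`, the outputs of the earlier layers are the same with or without injection, which mirrors the stated goal of leaving early semantic processing intact while later layers see both kinds of signal.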
Experimental Results
The paper reports extensive experiments showing that Deep-VRM achieves state-of-the-art results across most benchmarks. The code and data are available alongside the arXiv publication under a CC BY 4.0 license.
Implications for Enterprise AI
For enterprise technology leaders deploying MLLMs in document verification, fraud detection, or content moderation, the ability to detect AI-generated images without compromising semantic performance is crucial. Deep-VRM offers a way to add forensic capabilities while preserving the model's pre-trained semantic understanding, potentially reducing error rates in automated inspection and validation processes. Although the paper focuses on image forensics, the residual injection technique could be adapted to other modalities and domains where low-level signals need to be preserved alongside high-level understanding.