Deep Residual Injection Method Enables Full-Spectrum Forensic AI Detection in Multimodal Models

Researchers propose Deep Visual Residual MLLM (Deep-VRM), a method that injects low-level artifact signals into multimodal large language models without disrupting pre-trained semantic knowledge. The approach achieves state-of-the-art detection of AI-generated images across multiple benchmarks.

iGEN Editorial

June 16, 2026


As AI-generated images become increasingly realistic, semantic-level inconsistency checks alone are no longer sufficient for reliable detection. A new research paper introduces Deep Visual Residual MLLM (Deep-VRM), a method that lets multimodal large language models (MLLMs) capture full-spectrum forensic signals, including low-level generator artifacts, while retaining their pre-trained semantic understanding.

The work, posted on arXiv under the title "Deep Residual Injection for Full-Spectrum Forensic Signal Perception in Multimodal Large Language Models," addresses a critical limitation: fine-tuning MLLMs for artifact learning typically disrupts the semantic knowledge the models already hold. The authors (Kaiqing Lin, Zhiyuan Yan, Ruoxin Chen, Ke-Yue Zhang, Piao Zhou, Caiyong, Bin, Taiping Yao, Bo Wang, Youchang Xiao, and Shouhong Ding) conducted a layer-wise analysis of forensic signal perception in MLLMs. They found that semantic information is primarily formed in the early-to-middle layers, and that direct fine-tuning for artifact learning disrupts these representations.

The Challenge of Full-Spectrum Perception

MLLMs have been increasingly adopted in forensics because of their robust semantic understanding. As generated images grow more realistic, however, relying solely on semantic-level inconsistencies is often insufficient. The researchers therefore ask whether MLLMs can achieve full-spectrum forensic signal perception, that is, capture low-level generator artifacts without sacrificing pre-trained semantic knowledge.

Deep Residual Injection Method

To address this, the team proposes Deep-VRM. The architecture preserves early semantic processing and injects artifact-specific visual signals through a residual path into an intermediate layer. These artifact signals are fused with the semantic token representations and propagated through the subsequent trainable layers, so the later layers can jointly model semantic reasoning and signal-level forensic cues. The authors describe one result as surprising: the model learns to adaptively leverage different levels of forensic signals depending on the input, which they report yields robust and generalizable detection performance.
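The idea of injecting a residual signal partway through a layer stack can be illustrated with a toy sketch. This is not the authors' implementation: the article does not give the injection layer index, the fusion operator, or the form of the artifact encoder, so the sketch assumes plain element-wise addition, a hypothetical injection index INJECT_AT, and toy linear layers standing in for transformer blocks. All names (make_layer, forward_with_injection, trainable) are illustrative.

```python
import random


def make_layer(dim, seed):
    """Toy stand-in for a transformer block: residual update x + W x,
    where W is a fixed random matrix (an assumption for illustration)."""
    rng = random.Random(seed)
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(dim)]

    def layer(tokens):
        out = []
        for tok in tokens:
            update = [sum(w * x for w, x in zip(row, tok)) for row in weights]
            out.append([x + u for x, u in zip(tok, update)])
        return out

    return layer


def forward_with_injection(tokens, artifact, layers, inject_at):
    """Run tokens through the layers, adding `artifact` to the hidden state
    just before layer index `inject_at`. Returns the hidden state after
    every layer so early and late layers can be compared."""
    hidden = tokens
    states = []
    for idx, layer in enumerate(layers):
        if idx == inject_at:
            # Assumed fusion: element-wise addition of per-token artifact features.
            hidden = [[h + a for h, a in zip(tok, art)]
                      for tok, art in zip(hidden, artifact)]
        hidden = layer(hidden)
        states.append(hidden)
    return states


DIM, NUM_LAYERS, INJECT_AT = 4, 6, 3  # hypothetical sizes and injection point
layers = [make_layer(DIM, seed=i) for i in range(NUM_LAYERS)]

# Layers before INJECT_AT would stay frozen; layers from INJECT_AT onward
# would be trained. This toy only records the split, it does not train.
trainable = [i >= INJECT_AT for i in range(NUM_LAYERS)]

tokens = [[1.0, 0.0, 0.5, -0.5], [0.2, 0.3, -0.1, 0.4]]
artifact = [[0.05, -0.02, 0.0, 0.01], [0.0, 0.03, -0.04, 0.02]]
zero_artifact = [[0.0] * DIM for _ in tokens]

with_signal = forward_with_injection(tokens, artifact, layers, INJECT_AT)
without_signal = forward_with_injection(tokens, zero_artifact, layers, INJECT_AT)

# Early-layer states are identical with and without the artifact signal.
print(with_signal[:INJECT_AT] == without_signal[:INJECT_AT])
# From the injection layer onward, the states diverge.
print(with_signal[INJECT_AT] != without_signal[INJECT_AT])
```

The design point the sketch is meant to show is structural: because the artifact signal enters only at the intermediate layer, the hidden states produced by the earlier layers are the same whether or not the signal is present, which is one way to read the article's claim that early semantic processing is preserved.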

"Semantic information is primarily formed in the early-to-middle layers, whereas direct fine-tuning for artifact learning disrupts these semantic representations."

Experimental Results

The paper reports extensive experiments showing that Deep-VRM achieves state-of-the-art results across most benchmarks. The code and data are available alongside the arXiv publication under a CC BY 4.0 license.

Implications for Enterprise AI

For enterprise technology leaders deploying MLLMs in document verification, fraud detection, or content moderation, the ability to detect AI-generated images without compromising semantic performance is crucial. Deep-VRM offers a method to enhance forensic capabilities while maintaining the model's general intelligence, potentially reducing error rates in automated inspection and validation processes. Although the paper focuses on image forensics, the residual injection technique could be adapted to other modalities and domains where low-level signals need to be preserved alongside high-level understanding.

