Training-Free Framework Uses XAI and Multimodal LLMs to Generate Grounded Explanations for Speech Deepfake Detection

Researchers propose a training-free explanation framework that integrates XAI evidence with multimodal large language models to generate grounded and specific explanations for speech deepfake detection. Using the PartialSpoof dataset, the method increases inside accuracy by over 45%, verified through human evaluation and faithfulness checks.

iGEN Editorial

June 16, 2026

Enterprise systems increasingly rely on artificial intelligence to detect deepfake speech, but the black-box nature of many models undermines trust. A new research paper from an international team of computer scientists tackles this challenge by proposing a training-free explanation framework that combines explainable AI (XAI) evidence with multimodal large language models (LLMs) to generate human-readable, grounded explanations for speech deepfake detection (SDD) decisions.

The Challenge of Explainable AI in Speech Deepfake Detection

According to the paper, published on arXiv, existing explanation methods for SDD fall into two categories, each with significant limitations. Traditional XAI approaches, such as gradient-based attribution, produce low-level attribution signals that are tightly coupled with model decisions. While technically faithful, these signals are harder for humans to understand than natural language explanations. On the other hand, LLM-based explanation generation often produces generic and ungrounded descriptions. This stems from a lack of heuristic evidence and task-specific supervision, as there are limited grounded explanation datasets for SDD.
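To make "low-level attribution signals" concrete, here is a minimal sketch of one common gradient-based method, gradient-times-input, applied to a toy logistic scorer. The scorer, its weights, and the feature values are hypothetical and chosen only for illustration; they are not the detector or the attribution method used in the paper.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def gradient_x_input(weights, bias, features):
    """Gradient-times-input attribution for a toy logistic 'spoof' scorer.

    score = sigmoid(w . x + b), so d(score)/d(x_i) = score * (1 - score) * w_i.
    Each attribution is that gradient multiplied by the input value x_i.
    """
    z = sum(w * x for w, x in zip(weights, features)) + bias
    score = sigmoid(z)
    grads = [score * (1.0 - score) * w for w in weights]
    attributions = [g * x for g, x in zip(grads, features)]
    return score, attributions


# Hypothetical values: one feature per audio frame (illustrative only).
weights = [0.5, -1.0, 2.0, 0.1]
features = [1.0, 0.5, 1.5, 0.0]

score, attributions = gradient_x_input(weights, 0.0, features)

# Index of the frame with the largest absolute attribution.
top_frame = max(range(len(attributions)), key=lambda i: abs(attributions[i]))
print(f"score={score:.3f}, attributions={attributions}, top_frame={top_frame}")
```

The output is a list of signed numbers, one per input frame. It is tied directly to the scorer's computation, but a raw vector like this says little to a non-expert, which is the readability gap the paper describes.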

Two Existing Explanation Approaches

The paper contrasts the two main paradigms:

Explanation Method	Strengths	Weaknesses
Traditional XAI (e.g., gradient-based attribution)	Faithful to model decisions	Low-level, hard for humans to understand
LLM-based explanation generation	Produces natural language	Generic, ungrounded due to lack of task-specific data

The combination of both approaches has been underexplored, largely because of the scarcity of grounded explanation datasets for SDD.

The Proposed Training-Free Framework

The researchers propose a novel framework that is training-free, meaning it does not require fine-tuning or additional supervised training on labeled explanation data. Instead, it integrates XAI evidence—such as attribution signals—with multimodal LLMs. These LLMs can process both textual and non-textual inputs, allowing them to incorporate XAI-generated evidence as context when generating explanations. By grounding the LLM's output in actual model behavior, the framework produces explanations that are both specific and understandable.
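The article does not describe the exact evidence format or prompt the authors use, so the following is only a text-based sketch of the general idea: turn per-frame attribution scores into time spans and pass them to an LLM as context. The function names, the 20 ms frame length, the 0.5 threshold, and the prompt wording are all assumptions made for illustration, and no real LLM API is called.

```python
from dataclasses import dataclass


@dataclass
class Region:
    start_s: float
    end_s: float
    mean_score: float


def salient_regions(scores, frame_s=0.02, threshold=0.5):
    """Group consecutive frames whose attribution is at or above the threshold."""
    regions, start = [], None
    # A trailing sentinel closes a run that extends to the last frame.
    for i, s in enumerate(list(scores) + [float("-inf")]):
        if s >= threshold and start is None:
            start = i
        elif s < threshold and start is not None:
            run = scores[start:i]
            regions.append(
                Region(round(start * frame_s, 3), round(i * frame_s, 3), sum(run) / len(run))
            )
            start = None
    return regions


def build_prompt(decision, regions):
    """Assemble a prompt that restricts the explanation to the attributed spans."""
    lines = [
        f"The detector labeled this utterance as: {decision}.",
        "Attribution evidence (time spans the detector relied on most):",
    ]
    for r in regions:
        lines.append(
            f"- {r.start_s:.2f}s to {r.end_s:.2f}s (mean attribution {r.mean_score:.2f})"
        )
    lines.append("Explain the decision, referring only to the spans listed above.")
    return "\n".join(lines)


# Hypothetical, already-normalized attribution scores, one per 20 ms frame.
scores = [0.1, 0.7, 0.9, 0.2, 0.1, 0.6, 0.8, 0.8]
regions = salient_regions(scores)
prompt = build_prompt("spoof", regions)
print(prompt)
```

Because the evidence is supplied at inference time as part of the prompt, a pipeline of this shape needs no fine-tuning, which is consistent with the training-free property described above. A multimodal model could receive the audio or a visual rendering of the attributions alongside this text, but that step is not shown here.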

Dataset and Experimental Results

To evaluate their approach, the team constructed a grounded explanation dataset from PartialSpoof, a dataset designed for speech deepfake detection tasks. The results show that methods incorporating XAI evidence increase "inside accuracy" by over 45%. This metric reflects how well the generated explanations align with the model's internal decision process. The improvements were verified through two evaluation channels: human evaluation (assessing readability and relevance) and faithfulness checks (measuring how accurately the explanation reflects the model's actual reasoning).

Implications for Enterprise AI Trustworthiness

For enterprise technology leaders overseeing AI-driven voice authentication, fraud detection, or voice-based interfaces, the ability to generate trustworthy explanations is critical. The training-free nature of the proposed framework means it can be deployed without expensive retraining or large labeled datasets, lowering the barrier to adoption. By combining traditional XAI's faithfulness with LLMs' natural language capabilities, the approach addresses a key gap in explainable AI for speech deepfake detection. The authors, Yupei Li, Qiyang Sun, Xiaoliang Wu, Chenxi Wang, Berrak Sisman, and Björn W. Schuller, demonstrate that integrating these two paradigms yields measurable accuracy gains while maintaining human interpretability.
