Enterprise systems increasingly rely on artificial intelligence to detect deepfake speech, but the black-box nature of many models undermines trust. A new research paper from an international team of computer scientists tackles this challenge by proposing a training-free explanation framework that combines explainable AI (XAI) evidence with multimodal large language models (LLMs) to generate human-readable, grounded explanations for speech deepfake detection (SDD) decisions.
The Challenge of Explainable AI in Speech Deepfake Detection
According to the paper, published on arXiv, existing explanation methods for SDD fall into two categories, each with significant limitations. Traditional XAI approaches, such as gradient-based attribution, produce low-level attribution signals that are tightly coupled with model decisions. While technically faithful, these signals are harder for humans to understand than natural language explanations. On the other hand, LLM-based explanation generation often produces generic and ungrounded descriptions. This stems from a lack of heuristic evidence and task-specific supervision, as there are limited grounded explanation datasets for SDD.
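To make "low-level attribution signal" concrete, here is a minimal sketch of one common gradient-based method, gradient-times-input, applied to a toy logistic scorer. The model, weights, and per-frame features are hypothetical stand-ins chosen for illustration. They are not the detector or the attribution method used in the paper.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def gradient_x_input(weights, features, bias=0.0):
    """Gradient-times-input attribution for a toy logistic 'spoof' scorer.

    score = sigmoid(sum_i w_i * x_i + b), so d(score)/d(x_i) = score * (1 - score) * w_i.
    The attribution of frame i is that gradient multiplied by the input x_i.
    """
    z = sum(w * x for w, x in zip(weights, features)) + bias
    score = sigmoid(z)
    slope = score * (1.0 - score)  # derivative of the sigmoid at z
    attributions = [slope * w * x for w, x in zip(weights, features)]
    return score, attributions


# Hypothetical values: one weight and one feature per audio frame.
weights = [0.1, 0.2, 1.5, 1.4, 0.1]
features = [0.3, 0.2, 0.9, 0.8, 0.1]

score, attributions = gradient_x_input(weights, features)
print(f"spoof score: {score:.3f}")
print("per-frame attribution:", [round(a, 4) for a in attributions])
```

The output is a list of numbers, one per frame. It is tied directly to the model's computation, but a reader still has to work out what a high value at a given frame means, which is the readability gap the article describes.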
Two Existing Explanation Approaches
The paper contrasts the two main paradigms:
| Explanation Method | Strengths | Weaknesses |
|---|---|---|
| Traditional XAI (e.g., gradient-based attribution) | Faithful to model decisions | Low-level, hard for humans to understand |
| LLM-based explanation generation | Produces natural language | Generic, ungrounded due to lack of task-specific data |
The combination of both approaches has been underexplored, largely because of the scarcity of grounded explanation datasets for SDD.
The Proposed Training-Free Framework
The researchers propose a training-free framework: it requires no fine-tuning and no additional supervised training on labeled explanation data. Instead, it integrates XAI evidence, such as attribution signals, with multimodal LLMs. Because these LLMs can process both textual and non-textual inputs, they can take the XAI-generated evidence as context when generating explanations. Grounding the LLM's output in actual model behavior yields explanations that are both specific and understandable.
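One way to picture the grounding step is as prompt construction: attribution scores are turned into time ranges, and those ranges are handed to the LLM as context. The sketch below is an assumption about how such a step might look, not the authors' pipeline. The function names, the thresholding rule, the frame hop, and the prompt wording are all hypothetical, and no LLM is actually called.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start_s: float
    end_s: float
    mean_attribution: float


def top_segments(frame_scores, frame_hop_s, threshold):
    """Merge consecutive frames whose attribution exceeds `threshold` into time segments."""
    segments = []
    run_start = None
    # A trailing sentinel guarantees that a run reaching the last frame is closed.
    for i, s in enumerate(list(frame_scores) + [float("-inf")]):
        if s > threshold and run_start is None:
            run_start = i
        elif s <= threshold and run_start is not None:
            run = frame_scores[run_start:i]
            segments.append(
                Segment(run_start * frame_hop_s, i * frame_hop_s, sum(run) / len(run))
            )
            run_start = None
    return segments


def build_prompt(decision, segments):
    """Build the text part of a prompt that restricts the LLM to the listed evidence."""
    lines = [
        f"The detector labeled this utterance as: {decision}.",
        "Attribution evidence (time ranges the detector relied on most):",
    ]
    for seg in segments:
        lines.append(
            f"- {seg.start_s:.2f}s to {seg.end_s:.2f}s "
            f"(mean attribution {seg.mean_attribution:.2f})"
        )
    lines.append(
        "Explain the decision in plain language, referring only to the ranges listed above."
    )
    return "\n".join(lines)


# Hypothetical per-frame attribution scores with a 20 ms frame hop.
frame_scores = [0.02, 0.05, 0.71, 0.64, 0.03, 0.01, 0.55, 0.04]
segments = top_segments(frame_scores, frame_hop_s=0.02, threshold=0.5)
prompt = build_prompt("spoof", segments)
print(prompt)
```

In a multimodal setup, the audio itself or a spectrogram would be attached alongside this text. The sketch covers only the textual evidence, since that is the part that ties the explanation to what the detector actually attended to.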
Dataset and Experimental Results
To evaluate their approach, the team constructed a grounded explanation dataset from PartialSpoof, a dataset designed for speech deepfake detection tasks. The results show that methods incorporating XAI evidence increase "inside accuracy" by over 45%. This metric reflects how well the generated explanations align with the model's internal decision process. The improvements were verified through two evaluation channels: human evaluation, which assessed readability and relevance, and faithfulness checks, which measured how accurately the explanation reflects the model's actual reasoning.
Implications for Enterprise AI Trustworthiness
For enterprise technology leaders overseeing AI-driven voice authentication, fraud detection, or voice-based interfaces, the ability to generate trustworthy explanations is critical. Because the proposed framework is training-free, it can be deployed without expensive retraining or large labeled datasets, which lowers the barrier to adoption. By combining the faithfulness of traditional XAI with the natural language capabilities of LLMs, the approach addresses a key gap in explainable AI for speech deepfake detection. The authors (Yupei Li, Qiyang Sun, Xiaoliang Wu, Chenxi Wang, Berrak Sisman, and Björn W. Schuller) demonstrate that integrating these two paradigms yields measurable accuracy gains while maintaining human interpretability.