Dual-Granularity Orthogonal Disentanglement: New Framework Boosts Generalizable Audio Deepfake Detection

A new paper on arXiv proposes a dual-granularity orthogonal disentanglement framework for generalizable audio deepfake detection. The method enforces sample-level cosine orthogonality and batch-level cross-covariance regularization to avoid speaker identity leakage. Experiments show equal error rates of 1.35%, 7.88%, and 21.58% on standard benchmarks.

iGEN Editorial

June 16, 2026

Audio deepfake detectors often fail when faced with unseen speakers because they latch onto speaker-identity features rather than genuine synthesis artifacts. This phenomenon, called implicit identity leakage, limits real-world deployment. A new paper on arXiv from researchers Liu, Zhuodong, Lv, Hugen, Xiangyu, Yuan, and Chunhong tackles the problem with a dual-granularity orthogonal disentanglement framework that enforces feature independence at two complementary levels.

The Problem: Implicit Identity Leakage

Most deep learning detectors inadvertently learn to recognize the speaker rather than the forgery. According to the paper, existing methods that attempt to address this often introduce architectural complexity or training instability. The proposed framework avoids these pitfalls by imposing orthogonality constraints on learned representations without auxiliary networks or adversarial dynamics.

Dual-Granularity Orthogonal Disentanglement

The method operates at two granularities:

Sample-level cosine orthogonality captures directional decorrelation between features.
Batch-level cross-covariance regularization eliminates linear correlations across embedding dimensions.

Together, they push the model to discard speaker-specific cues and focus on synthesis artifacts. The researchers also introduce a curriculum disentanglement schedule that progressively strengthens the orthogonality constraint during training, improving stability and convergence.
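The two regularizers described above can be sketched in a few lines of NumPy. This is a minimal illustration and not the paper's implementation: it assumes the model produces two embeddings per utterance, one meant to carry synthesis artifacts and one meant to carry speaker identity. The function names, the use of squared penalties, and the mean reductions are all assumptions made for the example.

```python
import numpy as np

def sample_cosine_orthogonality(z_a, z_b, eps=1e-8):
    """Mean squared cosine similarity between paired rows of z_a and z_b.

    Assumption: both inputs have shape (batch, dim) with the same dim.
    The value is 0 when every pair of vectors is orthogonal.
    """
    a = z_a / (np.linalg.norm(z_a, axis=1, keepdims=True) + eps)
    b = z_b / (np.linalg.norm(z_b, axis=1, keepdims=True) + eps)
    cos = np.sum(a * b, axis=1)  # one cosine per sample, shape (batch,)
    return float(np.mean(cos ** 2))

def batch_cross_covariance(z_a, z_b):
    """Mean squared entry of the cross-covariance matrix over the batch.

    Assumption: inputs have shapes (batch, dim_a) and (batch, dim_b).
    The value is 0 when no dimension of z_a is linearly correlated
    with any dimension of z_b across the batch.
    """
    n = z_a.shape[0]
    a = z_a - z_a.mean(axis=0, keepdims=True)  # center each dimension
    b = z_b - z_b.mean(axis=0, keepdims=True)
    cov = a.T @ b / (n - 1)  # shape (dim_a, dim_b)
    return float(np.mean(cov ** 2))

# Hypothetical embeddings for a batch of 32 utterances.
rng = np.random.default_rng(0)
z_artifact = rng.normal(size=(32, 16))
z_speaker = rng.normal(size=(32, 16))
print(sample_cosine_orthogonality(z_artifact, z_speaker))
print(batch_cross_covariance(z_artifact, z_speaker))
```

The first penalty looks at each utterance on its own, while the second looks at statistics across the whole batch, which is why the two are complementary: a pair of embeddings can be orthogonal on average per sample and still be linearly correlated dimension by dimension.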

Experimental Results

The framework was evaluated on three standard benchmarks: ASVspoof 2019 LA, ASVspoof 2021 DF, and In-the-Wild. The results, measured by equal error rate (EER), are summarized in the table below.

Dataset	Proposed method EER	Gradient reversal EER
ASVspoof 2019 LA	1.35%	Not directly reported
ASVspoof 2021 DF	7.88%	Not directly reported
In-the-Wild	21.58%	Not directly reported

Lower EER is better. Beyond the per-benchmark figures, the paper reports that the method surpasses gradient reversal disentanglement by 2.60 percentage points of absolute EER on cross-dataset transfer. That test is particularly important for generalizability, because the model is evaluated on a dataset other than the one it was trained on.
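For readers unfamiliar with the metric, the EER is the error rate at the decision threshold where the false acceptance rate (spoofed audio accepted as genuine) equals the false rejection rate (genuine audio rejected). The sketch below is a simple threshold sweep on made-up scores, assuming higher scores mean "more likely bona fide". It is not the scoring code used in the paper, and benchmark tooling may interpolate between thresholds rather than pick the closest one.

```python
def equal_error_rate(bonafide_scores, spoof_scores):
    """Return the EER in [0, 1] from two lists of detector scores.

    Assumption: a higher score means the detector believes the audio
    is more likely bona fide.
    """
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    best_gap, eer = float("inf"), 1.0
    for t in thresholds:
        # False rejection: bona fide audio scored below the threshold.
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        # False acceptance: spoofed audio scored at or above the threshold.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            # Keep the threshold where the two error rates are closest.
            best_gap, eer = gap, (far + frr) / 2
    return eer

# Made-up scores: one bona fide clip scores low, one spoof scores high.
bonafide = [0.9, 0.8, 0.7, 0.4]
spoof = [0.6, 0.3, 0.2, 0.1]
print(equal_error_rate(bonafide, spoof))  # 0.25
```

In this toy case one of four genuine clips is rejected and one of four spoofed clips is accepted at the balancing threshold, giving an EER of 25%.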

Technical Approach Details

The architecture avoids auxiliary networks and adversarial training, which are common in prior work. Instead, it uses a straightforward loss function that combines the two orthogonality regularizers with a standard classification loss. The curriculum schedule starts with a weak orthogonality penalty and ramps up the weight over training epochs, allowing the model to first learn basic discriminative features before being forced to disentangle.
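One way to picture the combined objective and the curriculum is a classification loss plus a weighted sum of the two regularizers, with the weight growing over epochs. The sketch below assumes a linear ramp and a single shared weight for both regularizers. The paper's actual schedule shape, weight values, and loss weighting are not given in this article, so every name and default here is hypothetical.

```python
def orthogonality_weight(epoch, ramp_epochs, min_weight=0.01, max_weight=1.0):
    """Linearly ramp the penalty weight from min_weight to max_weight.

    Assumption: the ramp is linear and finishes after ramp_epochs epochs,
    after which the weight stays at max_weight.
    """
    progress = min(max(epoch / ramp_epochs, 0.0), 1.0)
    return min_weight + (max_weight - min_weight) * progress

def total_loss(cls_loss, sample_loss, batch_loss, epoch, ramp_epochs):
    """Classification loss plus the curriculum-weighted regularizers."""
    weight = orthogonality_weight(epoch, ramp_epochs)
    return cls_loss + weight * (sample_loss + batch_loss)

# The penalty weight over a hypothetical 10-epoch ramp.
for epoch in (0, 5, 10, 20):
    print(epoch, orthogonality_weight(epoch, ramp_epochs=10))
```

Early in training the classifier term dominates, so the model can learn discriminative features first. As the weight grows, the orthogonality terms contribute more to the gradient and push the representations apart.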

Implications for Enterprise Security

While the paper does not discuss specific enterprise applications, robust audio deepfake detection is critical for voice-based authentication systems, call center fraud prevention, and media verification. The framework's improved cross-dataset generalization suggests it may handle a wider variety of synthetic voices without retraining, which would reduce operational overhead. The 2.60 percentage point improvement over gradient reversal disentanglement on cross-dataset transfer is a step toward production-ready detectors.

For decision-makers evaluating audio security solutions, the key metrics are the equal error rates on diverse benchmarks. The In-the-Wild result (21.58% EER) is the most relevant of the three for real-world variability, though it remains far above the 1.35% achieved on ASVspoof 2019 LA. The lack of auxiliary networks may also simplify integration into existing ML pipelines.
