Audio deepfake detectors often fail when faced with unseen speakers because they latch onto speaker-identity features rather than genuine synthesis artifacts. This phenomenon, called implicit identity leakage, limits real-world deployment. A new paper on arXiv from researchers Liu, Zhuodong, Lv, Hugen, Xiangyu, Yuan, and Chunhong tackles the problem with a dual-granularity orthogonal disentanglement framework that enforces feature independence at two complementary levels.
The Problem: Implicit Identity Leakage
Many deep learning detectors inadvertently learn to recognize the speaker rather than the forgery. According to the paper, existing methods that attempt to address this often introduce architectural complexity or training instability. The proposed framework avoids these pitfalls by imposing orthogonality constraints on learned representations without auxiliary networks or adversarial dynamics.
Dual-Granularity Orthogonal Disentanglement
The method operates at two granularities:
- Sample-level cosine orthogonality captures directional decorrelation between features.
- Batch-level cross-covariance regularization eliminates linear correlations across embedding dimensions.
Together, they push the model to discard speaker-specific cues and focus on synthesis artifacts. The researchers also introduce a curriculum disentanglement schedule that progressively strengthens the orthogonality constraint during training, improving stability and convergence.
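To make the two regularizers concrete, here is a minimal, dependency-free sketch of how they might be computed. It is not the authors' implementation. It assumes each sample yields two embeddings of equal dimension, an artifact embedding and a speaker embedding (the names `z_art` and `z_spk` are hypothetical), and it assumes squared penalties, which this summary of the paper does not specify.

```python
import math

def cosine_orthogonality_loss(z_art, z_spk, eps=1e-8):
    """Sample-level term: mean squared cosine similarity of paired embeddings.

    Assumption: z_art[i] and z_spk[i] come from the same utterance.
    """
    total = 0.0
    for a, s in zip(z_art, z_spk):
        dot = sum(x * y for x, y in zip(a, s))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_s = math.sqrt(sum(y * y for y in s))
        cos = dot / (norm_a * norm_s + eps)
        total += cos * cos
    return total / len(z_art)

def cross_covariance_loss(z_art, z_spk):
    """Batch-level term: mean squared entry of the cross-covariance matrix."""
    n = len(z_art)
    d_a, d_s = len(z_art[0]), len(z_spk[0])
    # Per-dimension batch means, used to center both embedding sets.
    mean_a = [sum(row[i] for row in z_art) / n for i in range(d_a)]
    mean_s = [sum(row[j] for row in z_spk) / n for j in range(d_s)]
    total = 0.0
    for i in range(d_a):
        for j in range(d_s):
            cov = sum((a[i] - mean_a[i]) * (s[j] - mean_s[j])
                      for a, s in zip(z_art, z_spk)) / (n - 1)
            total += cov * cov
    return total / (d_a * d_s)
```

The first function looks at each sample in isolation, so it only penalizes directional alignment between a pair of vectors. The second looks across the whole batch, so it can catch a linear relationship between any artifact dimension and any speaker dimension even when individual pairs look orthogonal. In practice both would be written with a tensor library so that gradients flow through them.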
Experimental Results
The framework was evaluated on three standard benchmarks: ASVspoof 2019 LA, ASVspoof 2021 DF, and In-the-Wild. The results, measured by equal error rate (EER), are summarized in the table below.
| Dataset | Proposed method EER | Gradient reversal baseline EER |
|---|---|---|
| ASVspoof 2019 LA | 1.35% | Not reported |
| ASVspoof 2021 DF | 7.88% | Not reported |
| In-the-Wild | 21.58% | Not reported |
The cross-dataset transfer test is particularly important for judging generalization, because it scores a detector on data from a different dataset than the one it was trained on. On that task, the paper reports that the approach surpasses gradient reversal disentanglement by 2.60 percentage points of EER (absolute, not relative).
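For readers unfamiliar with the metric, the EER is the operating point at which the false acceptance rate equals the false rejection rate. The sketch below approximates it with a simple threshold sweep. It assumes higher scores mean "more likely bona fide" and uses hypothetical function and argument names; evaluation toolkits typically interpolate between thresholds rather than picking the closest one.

```python
def equal_error_rate(bonafide_scores, spoof_scores):
    """Approximate EER; assumes a higher score means 'more likely bona fide'."""
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    best_gap, eer = float("inf"), None
    for t in thresholds:
        # False rejection: a bona fide utterance scored below the threshold.
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        # False acceptance: a spoofed utterance scored at or above it.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            # Keep the threshold where the two error rates are closest.
            best_gap, eer = gap, (far + frr) / 2
    return eer
```

A lower EER is better: 0% means the two score distributions are perfectly separable, and 50% is chance level for a binary detector.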
Technical Approach Details
The architecture avoids auxiliary networks and adversarial training, which are common in prior work. Instead, it uses a straightforward loss function that combines the two orthogonality regularizers with a standard classification loss. The curriculum schedule starts with a weak orthogonality penalty and ramps up the weight over training epochs, allowing the model to first learn basic discriminative features before being forced to disentangle.
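The curriculum idea can be sketched as a weight that grows with the epoch and scales the regularizers before they are added to the classification loss. This is an illustration under assumptions, not the paper's schedule: the linear ramp, the `start_weight`, `max_weight`, and `ramp_fraction` parameters, and the unweighted sum of the two regularizers are all hypothetical choices.

```python
def curriculum_weight(epoch, total_epochs, start_weight=0.1,
                      max_weight=1.0, ramp_fraction=0.5):
    """Linearly ramp the orthogonality weight, then hold it at max_weight.

    Assumption: a linear ramp over the first `ramp_fraction` of training.
    """
    ramp_epochs = max(1, int(total_epochs * ramp_fraction))
    progress = min(1.0, epoch / ramp_epochs)
    return start_weight + (max_weight - start_weight) * progress

def total_loss(cls_loss, cos_loss, cov_loss, epoch, total_epochs):
    """Classification loss plus curriculum-weighted orthogonality terms."""
    lam = curriculum_weight(epoch, total_epochs)
    return cls_loss + lam * (cos_loss + cov_loss)
```

With these defaults the penalty starts at 0.1, reaches 1.0 halfway through training, and stays there. Early epochs are therefore dominated by the classification loss, which matches the stated intent of learning discriminative features first and disentangling later.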
Implications for Enterprise Security
While the paper does not discuss specific enterprise applications, robust audio deepfake detection matters for voice-based authentication systems, call center fraud prevention, and media verification. The framework's improved cross-dataset generalization suggests it may cope with a wider variety of synthetic voices before retraining becomes necessary, which could reduce operational overhead. The reported 2.60 percentage point gain over gradient reversal disentanglement on cross-dataset transfer is a meaningful step toward production-ready detectors.
For decision-makers evaluating audio security solutions, the key metrics are the equal error rates on diverse benchmarks. The 21.58% EER on the challenging In-the-Wild dataset, against 1.35% and 7.88% on the two ASVspoof benchmarks, shows that real-world audio remains much harder, so that figure should be weighed against competing systems' results on the same dataset rather than read in isolation. The lack of auxiliary networks also simplifies integration into existing ML pipelines.