Recent spatial self-supervised audio models have achieved high performance on localization tasks, but new research suggests that their encoding of microsecond-scale interaural phase fine structure may be less genuine than previously assumed. A team led by Chen, Yuxuan, Haoyuan, He, and Peize, in a paper titled "Spectro-Temporal Interference Confounds Phase Encoding in Spatial Audio Foundation Models" posted on arXiv, proposes a psychoacoustic benchmark based on the binaural masking level difference (BMLD) to evaluate this capability.
The Problem of Phase Encoding in Spatial Audio
Spatial audio foundation models are designed to understand the direction and location of sounds, a task that in biological hearing relies heavily on interaural time differences (ITDs): delays between the two ears on the order of tens to hundreds of microseconds. The researchers hypothesized that modern self-supervised learning (SSL) models might not actually compute phase differences but instead exploit other cues. To test this, they constructed a benchmark using BMLD, a well-known psychoacoustic phenomenon in which a tone in noise becomes easier to detect when the interaural phase of the tone differs from that of the noise. In the classic case, the noise is identical at both ears while the tone is inverted at one ear. Because each ear on its own receives the same nominal tone-to-noise ratio in either configuration, BMLD provides a direct measure of sensitivity to interaural phase fine structure.
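The two stimulus conditions behind a BMLD measurement can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's stimulus pipeline: the function name `make_bmld_pair` and every parameter value (sample rate, duration, tone frequency, signal-to-noise ratio) are assumptions chosen for the example.

```python
import numpy as np

def make_bmld_pair(fs=16000, dur=0.5, tone_hz=500.0, snr_db=-10.0, seed=0):
    """Return (n0s0, n0spi), each shaped (2, n_samples); rows are left, right.

    All defaults are illustrative assumptions, not values from the paper.
    """
    rng = np.random.default_rng(seed)
    n = int(fs * dur)
    t = np.arange(n) / fs
    noise = rng.standard_normal(n)  # identical noise at both ears (N0)
    tone = np.sin(2 * np.pi * tone_hz * t)
    # Scale the tone so that tone power / noise power equals snr_db.
    tone *= np.sqrt(np.mean(noise**2) / np.mean(tone**2)) * 10 ** (snr_db / 20)
    n0s0 = np.stack([noise + tone, noise + tone])   # tone in phase at both ears
    n0spi = np.stack([noise + tone, noise - tone])  # tone inverted at the right ear
    return n0s0, n0spi

n0s0, n0spi = make_bmld_pair()
```

The left channel is identical in both conditions, and the right channel differs only in the sign of the tone. Any detection advantage in the second condition therefore has to come from comparing the two channels, which is what makes the pair useful as a probe of cross-channel phase sensitivity.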
Psychoacoustic Benchmark Based on BMLD
The team used an equalization-cancellation (EC) baseline and a GCC-PHAT (generalized cross-correlation with phase transform) positive control to evaluate nine frozen audio models. These models spanned binaural SSL, monaural SSL, and neural audio codecs. The experimental setup allowed the researchers to systematically assess whether each model's representations exhibit the BMLD effect.
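For readers unfamiliar with the two reference methods, the sketch below shows textbook versions of each in NumPy. It is an illustration under stated assumptions rather than the authors' code: the function names `gcc_phat` and `ec_residual`, the sign convention, and the test signal are all hypothetical, and the EC step is idealized (it omits the internal timing and amplitude errors that classical EC models include, and it assumes the noise is already identical at both ears so that no equalization is needed).

```python
import numpy as np

def gcc_phat(left, right, fs, max_tau=None):
    """Estimate the delay of `left` relative to `right`, in seconds.

    Positive values mean `left` lags `right` (a convention chosen here).
    """
    n = left.size + right.size  # zero-pad to limit circular wrap-around
    cross = np.fft.rfft(left, n=n) * np.conj(np.fft.rfft(right, n=n))
    cross /= np.abs(cross) + 1e-12  # PHAT weighting: keep phase, discard magnitude
    cc = np.fft.irfft(cross, n=n)
    max_shift = n // 2
    if max_tau is not None:
        max_shift = max(1, min(int(fs * max_tau), max_shift))
    # Reorder so that index `max_shift` corresponds to zero lag.
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return (np.argmax(cc) - max_shift) / fs

def ec_residual(left, right):
    """Idealized cancellation step for diotic noise: subtract the two ears."""
    return left - right

# Hypothetical test signal: broadband noise with `left` delayed by 5 samples.
fs = 48000
rng = np.random.default_rng(0)
noise = rng.standard_normal(4096 + 5)
left, right = noise[:-5], noise[5:]
tau = gcc_phat(left, right, fs, max_tau=1e-3)
```

At 48 kHz, a 5-sample lag corresponds to roughly 104 microseconds, which is the scale of delay the benchmark targets. The cancellation step shows why an EC-style analysis predicts a BMLD: with identical noise at both ears and the tone inverted at one ear, `ec_residual` removes the noise and leaves twice the tone, whereas with the tone in phase it removes noise and tone alike.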
Findings: General-Purpose vs. Dedicated Models
| Model Category | Number of Models | BMLD Performance | Key Observation |
|---|---|---|---|
| Monaural negative controls | 4 | Zero | Confirms binaural specificity |
| General-purpose binaural SSL | 2 | Minimal phase sensitivity | Rely on spectro-temporal interference |
| Dedicated binaural spatial SSL | 2 | Comparable to analytical baseline | Achieve true phase encoding |
According to the paper, four monaural negative controls yielded zero BMLD, confirming that binaural input is necessary for phase sensitivity. Two general-purpose binaural SSL models exhibited minimal phase sensitivity, while two dedicated binaural spatial SSL models achieved BMLD comparable to the analytical baseline. The researchers performed progressive physical ablations, which revealed that general-purpose binaural SSL models rely on spectro-temporal interference textures rather than cross-channel phase computation. This means they detect patterns in time-frequency energy distributions that correlate with phase differences, but do not actually compute interaural phase.
Implications for Audio Model Development
The findings highlight a critical distinction: high detection rates in speech tasks may reflect a confounding reliance on broadband envelopes rather than genuine phase encoding. For enterprise technology leaders evaluating audio AI solutions for applications such as teleconferencing, surveillance, or human-computer interaction, this suggests that performance on localization benchmarks does not guarantee robust phase encoding. Among the models tested, only the dedicated binaural spatial SSL models reached BMLD comparable to the analytical baseline, which the paper takes as evidence that dedicated spatial SSL is needed for tasks requiring true phase sensitivity. As audio foundation models become more pervasive in enterprise settings, understanding whether they truly encode spatial cues or merely exploit statistical textures will be essential for deployment in high-stakes applications.