Respiratory acoustic foundation models (FMs) have demonstrated strong performance in cough classification, that is, deciding whether a cough indicates a disease. However, their ability to predict continuous health quantities from cough audio, such as age, BMI, or disease probability, remains largely unexplored. This regression capability has clinical value in settings where physical measurements are unavailable, because it enables passive health monitoring. A new preprint by researchers including Mayur Sanap, Prasanna Desikan, and Edgar Lobaton introduces the first multi-model, multi-target cough regression benchmark to evaluate these models.
The Cough Regression Benchmark
The benchmark assesses five foundation models (OPERA-CT, OPERA-CE, OPERA-GT, HeAR, and M2D+Resp) on six regression targets covering age, BMI, and disease probability across multiple datasets, all under subject-disjoint protocols, meaning that no subject's recordings appear in both the training and test sets. The models are tested on three datasets: Coswara, CIDRZ, and CoughVID. Three types of regression heads are compared: linear probing, a small multi-layer perceptron (MLP-small), and a full MLP.
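To make the evaluation setup concrete, the following is a minimal sketch of a subject-disjoint split followed by a linear probe on frozen embeddings. It is not the authors' code: the function names, the ridge penalty, the 80/20 split, and the synthetic embeddings are all assumptions made for illustration, and the MLP heads are left out because the article does not describe their architecture.

```python
import numpy as np

def subject_disjoint_split(subject_ids, test_fraction=0.2, seed=0):
    """Return boolean train/test masks so that no subject lands in both sets."""
    subject_ids = np.asarray(subject_ids)
    rng = np.random.default_rng(seed)
    unique_subjects = np.unique(subject_ids)
    rng.shuffle(unique_subjects)
    n_test = max(1, int(round(test_fraction * len(unique_subjects))))
    test_mask = np.isin(subject_ids, unique_subjects[:n_test])
    return ~test_mask, test_mask

def fit_linear_probe(embeddings, targets, l2=1e-3):
    """Closed-form ridge regression on frozen embeddings.

    A column of ones is appended for the intercept; for simplicity the
    (hypothetical) l2 penalty is applied to the intercept as well.
    """
    X = np.hstack([embeddings, np.ones((len(embeddings), 1))])
    A = X.T @ X + l2 * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ targets)

def predict_linear_probe(weights, embeddings):
    X = np.hstack([embeddings, np.ones((len(embeddings), 1))])
    return X @ weights

# Synthetic stand-in for foundation-model embeddings (assumption, not real data):
# 40 subjects with 3 recordings each, 16-dimensional embeddings.
rng = np.random.default_rng(0)
subject_ids = np.repeat(np.arange(40), 3)
embeddings = rng.normal(size=(len(subject_ids), 16))
ages = 40 + 5 * embeddings[:, 0] + rng.normal(scale=1.0, size=len(subject_ids))

train_mask, test_mask = subject_disjoint_split(subject_ids)
weights = fit_linear_probe(embeddings[train_mask], ages[train_mask])
pred = predict_linear_probe(weights, embeddings[test_mask])
test_mae = float(np.mean(np.abs(pred - ages[test_mask])))
print(f"held-out subjects: {len(np.unique(subject_ids[test_mask]))}, MAE: {test_mae:.2f} yr")
```

Splitting by subject rather than by recording matters because several coughs from one person share speaker-specific traits, and a recording-level split would let the head memorize those traits and report an optimistic error.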
According to the arXiv preprint, MLP-small beats the mean-predictor baseline on all tasks and outperforms linear probing in 23 of the 30 model × task cases (five models times six targets). The full MLP, by contrast, overfits on small clinical datasets and recovers on larger ones, which points to a trade-off between dataset size and head capacity.
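For readers unfamiliar with the mean-predictor baseline, the sketch below shows how such a comparison is typically computed: the baseline always predicts the training-set mean, and a regression head is only useful if its MAE is lower. The helper names and the toy ages are hypothetical and are not taken from the preprint.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error between two equal-length sequences."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

def mean_predictor_mae(y_train, y_test):
    """MAE obtained by predicting the training-set mean for every test item."""
    constant = float(np.mean(y_train))
    return mae(y_test, np.full(len(y_test), constant))

# Toy ages in years (illustrative values only).
y_train = [20.0, 30.0, 40.0, 50.0]   # training mean is 35
y_test = [25.0, 45.0]
model_pred = [28.0, 41.0]            # output of some hypothetical regression head

baseline_mae = mean_predictor_mae(y_train, y_test)  # -> 10.0
model_mae = mae(y_test, model_pred)                 # -> 3.5
beats_baseline = model_mae < baseline_mae           # -> True
```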
Model Performance by Target
| Model | Age regression (Coswara) | Age regression (CIDRZ) | Key observation |
|---|---|---|---|
| HeAR | 9.12 yr MAE | Excluded due to pretraining overlap | Leads within-dataset age regression on Coswara |
| OPERA-GT | Favored over OPERA-CT | Favored over OPERA-CT, margin within seed variance | Generative pretraining advantage carries over from breath to cough |
| M2D+Resp | Near-full performance at N=50 | Similar | Data-efficient on small samples |
According to the authors, HeAR leads within-dataset age regression on Coswara with a mean absolute error (MAE) of 9.12 years. However, HeAR's results on CIDRZ are excluded from headline claims due to possible overlap between HeAR's pretraining data and CIDRZ. OPERA-GT is favored over OPERA-CT on age in all three datasets, with the CIDRZ margin within seed variance, extending a generative-pretraining advantage from breath analysis to cough.
Data Efficiency Across Models
Data efficiency varies significantly. HeAR and M2D+Resp reach near-full performance with only N=50 samples, while OPERA models require N=400 samples to achieve comparable results. This makes HeAR and M2D+Resp particularly attractive for deployment in low-data scenarios, such as emerging outbreak monitoring in under-resourced regions.
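A data-efficiency comparison of this kind is usually run as a learning curve: the head is retrained on nested subsets of the training data and scored on a fixed test set. The sketch below illustrates that procedure under stated assumptions. The ridge head, the subset sizes, and the synthetic embeddings are placeholders and do not reproduce the preprint's setup. In a real benchmark the subsets would also be drawn by subject so that the split stays subject-disjoint.

```python
import numpy as np

def ridge_fit_predict(X_train, y_train, X_test, l2=1.0):
    """Fit ridge regression on centered targets and predict on X_test."""
    mu = y_train.mean()
    A = X_train.T @ X_train + l2 * np.eye(X_train.shape[1])
    w = np.linalg.solve(A, X_train.T @ (y_train - mu))
    return X_test @ w + mu

def learning_curve(X_train, y_train, X_test, y_test, sizes, seed=0):
    """Return {N: test MAE} for heads trained on the first N shuffled samples."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X_train))
    curve = {}
    for n in sizes:
        idx = order[:n]  # nested subsets: each larger N contains the smaller ones
        pred = ridge_fit_predict(X_train[idx], y_train[idx], X_test)
        curve[n] = float(np.mean(np.abs(pred - y_test)))
    return curve

# Synthetic embeddings and targets (assumption, for illustration only).
rng = np.random.default_rng(0)
dim = 32
X_train = rng.normal(size=(600, dim))
X_test = rng.normal(size=(200, dim))
true_w = rng.normal(size=dim)
y_train = 40 + X_train @ true_w + rng.normal(size=600)
y_test = 40 + X_test @ true_w + rng.normal(size=200)

curve = learning_curve(X_train, y_train, X_test, y_test, sizes=[50, 100, 200, 400])
for n, err in curve.items():
    print(f"N={n:4d}  MAE={err:.2f}")
```

A model whose curve is already flat at N=50 is "data-efficient" in the sense used above, while one that keeps improving up to N=400 needs more labelled data to reach the same error.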
Asymmetric Cross-Dataset Transfer
Cross-dataset transfer performance is strongly asymmetric. The study reports that large, diverse data generalises to small clinical populations (e.g., CoughVID to CIDRZ yields a bias of -0.17 years), but transfer in the opposite direction fails (CIDRZ to Coswara leads to a bias of +2.43 years, reported as a 26.6% increase). This highlights the importance of using large, diverse training datasets when building regression models for cough audio.
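The article does not state what the 26.6% figure is measured against. As a purely arithmetic check, and not a claim about the authors' method, the sketch below expresses a shift in years as a fraction of a reference value. Using the 9.12-year Coswara MAE quoted earlier as the reference is an assumption; it happens to reproduce the reported percentage, which suggests but does not confirm that reading.

```python
def relative_shift(shift_years, reference_years):
    """Express a shift in years as a fraction of a reference error in years."""
    return shift_years / reference_years

# Assumption: the reference is the 9.12 yr within-dataset MAE quoted in the article.
reference_mae = 9.12
shift = 2.43

relative = relative_shift(shift, reference_mae)
print(f"{relative:.1%}")
```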
For enterprise technology decision-makers, these findings have practical implications. When deploying respiratory acoustic AI in clinical or remote monitoring systems, the right choice of foundation model and regression head depends on dataset size and target population. HeAR and M2D+Resp offer data efficiency on small labelled datasets, while OPERA models need larger datasets to reach comparable results. The asymmetric transfer results underscore the need to match training data to the deployment population.