Federated learning (FL) promises to advance medical image segmentation by enabling collaborative model training across institutions without sharing sensitive patient data. However, real-world deployment is frequently complicated by label imperfections such as contour disagreement, missing or additional structures, and confused labels. Federated noisy label learning (FNLL) aims to mitigate these effects. Yet according to a new paper on arXiv, it remains underused in practice because the existing evidence rests largely on synthetic noise, simplified settings, and limited evaluation on real-world noisy data.
The Real-World Label Noise Problem
The research team (Markus Bujotzek, Dimitrios Bounias, Stefan Denner, Ralf Floca, Maximilian Fischer, Peter Neher, and Klaus Maier-Hein) highlights that current FNLL evaluations do not reflect deployment realities. The typical approach of injecting synthetic noise into clean labels fails to capture the complexity of actual annotation errors, which vary across sites and imaging modalities. Key noise types encountered in practice include:
- Contour disagreement: Different annotators outline structures inconsistently.
- Missing or additional structures: Some labels omit lesions or include artifacts.
- Confused labels: Misclassification of tissue types or organs.
These imperfections can significantly degrade model performance, particularly when data is distributed across multiple clients in a federated setting.
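To make the noise types above concrete, the following minimal sketch measures how much two annotations of the same structure disagree, using the Dice overlap that is standard in segmentation work. The masks, the function name `dice_score`, and the one-column offset are illustrative assumptions, not data or code from the paper.

```python
import numpy as np


def dice_score(mask_a: np.ndarray, mask_b: np.ndarray) -> float:
    """Dice overlap between two binary masks; 1.0 means identical masks."""
    a = mask_a.astype(bool)
    b = mask_b.astype(bool)
    denom = a.sum() + b.sum()
    if denom == 0:
        # Both masks are empty: treat this as perfect agreement.
        return 1.0
    return float(2.0 * np.logical_and(a, b).sum() / denom)


# Hypothetical example: two annotators outline the same 4x4 structure,
# but the second contour is shifted by one column (contour disagreement).
annotator_1 = np.zeros((8, 8), dtype=np.uint8)
annotator_1[2:6, 2:6] = 1
annotator_2 = np.zeros((8, 8), dtype=np.uint8)
annotator_2[2:6, 3:7] = 1

# A third hypothetical annotation omits the structure entirely (missing structure).
annotator_3 = np.zeros((8, 8), dtype=np.uint8)

contour_agreement = dice_score(annotator_1, annotator_2)
missing_agreement = dice_score(annotator_1, annotator_3)
print(f"contour disagreement: Dice = {contour_agreement:.2f}")
print(f"missing structure:    Dice = {missing_agreement:.2f}")
```

A shifted contour still overlaps substantially with the reference, while an omitted structure yields no overlap at all, which illustrates why the different noise types can affect training to very different degrees.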
A Benchmark Suite for Fair Comparison
To address this gap, the authors introduce a benchmark suite that combines curated real-world noisy medical image segmentation datasets from diverse sources with a comprehensive federated segmentation framework. The suite incorporates deployment-relevant client-noise scenarios, such as varying noise levels across participating sites, along with noise-targeted evaluation metrics. This provides a realistic and discriminative basis for FNLL evaluation, enabling systematic assessment and informed method selection.
| Aspect | Previous Work | This Benchmark Suite |
|---|---|---|
| Noise source | Synthetic noise | Real-world noisy datasets |
| Settings | Simplified, uniform | Diverse client-noise scenarios |
| Evaluation | Limited, not noise-focused | Label-noise-targeted metrics |
| Reproducibility | Varies | Reusable foundation with public code |
The benchmark establishes a reusable foundation for fair benchmarking, dataset-specific label-noise characterization, and future method development under realistic federated settings. The code is available at the repository linked in the paper.
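As a rough illustration of what a client-noise scenario with varying noise levels across sites could look like, the sketch below decides, per client, which training samples use a noisy annotation rather than a reference one. It assumes each sample has both annotations available; the function `assign_noisy_samples`, the site names, and the noise fractions are hypothetical and do not reflect the benchmark's actual configuration format or API.

```python
import random


def assign_noisy_samples(client_sizes, noise_fractions, seed=0):
    """Pick, per client, which sample indices are trained on noisy labels.

    client_sizes:    {client_id: number of training samples}
    noise_fractions: {client_id: fraction of samples using the noisy label}
    Both mappings are hypothetical inputs chosen for illustration.
    """
    rng = random.Random(seed)  # fixed seed keeps the scenario reproducible
    plan = {}
    for client_id, n_samples in client_sizes.items():
        n_noisy = round(noise_fractions[client_id] * n_samples)
        plan[client_id] = sorted(rng.sample(range(n_samples), n_noisy))
    return plan


# Hypothetical scenario: one clean site, one mildly noisy, one heavily noisy.
client_sizes = {"site_a": 10, "site_b": 10, "site_c": 10}
noise_fractions = {"site_a": 0.0, "site_b": 0.3, "site_c": 0.8}

plan = assign_noisy_samples(client_sizes, noise_fractions)
for client_id, noisy_indices in plan.items():
    print(f"{client_id}: {len(noisy_indices)} of {client_sizes[client_id]} samples noisy")
```

Fixing the random seed means every method under comparison sees the same noisy samples at each site, which is one simple way to keep a comparison across methods fair.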
Implications for Healthcare AI
For healthcare organizations deploying federated learning for medical imaging, this benchmark provides a tool to evaluate how different noisy-label mitigation techniques perform under realistic conditions. By moving beyond synthetic noise, practitioners can select methods that are more likely to generalize to actual annotation workflows. The framework also supports dataset-specific characterization, helping institutions understand the nature of their label errors and choose appropriate preprocessing or training strategies.
As federated learning expands in clinical deployment, the ability to handle real-world label noise becomes critical. This benchmark represents a step toward robust, trustworthy models that can be trained across institutions without compromising on data privacy or model accuracy. The authors emphasize that the suite offers a realistic and reproducible environment to drive progress in FNLL and ultimately improve automated medical image analysis.