Federated Medical Image Segmentation under Real-World Label Noise: A Benchmark Suite for Noisy Label Learning Method Selection

Federated learning enables collaborative medical image segmentation without centralizing sensitive data, but real-world label noise hampers deployment. A new benchmark suite combines diverse real-world noisy datasets, client-noise scenarios, and targeted evaluation to support systematic assessment of federated noisy label learning methods, addressing the gap left by synthetic noise studies.

iGEN Editorial

June 16, 2026

Federated learning (FL) promises to advance medical image segmentation by enabling collaborative model training across institutions without sharing sensitive patient data. However, real-world deployment is frequently complicated by label imperfections such as contour disagreement, missing or additional structures, and confused labels. Federated noisy label learning (FNLL) aims to mitigate these effects, yet remains underused in practice because existing evidence is largely based on synthetic noise, simplified settings, and limited real-world noisy evaluation, according to a new paper on arXiv.

The Real-World Label Noise Problem

The research team (Markus Bujotzek, Dimitrios Bounias, Stefan Denner, Ralf Floca, Maximilian Fischer, Peter Neher, and Klaus Maier-Hein) highlights that current FNLL evaluations do not reflect deployment realities. The typical approach of injecting synthetic noise into clean labels fails to capture the complexity of actual annotation errors, which vary across sites and imaging modalities. Key noise types encountered in practice include:

Contour disagreement: Different annotators outline structures inconsistently.
Missing or additional structures: Some labels omit lesions or include artifacts.
Confused labels: Misclassification of tissue types or organs.

These imperfections can significantly degrade model performance, particularly when data is distributed across multiple clients in a federated setting.
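To make the three noise types concrete, the sketch below shows how synthetic versions of them are often simulated on a clean label map. It is an illustration only, not the paper's code: the function names, the toy 6x6 label map, and the one-pixel dilation are all assumptions chosen for brevity. Simple perturbations of this kind are what the authors argue fall short of real annotation errors.

```python
import numpy as np

def dilate(mask: np.ndarray) -> np.ndarray:
    """One step of 4-neighbour binary dilation (grows a region by one pixel)."""
    padded = np.pad(mask, 1, mode="constant", constant_values=False)
    return (
        padded[1:-1, 1:-1]
        | padded[:-2, 1:-1]
        | padded[2:, 1:-1]
        | padded[1:-1, :-2]
        | padded[1:-1, 2:]
    )

def contour_noise(labels: np.ndarray, cls: int) -> np.ndarray:
    """Contour disagreement: over-segment class `cls` into background only."""
    grown = dilate(labels == cls)
    noisy = labels.copy()
    noisy[grown & (labels == 0)] = cls
    return noisy

def drop_structure(labels: np.ndarray, cls: int) -> np.ndarray:
    """Missing structure: relabel every pixel of class `cls` as background."""
    noisy = labels.copy()
    noisy[labels == cls] = 0
    return noisy

def confuse_labels(labels: np.ndarray, a: int, b: int) -> np.ndarray:
    """Confused labels: swap classes `a` and `b`."""
    noisy = labels.copy()
    noisy[labels == a] = b
    noisy[labels == b] = a
    return noisy

# Hypothetical toy label map: 0 = background, 1 and 2 = two structures.
clean = np.zeros((6, 6), dtype=np.int64)
clean[1:3, 1:3] = 1
clean[4:6, 4:6] = 2

over_segmented = contour_noise(clean, cls=1)
missing = drop_structure(clean, cls=2)
swapped = confuse_labels(clean, a=1, b=2)
```

Each function returns a copy, so the clean label map stays available as a reference for later comparison.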

A Benchmark Suite for Fair Comparison

To address this gap, the authors introduce a benchmark suite that combines curated real-world noisy medical image segmentation datasets from diverse sources with a comprehensive federated segmentation framework. The suite incorporates deployment-relevant client-noise scenarios—for example, varying noise levels across participating sites—and noise-targeted evaluation metrics. This provides a realistic and discriminative basis for FNLL evaluation, enabling systematic assessment and informed method selection.
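One way to picture a client-noise scenario with varying noise levels is as a per-site fraction of training cases that carry a noisy annotation. The sketch below is a hypothetical configuration, not the suite's actual format: the site names, the fractions, the case count, and the `pick_noisy_cases` helper are all assumptions made for illustration.

```python
import numpy as np

# Hypothetical scenario: fraction of each client's training cases that use
# the noisy annotation instead of a cleaner reference annotation.
SCENARIO = {"site_a": 0.0, "site_b": 0.25, "site_c": 0.75}

def pick_noisy_cases(n_cases: int, noisy_fraction: float, seed: int) -> np.ndarray:
    """Return a boolean flag per case; True means 'train on the noisy label'."""
    rng = np.random.default_rng(seed)
    n_noisy = int(round(n_cases * noisy_fraction))
    flags = np.zeros(n_cases, dtype=bool)
    # A random permutation picks distinct case indices without replacement.
    flags[rng.permutation(n_cases)[:n_noisy]] = True
    return flags

# Each site gets its own seed so that the selection is reproducible.
assignment = {
    site: pick_noisy_cases(n_cases=8, noisy_fraction=fraction, seed=index)
    for index, (site, fraction) in enumerate(SCENARIO.items())
}
```

Fixing the seed per site keeps the scenario reproducible across methods, which matters when the goal is a fair comparison.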

Aspect	Previous Work	This Benchmark Suite
Noise source	Synthetic noise	Real-world noisy datasets
Settings	Simplified, uniform	Diverse client-noise scenarios
Evaluation	Limited, not noise-focused	Label-noise-targeted metrics
Reproducibility	Varies	Reusable foundation with public code

The benchmark establishes a reusable foundation for fair benchmarking, dataset-specific label-noise characterization, and future method development under realistic federated settings. The code is available at the repository linked in the paper.

Implications for Healthcare AI

For healthcare organizations deploying federated learning for medical imaging, this benchmark provides a tool to evaluate how different noisy-label mitigation techniques perform under realistic conditions. By moving beyond synthetic noise, practitioners can select methods that are more likely to generalize to actual annotation workflows. The framework also supports dataset-specific characterization, helping institutions understand the nature of their label errors and choose appropriate preprocessing or training strategies.
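As a rough illustration of dataset-specific characterization, an institution that holds both a routine annotation and a more carefully reviewed reference for some cases could compare the two per class. The sketch below uses the Dice coefficient and pixel counts. It is an assumption about how such a check might look, not the benchmark's own metrics, and the `characterize` function and the toy arrays are hypothetical.

```python
import numpy as np

def dice(a: np.ndarray, b: np.ndarray) -> float:
    """Dice overlap of two boolean masks; defined as 1.0 when both are empty."""
    denom = a.sum() + b.sum()
    if denom == 0:
        return 1.0
    return float(2.0 * np.logical_and(a, b).sum() / denom)

def characterize(noisy: np.ndarray, reference: np.ndarray, classes: list) -> dict:
    """Per-class overlap and size of a noisy annotation versus a reference."""
    report = {}
    for cls in classes:
        n, r = noisy == cls, reference == cls
        report[cls] = {
            "dice": dice(n, r),
            "noisy_pixels": int(n.sum()),
            "reference_pixels": int(r.sum()),
        }
    return report

# Hypothetical case: class 1 is over-segmented, class 2 is missing entirely.
reference = np.zeros((6, 6), dtype=np.int64)
reference[1:3, 1:3] = 1
reference[4:6, 4:6] = 2

noisy = np.zeros((6, 6), dtype=np.int64)
noisy[1:3, 1:4] = 1

report = characterize(noisy, reference, classes=[1, 2])
```

In this toy case, a moderate Dice with more noisy pixels than reference pixels points to over-segmentation, while a Dice of zero with no noisy pixels points to a missing structure.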

As federated learning expands in clinical deployment, the ability to handle real-world label noise becomes critical. This benchmark represents a step toward robust, trustworthy models that can be trained across institutions without compromising on data privacy or model accuracy. The authors emphasize that the suite offers a realistic and reproducible environment to drive progress in FNLL and ultimately improve automated medical image analysis.

Sources:

Federated Medical Image Segmentation under Real-World Label Noise: A Benchmark Suite for Noisy Label Learning Method Selection
