
Medical Image Segmentation Survey: U-Net, Transformers, SAM and Clinical Translation Challenges

A new arXiv survey systematically reviews medical image segmentation methods based on U-Net, Transformer, and SAM architectures. It covers public datasets, evaluation metrics, and key challenges, aiming to guide future research and clinical adoption. The authors have made all related resources publicly available on GitHub.

iGEN Editorial
June 16, 2026
Medical image segmentation is a cornerstone of modern clinical diagnostics, enabling precise delineation of anatomical structures and pathologies. According to a comprehensive survey published on arXiv by Zhu, Pengyu, Zhang, Xiaojing, Kunbo, Chunyan, and Wang, Zhenyu, the field has undergone systematic development driven by deep learning. The survey organizes methods built on three major architectures—U-Net, Transformer, and SAM—within a unified analytical framework, with a particular focus on their effectiveness in improving segmentation accuracy and efficiency.

U-Net-Based Methods

U-Net remains a foundational backbone for medical image segmentation due to its symmetric encoder-decoder structure and skip connections. The survey notes that U-Net-based methods are widely adopted for tasks ranging from organ segmentation to lesion detection. These methods benefit from a large body of public datasets and have demonstrated strong performance on benchmarks. However, the survey also highlights limitations in capturing long-range dependencies, which led to the exploration of Transformer-based alternatives.
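The encoder-decoder wiring described above can be traced with a toy example. This is a minimal sketch, not the survey's code: it works on a 1D signal, omits the learned convolutions a real U-Net applies at every stage, and the function names (`max_pool`, `upsample`, `unet_wiring`) are hypothetical. It shows only how skip connections carry full-resolution features across the "U" to the decoder.

```python
def max_pool(channels):
    """Halve every channel by taking the max of each adjacent pair."""
    return [[max(ch[i], ch[i + 1]) for i in range(0, len(ch) - 1, 2)]
            for ch in channels]


def upsample(channels):
    """Double every channel by repeating each value (nearest neighbour)."""
    return [[v for v in ch for _ in range(2)] for ch in channels]


def unet_wiring(signal, depth=2):
    """Trace a 1D signal through a U-shaped encoder-decoder.

    Assumption: no learned layers, so this illustrates the data flow only.
    """
    if len(signal) % (2 ** depth) != 0:
        raise ValueError("signal length must be divisible by 2**depth")
    x = [list(signal)]            # start with a single channel
    skips = []
    for _ in range(depth):        # encoder: remember features, then downsample
        skips.append(x)
        x = max_pool(x)
    for skip in reversed(skips):  # decoder: upsample, then concatenate the skip
        x = upsample(x) + skip    # list concatenation acts as channel concat
    return x


signal = [1, 3, 2, 5, 4, 4, 0, 7]
features = unet_wiring(signal)
```

Here `features` holds three channels of length 8. The first channel has passed through the coarsest level and has lost fine detail, while the last channel is the untouched input delivered by the outermost skip connection. In a real network, convolutions after each concatenation would fuse the two.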

Transformer-Based Methods

Transformers, originally developed for natural language processing, have been adapted for vision tasks, including medical image segmentation. The survey reviews representative Transformer-based approaches that leverage self-attention mechanisms to model global context. These methods often outperform U-Net variants on complex segmentation tasks, particularly where fine-grained boundaries or multi-scale features are critical. The authors note that Transformers, while powerful, require larger datasets and more computational resources, posing challenges for clinical deployment.
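To make "modeling global context" concrete, here is a minimal sketch of scaled dot-product self-attention in plain Python. It is a simplification under stated assumptions: queries, keys, and values are the raw tokens themselves (identity projections), whereas a real Transformer learns separate projection matrices and uses multiple heads. The function names are illustrative and do not come from the survey.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Scaled dot-product self-attention with Q = K = V = tokens.

    Assumption: identity projections, single head (illustration only).
    Returns (outputs, attention_weights).
    """
    d = len(tokens[0])
    outputs, weights = [], []
    for q in tokens:
        # Every token scores itself against every token: n * n comparisons.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in tokens]
        w = softmax(scores)
        weights.append(w)
        # The output is a weighted average of all value vectors.
        outputs.append([sum(wj * v[i] for wj, v in zip(w, tokens))
                        for i in range(d)])
    return outputs, weights


# Three toy "patch embeddings"; the first and third are identical.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
outputs, weights = self_attention(tokens)
```

Each row of `weights` sums to 1, and the first token attends more strongly to the identical third token than to the dissimilar second one, regardless of position. Because the weight matrix has one entry per token pair, cost grows quadratically with the number of tokens, which is one reason these models demand more compute.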

SAM-Based Methods

The Segment Anything Model (SAM) represents a recent paradigm shift toward foundation models for segmentation. The survey examines how SAM-based methods are being fine-tuned or adapted for medical imaging. Early results indicate strong zero-shot and few-shot capabilities, but the survey underscores that domain-specific adaptation remains necessary for clinical-grade accuracy. SAM models are still under active investigation for tasks like tumor segmentation and anatomical structure retrieval.

Challenges and Benchmarks

The survey identifies several persistent challenges: limited annotated data, domain shift across imaging modalities, class imbalance, and the need for real-time inference in clinical settings. It also enumerates widely used public datasets and evaluation metrics, pointing out differences between metrics like Dice coefficient, Hausdorff distance, and intersection over union (IoU). A key conclusion is that no single architecture universally dominates; the choice depends on the specific clinical application and available data.
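The three metrics named above measure different things, which a small worked example makes visible. The sketch below uses standard textbook definitions on binary masks stored as nested lists; the function names and the convention for two empty masks are assumptions made for illustration, not taken from the survey.

```python
from math import dist


def dice_and_iou(pred, truth):
    """Return (Dice, IoU) for two equally sized binary masks of 0/1 values."""
    p = [v for row in pred for v in row]
    t = [v for row in truth for v in row]
    inter = sum(1 for a, b in zip(p, t) if a and b)
    p_sum = sum(1 for a in p if a)
    t_sum = sum(1 for b in t if b)
    union = p_sum + t_sum - inter
    if union == 0:
        # Both masks empty: treated as perfect agreement (assumed convention).
        return 1.0, 1.0
    return 2 * inter / (p_sum + t_sum), inter / union


def hausdorff(pred, truth):
    """Symmetric Hausdorff distance between foreground pixels, in pixels."""
    a = [(r, c) for r, row in enumerate(pred) for c, v in enumerate(row) if v]
    b = [(r, c) for r, row in enumerate(truth) for c, v in enumerate(row) if v]
    if not a or not b:
        raise ValueError("Hausdorff distance is undefined for an empty mask")

    def directed(xs, ys):
        # Largest distance from any point in xs to its nearest point in ys.
        return max(min(dist(x, y) for y in ys) for x in xs)

    return max(directed(a, b), directed(b, a))


truth = [[0, 1, 1, 0],
         [0, 1, 1, 0],
         [0, 0, 0, 0]]
pred = [[0, 1, 1, 1],
        [0, 1, 0, 0],
        [0, 0, 0, 0]]

dice, iou = dice_and_iou(pred, truth)   # 0.75 and 0.6
hd = hausdorff(pred, truth)             # 1.0
```

Dice and IoU are both overlap ratios and are tied by Dice = 2·IoU / (1 + IoU), so they always rank two predictions in the same order. Hausdorff distance instead reports the worst boundary error, so a single stray pixel far from the target can dominate it while barely moving the overlap scores.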

Implications for Enterprise Technology Leaders

While the survey focuses on healthcare AI, its architectural comparisons and benchmarking methodology carry lessons for enterprise technology leaders in sectors such as logistics and supply chain. Segmenting complex images, whether medical scans or satellite freight imagery, relies on similar deep learning paradigms. The survey aims to guide future research and support clinical translation, and the open-source resources in the authors' GitHub repository enable rapid prototyping that may also accelerate cross-industry adoption.

