Color fundus photography (CFP) is the mainstay for large-scale retinal screening, but its diagnostic capacity is constrained by the lack of depth-resolved structural information. Optical coherence tomography (OCT) provides cross-sectional retinal anatomy, yet is less accessible in population-level screening. To bridge this gap, researchers have developed EyeMVP, a cross-modal retinal foundation model that uses paired CFP–OCT pretraining to learn OCT-informed CFP representations, according to a study published on arXiv.
Model Architecture and Pretraining
EyeMVP is pretrained on 674,893 strictly paired CFP–OCT image pairs, each taken from the same eye on the same day, from 112,642 patients across eight hospitals in China. The model uses cross-modal masked reconstruction to enrich CFP representations with OCT-associated supervision, while requiring only CFP images at inference. Because en-face CFP and cross-sectional OCT do not share an aligned imaging geometry, EyeMVP combines source-constrained cross-attention with CFP-derived structural masks.
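The article does not describe the implementation, so the following is only a minimal sketch of the general masked-reconstruction idea: hide a subset of OCT patches and score the reconstruction only on the hidden ones. The function names `make_patch_mask` and `masked_mse`, the 0.75 mask ratio, the toy patch sizes, and the constant stand-in prediction are all illustrative assumptions, not details from the study. The sketch does not model source-constrained cross-attention or the CFP-derived structural masks.

```python
import random


def make_patch_mask(num_patches, mask_ratio, rng):
    """Return a list of booleans; True marks a patch hidden from the model."""
    num_masked = int(round(num_patches * mask_ratio))
    masked_idx = set(rng.sample(range(num_patches), num_masked))
    return [i in masked_idx for i in range(num_patches)]


def masked_mse(target_patches, predicted_patches, mask):
    """Mean squared error computed over masked patches only."""
    total, count = 0.0, 0
    for target, predicted, is_masked in zip(target_patches, predicted_patches, mask):
        if not is_masked:
            continue  # visible patches do not contribute to the loss
        for t, p in zip(target, predicted):
            total += (t - p) ** 2
            count += 1
    return total / count if count else 0.0


rng = random.Random(0)

# Toy "OCT" input: 8 patches of 4 pixel values each (hypothetical sizes).
oct_patches = [[rng.random() for _ in range(4)] for _ in range(8)]

# Assumed mask ratio of 0.75; the study's actual ratio is not given here.
mask = make_patch_mask(len(oct_patches), mask_ratio=0.75, rng=rng)

# Stand-in prediction. In a cross-modal setup, a decoder conditioned on
# CFP features would produce these values instead of a constant.
predicted = [[0.5] * 4 for _ in oct_patches]

loss = masked_mse(oct_patches, predicted, mask)
print(f"masked patches: {sum(mask)} of {len(mask)}, loss: {loss:.4f}")
```

Restricting the loss to hidden patches is what forces the model to infer the missing OCT content from whatever context it is given, which in a cross-modal design includes the paired CFP image.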
Performance on Downstream Tasks
Across 16 downstream tasks, including classification, segmentation, few-shot adaptation, and cross-modal retrieval, EyeMVP outperforms representative retinal foundation models. The model shows consistent gains on tasks involving macular and optic nerve structure. For macular diseases that are difficult to detect on CFP alone, EyeMVP achieves an AUROC of 0.948 for macular edema (vs. 0.852 for EyeCLIP) and 0.825 for myopic macular schisis.
| Task | EyeMVP AUROC | Comparison Model AUROC |
|---|---|---|
| Macular edema | 0.948 | 0.852 (EyeCLIP) |
| Myopic macular schisis | 0.825 | Not reported |
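For readers unfamiliar with the metric, AUROC is the probability that a randomly chosen positive case receives a higher model score than a randomly chosen negative case, with ties counted as half. The sketch below computes it by direct pairwise comparison. The function name `auroc` and the labels and scores are toy values made up for illustration; they are not data from the study.

```python
def auroc(labels, scores):
    """AUROC as the fraction of positive/negative pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example: 3 diseased eyes (label 1) and 3 healthy eyes (label 0).
toy_labels = [1, 1, 0, 0, 1, 0]
toy_scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.2]

# 8 of the 9 positive/negative pairs are ordered correctly.
print(round(auroc(toy_labels, toy_scores), 3))  # 0.889
```

An AUROC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation, which gives a scale for reading the values in the table.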
Comparison with Existing Models
Beyond the macular edema comparison with EyeCLIP, EyeMVP outperforms other representative retinal foundation models across the 16-task benchmark, according to the study. The improvement is credited to the architecture's ability to incorporate pixel-level OCT supervision during pretraining.
Reader Study Results
In an exploratory reader study, EyeMVP exceeded junior and intermediate ophthalmologist groups but did not reach senior ophthalmologist performance on macular edema. On myopic macular schisis, EyeMVP showed numerically higher balanced accuracy than all reader groups. These results suggest that pixel-level cross-modal reconstruction can enrich CFP representations with OCT-associated supervision, providing a practical route toward stronger CFP-based retinal analysis in screening settings.
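Balanced accuracy, the metric cited for myopic macular schisis, is the mean of sensitivity and specificity, so it is not inflated when one class dominates. The sketch below shows the calculation; the function name `balanced_accuracy` and the example labels are hypothetical and are not taken from the reader study.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity (recall on positives) and specificity (recall on negatives)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in y_true")
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2


# Toy imbalanced example: 2 diseased eyes, 8 healthy eyes.
truth = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
calls = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]  # one diseased eye is missed

plain_accuracy = sum(t == p for t, p in zip(truth, calls)) / len(truth)
print(plain_accuracy)                    # 0.9
print(balanced_accuracy(truth, calls))   # 0.75
```

In this toy case plain accuracy looks strong at 0.9 even though half of the diseased eyes are missed, while balanced accuracy drops to 0.75. That is why it is a common choice for comparing readers and models on uncommon conditions.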
The study suggests that AI models can learn OCT-associated structural information during pretraining without requiring OCT at inference time, potentially enabling more accurate large-scale screening programs. For enterprise technology decision-makers evaluating medical imaging AI, the pretraining methodology and reported performance gains highlight the value of cross-modal learning in settings where OCT is not readily available.