Time-to-event (TTE) prediction matters in many industries, from healthcare to supply chain, where anticipating the time until an event (e.g., equipment failure, shipment delay) enables proactive decision-making. A new study by Zhang et al. introduces a foundation-model-driven framework for cross-modal representation alignment, designed to generalize across tasks and institutions. The researchers evaluate it on two clinically distinct TTE tasks, pulmonary embolism (PE) mortality and cardiovascular disease (CVD) outcomes, using large multi-institutional cohorts.
The Challenge of Multimodal Time-to-Event Modeling
Accurate TTE prediction from multimodal clinical data remains challenging because of modality imbalance, where one modality contributes much more predictive signal than the other, and distribution shift between institutions. The authors encode CT imaging and longitudinal electronic health record (EHR) data independently with domain-specific foundation models, then align the resulting representations in a shared latent space using four fusion strategies: late fusion, contrastive alignment, cross-attention, and co-attention.
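To make the contrastive alignment idea concrete, here is a minimal sketch of a symmetric InfoNCE-style loss over paired image and EHR embeddings. It is an illustration under stated assumptions, not the authors' implementation: the function names, the temperature value, and the premise that both encoders' outputs have already been projected to the same dimension are all hypothetical choices made for this example.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax of one row of similarity scores."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def contrastive_alignment_loss(image_emb, ehr_emb, temperature=0.1):
    """Symmetric InfoNCE loss (illustrative; temperature is an assumed value).

    image_emb[i] and ehr_emb[i] are assumed to come from the same patient
    and to share one embedding dimension.
    """
    img = [l2_normalize(v) for v in image_emb]
    ehr = [l2_normalize(v) for v in ehr_emb]
    n = len(img)
    # sim[i][j]: temperature-scaled cosine similarity of image i and EHR record j.
    sim = [
        [sum(a * b for a, b in zip(img[i], ehr[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Image-to-EHR direction: each image should pick out its own EHR record.
    loss_i2e = -sum(log_softmax(sim[i])[i] for i in range(n)) / n
    # EHR-to-image direction: same computation on the transposed matrix.
    sim_t = [list(col) for col in zip(*sim)]
    loss_e2i = -sum(log_softmax(sim_t[j])[j] for j in range(n)) / n
    return 0.5 * (loss_i2e + loss_e2i)


if __name__ == "__main__":
    images = [[1.0, 0.0], [0.0, 1.0]]
    records = [[0.9, 0.1], [0.2, 0.8]]
    print(contrastive_alignment_loss(images, records))
```

Minimizing this loss pulls each patient's image and EHR embeddings together while pushing apart embeddings from different patients in the batch, which is the general mechanism behind aligning two modalities in a shared latent space.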
Four Fusion Strategies Tested
The research compares these strategies on two tasks. Below is a summary of the best-performing approaches per task:
| Task | Best internal strategy | Best external strategy | Reported C-index gain over unimodal baselines |
|---|---|---|---|
| PE mortality | Contrastive multimodal fusion (CLMBR representations), the most consistent performer | Not stated | 1.5–5.4% (overall range; per-task figures not given here) |
| MACE (major adverse cardiovascular events) | Cross-attention (one-hot) | Image-guided co-attention | 1.5–5.4% (overall range; per-task figures not given here) |
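The cross-attention strategy named in the table can be illustrated with a bare scaled dot-product attention step, in which tokens from one modality act as queries over tokens from the other. This is a sketch under assumptions rather than the paper's architecture: it omits learned query, key, and value projections, uses a single head, and the choice of EHR tokens as queries over image tokens is hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product cross-attention (no learned projections).

    queries: tokens from one modality (assumed here: EHR tokens).
    keys, values: tokens from the other modality (assumed here: image tokens).
    Returns the attended outputs and the attention weights per query.
    """
    d = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(dimension).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Output is the attention-weighted average of the value vectors.
        out = [
            sum(weights[j] * values[j][c] for j in range(len(values)))
            for c in range(len(values[0]))
        ]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


if __name__ == "__main__":
    ehr_tokens = [[10.0, 0.0]]
    image_tokens = [[1.0, 0.0], [0.0, 1.0]]
    fused, attn = cross_attention(ehr_tokens, image_tokens, image_tokens)
    print(fused, attn)
```

Co-attention, as the term is commonly used, runs this kind of step in both directions so that each modality attends to the other; how the paper's "image-guided" variant is configured is not specified in this summary.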
Experimental Setup and Results
The cohorts are sizable. The PE task uses 3,099 training, 1,098 internal test, and 435 external test samples; the CVD task uses 2,951 training, 837 internal test, and 682 external test samples. The paper reports that fusion improves the concordance index by 1.5–5.4% over unimodal baselines when the modalities contribute comparably. Contrastive multimodal fusion, particularly with CLMBR representations, gave the most consistent and statistically robust improvements, especially for PE mortality prediction. For the CVD task, framed as prediction of major adverse cardiovascular events (MACE), cross-attention (one-hot) achieved the highest internal performance, while image-guided co-attention achieved the best external performance.
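For readers unfamiliar with the metric, the concordance index measures how often a model ranks pairs of subjects correctly: among comparable pairs, the subject who experienced the event earlier should have the higher predicted risk. The sketch below implements the common Harrell's formulation in plain Python; the function and argument names are illustrative, tied event times are ignored for simplicity, and the summary does not say which exact variant the paper uses.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's concordance index for right-censored data (illustrative sketch).

    times: observed follow-up time per subject.
    events: 1 if the event was observed, 0 if the subject was censored.
    risk_scores: model output where a higher score means earlier expected event.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            # A censored subject cannot serve as the earlier member of a pair,
            # because its true event time is unknown.
            continue
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


if __name__ == "__main__":
    times = [5, 8, 12, 20]
    events = [1, 1, 0, 1]
    risks = [0.9, 0.4, 0.6, 0.1]
    print(concordance_index(times, events, risks))
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, so gains of a few points over a unimodal baseline reflect a measurable improvement in how well patients are ordered by risk.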
Implications for Enterprise AI
Although the study focuses on clinical data, the framework's design is not tied to medicine and could in principle carry over to other multimodal TTE prediction problems. In logistics and supply chain, for example, combining sensor data (analogous to CT) with operational logs (analogous to EHR) could help predict equipment failure or delivery delays, though the paper does not test such settings. The authors describe their work as the first systematic analysis of fusion behavior under modality imbalance in TTE prediction, a challenge that also arises outside healthcare.
A Task-Aware Design Principle
The authors conclude that task-aware multimodal alignment is a necessary design principle for robust generalization and scalable deployment. Their work establishes a foundation for deploying cross-modal AI in real-world settings where data sources are heterogeneous and imbalanced.