A team of researchers has introduced FusionRS, the first large-scale RGB-infrared-text dataset designed for dual-modal vision-language learning in remote sensing, according to a paper published on arXiv. The dataset addresses a gap in existing remote sensing AI, which predominantly relies on RGB imagery, by incorporating infrared data that provides thermal intensity structures, object boundaries, and illumination-invariant scene features.
Background: The Need for Infrared in Remote Sensing
Most existing remote sensing vision-language models remain centered on RGB imagery, leaving the complementary information in infrared data underexplored, the authors report. Infrared images offer distinctive cues (thermal intensity structures, object boundaries, and illumination-invariant scene features) that can enrich vision-language learning beyond conventional RGB observations. Until now, however, no large-scale RGB-infrared-text dataset existed for remote sensing vision-language modeling.
FusionRS Dataset Construction
FusionRS is constructed by translating diverse public RGB remote sensing images into infrared-style counterparts, forming aligned RGB-IR image pairs. Each pair is associated with two types of textual descriptions: conventional scene captions and IR-aware captions that explicitly describe infrared-specific visual properties while preserving semantic content.
| Component | Description |
|---|---|
| RGB images | Diverse public remote sensing images |
| Infrared-style counterparts | Translated from the RGB images to form aligned RGB-IR pairs |
| Scene captions | Conventional captions describing the scene |
| IR-aware captions | Captions describing infrared-specific visual properties while preserving semantic content |
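
To make the structure concrete, one record can be pictured as an aligned image pair plus its two captions. The sketch below is illustrative only: the class name, field names, file paths, and caption strings are hypothetical and do not come from the paper or any released FusionRS schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RGBIRTextSample:
    """One aligned record. All field names are hypothetical."""
    rgb_path: str          # source RGB remote sensing image
    ir_path: str           # infrared-style image translated from the RGB image
    scene_caption: str     # conventional scene caption
    ir_aware_caption: str  # caption that also describes infrared-specific properties


def caption_for(sample: RGBIRTextSample, ir_aware: bool) -> str:
    """Select the text supervision used for a given training setting."""
    return sample.ir_aware_caption if ir_aware else sample.scene_caption


# Placeholder values invented for illustration, not taken from the dataset.
sample = RGBIRTextSample(
    rgb_path="rgb/0001.png",
    ir_path="ir/0001.png",
    scene_caption="An airport with two runways and a terminal building.",
    ir_aware_caption="An airport whose runways appear brighter than the surrounding grass.",
)

print(caption_for(sample, ir_aware=True))
```

Switching `ir_aware` between `True` and `False` mirrors the contrast the article draws between IR-aware and non-IR-aware training settings.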
Model Training and Key Results
Based on FusionRS, the researchers trained dual-modal vision-language foundation models for RGB-IR joint understanding. They first trained CLIP-style models for RGB-IR-text alignment, then fine-tuned generative vision-language models (VLMs) for dual-modal RGB-IR captioning. Experiments show that FusionRS improves RGB-IR alignment, infrared-to-text retrieval, and dual-modal captioning over RGB-only and non-IR-aware training settings.
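The article does not spell out the training objective, so the following is only a generic sketch of what CLIP-style alignment across three modalities can look like: a symmetric contrastive (InfoNCE) loss applied to each pair of RGB, infrared, and text embeddings. The function names, the equal weighting of the three pairs, and the temperature value of 0.07 are assumptions, not details from the paper.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def clip_loss(a, b, temperature=0.07):
    """Symmetric contrastive loss; row i of `a` is the positive for row i of `b`."""
    logits = l2_normalize(a) @ l2_normalize(b).T / temperature
    idx = np.arange(len(a))
    loss_a_to_b = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_b_to_a = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_a_to_b + loss_b_to_a)


def tri_modal_loss(rgb, ir, text, temperature=0.07):
    """Average the pairwise losses (equal weighting is an assumption)."""
    return (
        clip_loss(rgb, text, temperature)
        + clip_loss(ir, text, temperature)
        + clip_loss(rgb, ir, temperature)
    ) / 3.0


# Toy embeddings standing in for encoder outputs (batch of 8, dimension 16).
rng = np.random.default_rng(0)
emb = rng.normal(size=(8, 16))
aligned = tri_modal_loss(emb, emb, emb)
mismatched = tri_modal_loss(emb, emb, emb[::-1])  # text rows no longer match images
print(aligned < mismatched)
```

In this toy setup the loss is lower when the text embeddings line up with their images than when they are shuffled, which is the behavior a contrastive alignment objective is meant to reward.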
Ablation Studies: Importance of IR-Aware Captions
Ablation studies further verify that IR-aware captions are crucial for strengthening infrared-language alignment. The findings highlight the importance of modality-specific textual supervision for more scalable RGB-infrared remote sensing vision-language representation learning, according to the paper.
Implications for Enterprise Technology
While the dataset is primarily a research contribution, its potential to enhance Earth observation understanding may benefit enterprise applications requiring robust monitoring of infrastructure, agriculture, or logistics networks. Remote sensing AI that can fuse RGB and infrared data could improve detection of anomalies or changes in physical assets, though the researchers do not explicitly address supply chain use cases. The work underscores the value of multi-modal data in advancing foundation models for geospatial analysis.