Robust perception under poor lighting, shadows, and adverse weather remains a core barrier to deploying autonomous vehicles in logistics, where 24/7 operations demand reliability. Semantic segmentation, the pixel-level classification of road elements, often fails when fixed sensor fusion strategies propagate noise from one modality across the network. A new framework called CLARITY, described in a research paper on arXiv, addresses this by using language guidance to dynamically adjust how visual and thermal data are combined, and it reports new state-of-the-art accuracy on the MFNet benchmark.
The Illumination Challenge in Autonomous Logistics
Autonomous trucks and delivery robots must operate at night, in tunnels, or under overcast skies. According to the paper, existing RGB-Thermal fusion methods apply static fusion strategies uniformly across all conditions, allowing modality-specific noise to propagate throughout the network. This uniform approach causes errors when one sensor is degraded, for example when a camera is blinded by glare while the thermal sensor remains reliable. The authors report that CLARITY overcomes this by dynamically adapting its fusion strategy to the detected scene condition.
CLARITY: Language-Guided Dynamic Fusion
CLARITY, whose name derives from the paper's methodology, is guided by vision-language model (VLM) priors. The network learns to modulate each modality's contribution based on the detected illumination state while leveraging object embeddings for segmentation. The paper introduces two novel mechanisms:
- A mechanism that preserves valid dark-object semantics that prior noise-suppression methods incorrectly discard.
- A hierarchical decoder that enforces structural consistency across scales to sharpen boundaries on thin objects.
These components allow the framework to treat different regions of the image differently, rather than applying one fusion policy to the entire scene.
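The idea of weighting each modality per region according to illumination can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `gated_fusion`, the per-pixel illumination logits (standing in for a VLM-derived prior), and the `gate_table` of per-state modality preferences are all hypothetical, chosen only to show how a soft illumination estimate could turn into spatially varying fusion weights.

```python
import numpy as np

def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def gated_fusion(rgb_feat, thermal_feat, illum_logits, gate_table):
    """Illustrative illumination-conditioned fusion (hypothetical sketch).

    rgb_feat, thermal_feat: feature maps of shape (C, H, W).
    illum_logits: per-pixel scores for K illumination states, shape (K, H, W);
        assumed here to come from some language-guided prior.
    gate_table: shape (K, 2), per-state logits for [rgb, thermal];
        assumed here to be a learned parameter.
    Returns the fused features (C, H, W) and the weights (2, H, W).
    """
    # Soft assignment of each pixel to an illumination state.
    state_probs = softmax(illum_logits, axis=0)                    # (K, H, W)
    # Expected modality logits per pixel under that assignment.
    gate_logits = np.einsum("khw,km->mhw", state_probs, gate_table)  # (2, H, W)
    # Normalise so the two modality weights sum to 1 at every pixel.
    weights = softmax(gate_logits, axis=0)                         # (2, H, W)
    fused = weights[0] * rgb_feat + weights[1] * thermal_feat      # broadcast over C
    return fused, weights
```

With a `gate_table` such as `[[2.0, 0.0], [0.0, 2.0]]` (state 0 favouring RGB, state 1 favouring thermal), pixels scored as well lit lean on the RGB features and pixels scored as dark lean on the thermal features, so a single image can receive different fusion weights in different regions.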
State-of-the-Art Results on MFNet
Experiments were conducted on the MFNet dataset, a standard benchmark for RGB-Thermal road scene segmentation. CLARITY established a new state-of-the-art (SOTA) with the following metrics:
| Metric | Value |
|---|---|
| Mean Intersection over Union (mIoU) | 62.3% |
| Mean Accuracy (mAcc) | 77.5% |
The paper states that these results set a new SOTA on MFNet, ahead of prior methods that use static fusion. It does not disclose exact comparisons, so the size of the margin over those methods cannot be quantified here.
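For readers unfamiliar with the two metrics, the sketch below shows how mIoU and mAcc are commonly computed from a confusion matrix. It uses the usual definitions (per-class intersection over union, and per-class pixel accuracy, each averaged over classes); the paper's exact evaluation protocol, such as how classes absent from the ground truth are handled, is an assumption here. The function names are illustrative.

```python
import numpy as np

def confusion_matrix(pred, target, num_classes):
    # Rows index the ground-truth class, columns the predicted class.
    idx = target.ravel() * num_classes + pred.ravel()
    counts = np.bincount(idx, minlength=num_classes ** 2)
    return counts.reshape(num_classes, num_classes)

def miou_and_macc(cm):
    tp = np.diag(cm).astype(float)   # correctly labelled pixels per class
    gt = cm.sum(axis=1)              # ground-truth pixels per class
    pr = cm.sum(axis=0)              # predicted pixels per class
    union = gt + pr - tp
    with np.errstate(divide="ignore", invalid="ignore"):
        iou = tp / union             # NaN where a class never appears
        acc = tp / gt                # NaN where a class has no ground truth
    # Assumption: classes that never appear are skipped in the average.
    return np.nanmean(iou), np.nanmean(acc)

# Toy 2x2 label maps with two classes.
target = np.array([[0, 0], [1, 1]])
pred = np.array([[0, 1], [1, 1]])
miou, macc = miou_and_macc(confusion_matrix(pred, target, num_classes=2))
print(f"mIoU={miou:.4f}, mAcc={macc:.4f}")  # mIoU=0.5833, mAcc=0.7500
```

In the toy example, class 0 has an IoU of 1/2 and class 1 an IoU of 2/3, giving a mean of about 0.583, while the per-class accuracies of 0.5 and 1.0 average to 0.75. This shows why mAcc can sit well above mIoU, as it does in the reported results.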
Implications for Enterprise Autonomous Deployment
For logistics companies investing in autonomous fleets, whether long-haul trucks or last-mile delivery bots, the ability to accurately segment road scenes under varied illumination directly reduces the risk of perception failures. While the research is still at the academic stage, the techniques described could be integrated into commercial autonomy stacks. The use of VLM priors to guide sensor fusion is a novel approach that may influence how perception systems are designed for robust all-weather operation. The paper's authors, Ruturaj Reddy, Hrishav Bakul Barua, Junn Yong Loo, Thanh Thi Nguyen, and Ganesh Krishnasamy, have not announced any commercial partnerships, but the code and methodology are expected to be shared alongside the arXiv paper.