Rovers operating in unknown environments face a fundamental challenge: they must maintain spatial maps that encode not only objects but also sensor quality (range reliability, lighting artifacts, and data density) to guide data fusion, embedding updates, and navigation under partial observability. A new paper on arXiv by Jan-Niklas Klein, Sona Ghahremani, Christian Medeiros Adriano, and Holger Giese presents CrossMaps, a real-time, confidence-aware, open-vocabulary semantic mapping pipeline designed to address this coupled perception-navigation problem.
How CrossMaps Works
According to the paper, CrossMaps builds on VLMaps-style approaches by integrating multi-scale CLIP embeddings with confidence-aware fusion and a dual-memory architecture consisting of Short-Term Memory (STM) and Long-Term Memory (LTM). The STM aggregates noisy visual observations using geometric, semantic, and temporal confidence cues. Confident and coherent cells are then promoted to the LTM as persistent semantic landmarks. This dual-memory design allows the system to handle uncertainty and maintain a stable, queryable map over time.
The system takes RGB-D data as input and produces semantic heatmaps that can be queried with natural language. For example, an operator could ask "Where is the nearest red container?" and the map would return a confidence-weighted location. Because the map stores open-vocabulary CLIP embeddings, the system is not limited to a predefined set of object categories.
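The query step described above can be sketched as a similarity search over the map. This is a minimal illustration, not the authors' implementation: it assumes each grid cell stores an embedding and a confidence in [0, 1], uses tiny 3-dimensional vectors as stand-ins for CLIP embeddings, and the names `cosine`, `query_heatmap`, and `best_cell` are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0

def query_heatmap(cells, text_embedding):
    """Score each cell as max(similarity, 0) * confidence.

    `cells` maps (row, col) -> (embedding, confidence in [0, 1]).
    """
    return {
        pos: max(cosine(emb, text_embedding), 0.0) * conf
        for pos, (emb, conf) in cells.items()
    }

def best_cell(heatmap):
    """Grid position with the highest confidence-weighted score."""
    return max(heatmap, key=heatmap.get)

# Toy map: stand-in embeddings, not real CLIP vectors.
cells = {
    (0, 0): ([1.0, 0.0, 0.0], 0.9),
    (0, 1): ([0.0, 1.0, 0.0], 0.2),  # matches the query, but low confidence
    (1, 0): ([0.0, 0.0, 1.0], 0.8),
    (1, 1): ([0.0, 1.0, 0.0], 0.9),  # matches the query, high confidence
}
query = [0.0, 1.0, 0.0]  # stand-in for an encoded text query
heatmap = query_heatmap(cells, query)
print(best_cell(heatmap))
```

In this toy map two cells match the query equally well, and the confidence weighting is what makes the more reliable one rank first.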
| Feature | Short-Term Memory (STM) | Long-Term Memory (LTM) |
|---|---|---|
| Role | Aggregates noisy observations | Stores persistent semantic landmarks |
| Confidence handling | Weights observations by geometric, semantic, and temporal cues | Accepts only cells promoted as confident and coherent |
| Update frequency | High (every frame) | Low (only on promotion) |
| Persistence | Temporary | Long-term |
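The STM-to-LTM flow summarized in the table can be sketched as a confidence-weighted running mean with a promotion rule. This is a simplified illustration under stated assumptions, not the paper's method: the thresholds `promote_weight` and `promote_hits` are hypothetical, the observation count is only a crude stand-in for the temporal cue, and the coherence check is omitted.

```python
from dataclasses import dataclass, field

@dataclass
class StmCell:
    embedding: list      # confidence-weighted running mean of observations
    weight: float = 0.0  # accumulated confidence
    hits: int = 0        # number of fused observations

@dataclass
class DualMemoryMap:
    promote_weight: float = 2.0  # hypothetical threshold on accumulated confidence
    promote_hits: int = 3        # hypothetical minimum number of observations
    stm: dict = field(default_factory=dict)
    ltm: dict = field(default_factory=dict)

    def observe(self, pos, embedding, confidence):
        """Fuse one observation into the STM and promote the cell if it qualifies."""
        if confidence <= 0.0:
            return  # ignore observations that carry no confidence
        cell = self.stm.get(pos)
        if cell is None:
            cell = self.stm[pos] = StmCell(embedding=[0.0] * len(embedding))
        total = cell.weight + confidence
        cell.embedding = [
            (old * cell.weight + new * confidence) / total
            for old, new in zip(cell.embedding, embedding)
        ]
        cell.weight = total
        cell.hits += 1
        # Promotion: copy the fused embedding into the LTM as a persistent landmark.
        if cell.weight >= self.promote_weight and cell.hits >= self.promote_hits:
            self.ltm[pos] = list(cell.embedding)

memory = DualMemoryMap()
for conf in (0.9, 0.8, 0.7):
    memory.observe((2, 5), [1.0, 0.0], conf)   # repeated, confident observations
memory.observe((4, 1), [0.0, 1.0], 0.3)        # single weak observation
print(sorted(memory.ltm))
```

Here the repeatedly observed cell is promoted, while the single weak observation stays in the STM only.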
Real-Time Deployment
CrossMaps is designed for deployment with a Jetson Orin-powered UGV (unmanned ground vehicle) alongside SLAM (Simultaneous Localization and Mapping). The authors report that the pipeline runs in real time, processing sensory data on the edge without requiring cloud connectivity. This makes it suitable for autonomous navigation in dynamic or communication-limited environments.
The system's confidence-awareness helps the rover decide when to fuse new observations or rely on stored landmarks. For instance, if lighting is poor, the STM reduces the weight of visual data, preventing erroneous map updates.
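The lighting example can be made concrete with a small sketch of how confidence cues might gate fusion. The cue definitions, the parameters `max_range_m` and `well_lit`, and the fusion `threshold` are all assumptions chosen for illustration; the article does not specify how CrossMaps computes or combines its cues.

```python
def observation_confidence(depth_m, brightness, max_range_m=5.0, well_lit=0.5):
    """Combine a range cue and a lighting cue into one weight in [0, 1].

    `brightness` is assumed to be a mean image intensity normalized to [0, 1].
    """
    # Range cue: falls off linearly and reaches zero at the assumed maximum range.
    range_cue = max(0.0, 1.0 - depth_m / max_range_m)
    # Lighting cue: rises linearly until the scene counts as well lit.
    lighting_cue = min(1.0, max(0.0, brightness) / well_lit)
    return range_cue * lighting_cue

def should_fuse(confidence, threshold=0.3):
    """Fuse the observation only if its confidence clears the assumed threshold."""
    return confidence >= threshold

bright = observation_confidence(depth_m=1.0, brightness=0.8)
dark = observation_confidence(depth_m=1.0, brightness=0.1)
print(should_fuse(bright), should_fuse(dark))
```

With these assumed parameters, the same object at the same distance is fused in good light and skipped in poor light, so the map would fall back on its stored landmarks.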
Implications for Autonomous Navigation
CrossMaps' semantic heatmaps can be queried in natural language to guide rover navigation, which goes beyond traditional semantic mapping pipelines that typically rely on a fixed label set. Because the map stores open-vocabulary CLIP embeddings, the system can respond to arbitrary text queries, enabling more flexible human-robot interaction. The dual-memory architecture also improves robustness: transient errors are absorbed in the STM rather than written into the persistent map.
The paper is available on arXiv under a Creative Commons Attribution 4.0 International license, which permits sharing and adaptation with attribution. While the immediate application is rover navigation, the techniques could be adapted to other autonomous systems that need to build and query semantic maps in uncertain conditions.