Multimodal egocentric activity recognition, which combines visual and inertial cues to understand first-person behavior, faces significant hurdles when deployed in open-world environments. According to a paper on arXiv, existing methods struggle to detect activities never seen before while continuously learning from non-stationary data streams. The authors propose MAND (Modality-Aware Novelty Detection), a framework that adaptively leverages complementary evidence from multiple modalities to improve reliability.
The Problem with Existing Approaches
Traditional multimodal systems score novelty using only the main fused logits, according to the paper, and so fail to fully exploit the complementary evidence available from individual modalities. Because these logits are often dominated by RGB, cues from other modalities, particularly the IMU (inertial measurement unit), remain underutilized. The paper notes that this imbalance worsens as catastrophic forgetting accumulates, that is, as the network overwrites previously learned knowledge while learning new tasks.
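To make the imbalance concrete, here is a minimal sketch, not the paper's implementation. It assumes late fusion by summing per-modality logits and a novelty score of one minus the maximum softmax probability; both choices, along with the names `softmax` and `fused_novelty_score` and the example logits, are illustrative assumptions.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of class logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def fused_novelty_score(rgb_logits, imu_logits):
    """Novelty from fused logits only: 1 - max softmax probability.

    Assumption: fusion is a plain element-wise sum of modality logits.
    """
    fused = [r + i for r, i in zip(rgb_logits, imu_logits)]
    return 1.0 - max(softmax(fused))

# Hypothetical sample: RGB is confident, IMU is nearly flat (uncertain).
rgb = [6.0, 0.5, 0.2]
imu = [0.3, 0.4, 0.35]

fused_score = fused_novelty_score(rgb, imu)   # driven almost entirely by RGB
imu_only_score = 1.0 - max(softmax(imu))      # IMU alone looks far more "novel"
```

In this toy case the fused score stays close to zero even though the IMU stream on its own is highly uncertain, which is the kind of underused evidence the paper describes.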
MAND: Dual Mechanism for Adaptive Learning
MAND introduces two key components. At inference, the Modality-aware Adaptive Scoring (MoAS) mechanism adjusts each modality's contribution according to its sample-wise reliability. It refines novelty scoring with deviation and disagreement penalties, so that less reliable modalities are downweighted. During training, Modality-aware Representation Stabilization Training (MoRST) preserves the discriminative capacity of each modality across tasks. It does so through modality-specific heads and modality-wise logit distillation, which are intended to mitigate catastrophic forgetting.
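The two ideas can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the authors' formulas: reliability is approximated by each modality's maximum softmax probability, the disagreement penalty is a fixed constant added when modalities predict different classes, the deviation penalty is omitted, and distillation is a temperature-softened KL divergence per modality head. All function names and parameters (`adaptive_novelty_score`, `disagreement_weight`, `temperature`, and so on) are hypothetical.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def adaptive_novelty_score(modality_logits, disagreement_weight=0.5):
    """Reliability-weighted novelty score (illustrative, not the paper's MoAS).

    modality_logits: dict mapping a modality name to its list of class logits.
    """
    probs = {name: softmax(l) for name, l in modality_logits.items()}
    # Assumed reliability proxy: the modality's max softmax probability.
    reliability = {name: max(p) for name, p in probs.items()}
    total = sum(reliability.values())
    weights = {name: r / total for name, r in reliability.items()}
    # Less reliable modalities get smaller weights in the combined score.
    base = sum(weights[n] * (1.0 - reliability[n]) for n in probs)
    # Assumed disagreement penalty: modalities predict different classes.
    predicted = {p.index(max(p)) for p in probs.values()}
    penalty = disagreement_weight if len(predicted) > 1 else 0.0
    return base + penalty

def logit_distillation_loss(old_logits, new_logits, temperature=2.0):
    """KL(old || new) between softened distributions of one modality head."""
    p_old = softmax(old_logits, temperature)
    p_new = softmax(new_logits, temperature)
    return sum(po * math.log(po / pn) for po, pn in zip(p_old, p_new))

def modality_wise_distillation(old_by_modality, new_by_modality, temperature=2.0):
    """Sum the distillation loss over every modality-specific head."""
    return sum(
        logit_distillation_loss(old_by_modality[m], new_by_modality[m], temperature)
        for m in old_by_modality
    )

# Hypothetical samples: modalities agree on class 0, or disagree.
agree = {"rgb": [5.0, 0.0, 0.0], "imu": [4.0, 0.0, 0.0]}
disagree = {"rgb": [5.0, 0.0, 0.0], "imu": [0.0, 4.0, 0.0]}
```

Under these assumptions, `adaptive_novelty_score(disagree)` is higher than `adaptive_novelty_score(agree)`, and the distillation loss is zero when a head's new logits match the ones it produced before the new task.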
Experimental Results
The authors tested MAND on a public multimodal egocentric benchmark. The results show that MAND consistently improves novel activity detection and known-class accuracy while substantially reducing FPR95, the false positive rate measured at a 95% true positive rate (recall). This indicates more reliable open-world recognition than existing methods. The source code is publicly available at the link in the paper.
| Metric | Existing Methods | MAND |
|---|---|---|
| Novel activity detection | Baseline | Improved |
| Known-class accuracy | Baseline | Improved |
| FPR95 | Higher | Substantially reduced |
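
For readers unfamiliar with FPR95, the following is a small sketch of how the metric can be computed from novelty scores. It assumes that a higher score means "more novel" and that novel samples are the positive class; some works use the opposite convention, and the paper's exact evaluation protocol is not reproduced here. The function name `fpr_at_95_tpr` and the sample scores are hypothetical.

```python
import math

def fpr_at_95_tpr(known_scores, novel_scores, tpr_target=0.95):
    """False positive rate at the threshold that detects >= 95% of novel samples.

    Assumptions: higher score = more novel; novel samples are positives.
    """
    ranked = sorted(novel_scores, reverse=True)
    # Number of novel samples that must be flagged to reach the target TPR.
    k = math.ceil(tpr_target * len(ranked))
    threshold = ranked[k - 1]
    # Known-class samples at or above the threshold are false positives.
    false_positives = sum(1 for s in known_scores if s >= threshold)
    return false_positives / len(known_scores)

# Hypothetical scores for four known-class and four novel samples.
known = [0.1, 0.2, 0.65, 0.3]
novel = [0.9, 0.8, 0.7, 0.6]
fpr95 = fpr_at_95_tpr(known, novel)
```

A lower value means fewer known activities are mistakenly flagged as novel once the detector is tuned to catch nearly all novel ones.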
The research was conducted by Hyejeong Im, Wonseon Lim, and Dae-Won Kim. The paper is titled "MAND: Modality-Aware Novelty Detection for Open-World Egocentric Activity Recognition."
Implications for Enterprise AI
While the research is academic, the ability to detect novel activities in first-person video with multimodal data has relevance for enterprise systems that require anomaly detection, such as monitoring worker actions in manufacturing or logistics. The MAND framework's focus on robustness and adaptability aligns with the needs of open-world deployments where unseen events must be detected reliably without manual retraining.
The publication on arXiv and the availability of source code enable further exploration and adoption by the research community.