Topic
multimodal
New Attack Forces Costly Model Usage in Multimodal LLM Cascades
A research paper introduces the Forced Deferral Attack (FDA), which manipulates confidence thresholds in multimodal large language model cascades, causing queries to be routed to more expensive strong models. The attack raises security concerns for enterprises deploying cost-optimized AI systems.
Scribby Multi-Level LLM Framework Promises Fine-Grained Semantic Analysis of Long-Form Video
Researchers propose Scribby, an LLM-based framework for semantic video analysis that balances macro-level comprehension with micro-level semantic indexing. The approach analyzes both the full transcript and individual sentences, then groups sentences by semantic similarity using an LLM as a judge, enabling a more detailed understanding of video structure and thematic progression.
MAGE-RAG: Multigranular Adaptive Graph Evidence Framework Improves Long-Document Multimodal QA Accuracy
The MAGE-RAG research paper introduces a multigranular adaptive graph evidence framework for multimodal retrieval-augmented generation (RAG) in long-document question answering. By building an evidence graph with page and element nodes and using an online controller to iteratively activate and prune evidence, it balances coverage and noise. Experiments show accuracy improvements over existing methods on LongDocURL and MMLongBench-Doc benchmarks.
Training-Free Framework Uses XAI and Multimodal LLMs to Generate Grounded Explanations for Speech Deepfake Detection
Researchers propose a training-free explanation framework that integrates XAI evidence with multimodal large language models to generate grounded and specific explanations for speech deepfake detection. Using the PartialSpoof dataset, the method increases inside accuracy by over 45%, verified through human evaluation and faithfulness checks.
Unifying Acoustic Features and Text with Multimodal LLMs for Neurodegenerative Disease Staging
Researchers propose NeurMLLM, a multimodal generative framework that integrates acoustic features and text using a large language model for neurodegenerative disease staging. Evaluated on the Bridge2AI-Voice dataset, it outperforms classical machine learning and existing LLM-based methods for Alzheimer's and Parkinson's disease staging.
X-Tokenizer: Semantic Action Tokenizer Boosts Robot Control by 13.5% Over FAST
Researchers propose X-Tokenizer, a new action tokenizer that treats tokenization as semantic interface learning rather than mere compression. Using a lightweight architecture that places a Semantic Residual Quantization (SRQ) stage between an encoder and a decoder, it improves multimodal grounding by 13.5% and long-horizon task performance by 8.25 points over existing methods such as FAST.
Visual-Seeker: Visual-Native AI Agent for Active Visual Reasoning in Multimodal Search
Researchers propose Visual-Seeker, a visual-native multimodal deep search agent that actively harvests fine-grained visual evidence during search. Trained on a synthesized dataset of 5K multimodal trajectories, it achieves state-of-the-art results on five benchmarks, outperforming several proprietary models.