Unmanned Aerial Vehicles (UAVs) are increasingly used in inspection and surveillance, but their ability to answer natural language questions by actively exploring environments – a task known as Embodied Question Answering (EQA) – has been limited. Existing outdoor EQA systems typically stop once a target enters the UAV's field of view, leaving fine-grained viewpoint adjustments for evidence-seeking questions unresolved. According to a new preprint on arXiv, researchers have introduced ScoutVLA, an evidence-driven Vision-Language-Action (VLA) model designed to address this gap.
"To address this issue, we introduce FG-EQA, a fine-grained active perception EQA benchmark with more than 40K simulated trajectories and 1K real-world trajectories." – from the paper
FG-EQA Benchmark: Fine-Grained Active Perception
The team first developed FG-EQA, a benchmark for fine-grained active perception comprising more than 40,000 simulated trajectories and 1,000 real-world trajectories. Its tasks require a UAV not only to locate a target but also to refine its viewpoint to gather the evidence needed to answer the question.
ScoutVLA Architecture: Dual-Expert Design
ScoutVLA draws inspiration from the "waggle dance" of scout bees, which iteratively adjust flight paths to verify target information. The model employs a decoupled dual-expert architecture:
- A vision-language expert that infers semantic intent to identify missing evidence.
- An independent action expert that uses high-DoF flow matching to generate continuous viewpoint-refinement trajectories.
This separation is key to balancing the competing demands of continuous control and semantic reasoning.
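The article does not describe how the action expert samples its trajectories, so the following is only a minimal sketch of flow-matching sampling in general, not ScoutVLA's implementation. Flow matching generates an output by integrating a velocity field from noise at t = 0 to a sample at t = 1. The names `toy_velocity` and `sample_action`, the hard-coded target, and the 4-DoF action layout are all assumptions made for illustration.

```python
from typing import Callable, List

Vector = List[float]


def toy_velocity(x: Vector, t: float, target: Vector) -> Vector:
    """Hypothetical stand-in for a learned velocity field.

    On the straight-line path x_t = (1 - t) * x0 + t * x1, the velocity that
    carries x_t to a known endpoint x1 is (x1 - x_t) / (1 - t).
    """
    return [(goal - xi) / (1.0 - t) for xi, goal in zip(x, target)]


def sample_action(velocity_fn: Callable[[Vector, float], Vector],
                  noise: Vector, steps: int = 10) -> Vector:
    """Euler-integrate dx/dt = v(x, t) from t = 0 (noise) to t = 1 (action)."""
    x = list(noise)
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt  # t stays below 1, so toy_velocity never divides by zero
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Assumed 4-DoF viewpoint command: (forward, lateral, vertical, yaw).
target_action = [1.0, -2.0, 0.5, 0.25]
start_noise = [0.3, 0.8, -1.1, 0.0]
action = sample_action(lambda x, t: toy_velocity(x, t, target_action), start_noise)
```

In a trained system the velocity field would be a neural network conditioned on the UAV's observations and on the vision-language expert's output. Here the target is hard-coded so that the integration loop itself can be checked in isolation.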
Training and Results
To avoid interference between the two experts, the researchers devised a decoupled training strategy with a knowledge insulation mechanism that prevents action gradients from erasing the model's multimodal reasoning ability. The paper reports large gains over state-of-the-art baselines:
| Metric | Improvement over Baselines |
|---|---|
| Average strict success rate | 10.48× higher |
| Average QA correctness | 7.72× higher |
These gains were demonstrated in extensive simulated experiments.
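The article does not say how the knowledge insulation mechanism is implemented. One common way to keep one loss from altering another module's weights is a stop-gradient on the shared features, and the toy example below illustrates that idea with hand-written gradients. It is a sketch under that assumption, not the authors' method; the scalar weights `w_vl` and `w_act` and both function names are hypothetical.

```python
from typing import Tuple


def action_loss_grads(w_vl: float, w_act: float, x: float, y: float,
                      insulate: bool) -> Tuple[float, float]:
    """Gradients of the action loss (pred - y)**2 for a two-weight toy model.

    feature = w_vl * x         (stand-in for the vision-language expert)
    pred    = w_act * feature  (stand-in for the action expert)
    """
    feature = w_vl * x
    pred = w_act * feature
    err = pred - y
    grad_act = 2.0 * err * feature
    # With insulation, the feature is treated as a constant (stop-gradient),
    # so the action loss contributes no gradient to w_vl.
    grad_vl = 0.0 if insulate else 2.0 * err * w_act * x
    return grad_vl, grad_act


def train(w_vl: float, w_act: float, x: float, y: float,
          insulate: bool, steps: int, lr: float = 0.05) -> Tuple[float, float]:
    """Plain gradient descent on the action loss; returns (w_vl, w_act)."""
    for _ in range(steps):
        g_vl, g_act = action_loss_grads(w_vl, w_act, x, y, insulate)
        w_vl -= lr * g_vl
        w_act -= lr * g_act
    return w_vl, w_act


# Insulated: only the action weight moves; w_vl keeps its initial value.
insulated = train(1.0, 0.5, x=2.0, y=4.0, insulate=True, steps=50)
# Not insulated: a single step of the action loss already changes w_vl.
coupled = train(1.0, 0.5, x=2.0, y=4.0, insulate=False, steps=1)
```

With insulation the action weight still fits the target while `w_vl` is left exactly where it started, which is the property the training strategy is described as protecting.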
Real-World Validation
The authors also report a qualitative real-world field study. The paper does not provide specific field results, stating only that the study "verifies the superiority of ScoutVLA over the state-of-the-art baselines."
Implications for Autonomous Systems
ScoutVLA represents a step forward in UAV active perception, with potential applications in infrastructure inspection, search and rescue, and precision agriculture. For enterprise technology leaders, the model's ability to reason about missing evidence and adjust viewpoints autonomously could reduce the need for manual drone piloting and enable more sophisticated autonomous missions. The decoupled architecture also offers a blueprint for integrating vision-language reasoning with continuous control in other robotic domains.