ScoutVLA: New Dual-Expert AI Model Boosts UAV Active Perception for Embodied Question Answering

Researchers introduce ScoutVLA, a vision-language-action model for UAV active perception, reporting a 10.48× higher average strict success rate and 7.72× higher average QA correctness than baselines. The model features a decoupled dual-expert architecture inspired by the waggle dance of scout bees.

iGEN Editorial

June 16, 2026


Unmanned Aerial Vehicles (UAVs) are increasingly used in inspection and surveillance, but their ability to answer natural language questions by actively exploring environments – a task known as Embodied Question Answering (EQA) – has been limited. Existing outdoor EQA systems typically stop once a target enters the UAV's field of view, leaving fine-grained viewpoint adjustments for evidence-seeking questions unresolved. According to a new preprint on arXiv, researchers have introduced ScoutVLA, an evidence-driven Vision-Language-Action (VLA) model designed to address this gap.

"To address this issue, we introduce FG-EQA, a fine-grained active perception EQA benchmark with more than 40K simulated trajectories and 1K real-world trajectories." – from the paper

FG-EQA Benchmark: Fine-Grained Active Perception

The team first developed FG-EQA, a benchmark built specifically for fine-grained active perception. It includes more than 40,000 simulated trajectories and 1,000 real-world trajectories. The benchmark requires UAVs not only to locate targets but also to refine their viewpoint until they have gathered the evidence needed to answer a question.

ScoutVLA Architecture: Dual-Expert Design

ScoutVLA draws inspiration from the "waggle dance" of scout bees, which iteratively adjust flight paths to verify target information. The model employs a decoupled dual-expert architecture:

A vision-language expert that infers semantic intent to identify missing evidence.
An independent action expert that uses high-degree-of-freedom (high-DoF) flow matching to generate continuous viewpoint-refinement trajectories.

This separation is key to balancing the competing demands of continuous control and semantic reasoning.
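To make the action expert's role more concrete, here is a minimal sketch of how flow-matching sampling generally works: start from random noise and integrate a velocity field over time to arrive at an action. This is not the authors' implementation. The function names, the 4-DoF action layout, and the closed-form `toy_velocity` field (which stands in for a trained, observation-conditioned network) are all assumptions made for illustration.

```python
import random


def toy_velocity(x, t, target):
    """Hypothetical stand-in for a learned velocity network.

    For a straight-line (rectified-flow style) path from noise to a target
    action, the ideal velocity at time t is (target - x) / (1 - t).
    A real model would predict this from images and language instead.
    """
    return [(g - xi) / (1.0 - t) for xi, g in zip(x, target)]


def sample_action(target, num_steps=10, seed=0):
    """Integrate dx/dt = v(x, t) from t=0 (noise) to t=1 with Euler steps."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in target]  # start from Gaussian noise
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t stays strictly below 1, so (1 - t) is never zero
        v = toy_velocity(x, t, target)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Assumed 4-DoF viewpoint refinement: (dx, dy, dz, yaw). Values are made up.
refinement = sample_action([0.5, -0.2, 1.0, 0.3])
```

With this idealized velocity field, the Euler integration carries the noise sample to the target action, up to floating-point error. A learned field would only approximate it, which is why the number of integration steps matters in practice.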

Training and Results

To avoid interference between the two experts, the researchers devised a decoupled training strategy with a knowledge insulation mechanism that prevents action gradients from erasing the model's multimodal reasoning ability. The results show significant improvements over state-of-the-art baselines:

Metric	Improvement over Baselines
Average strict success rate	10.48× higher
Average QA correctness	7.72× higher

These gains were demonstrated in extensive simulated experiments.
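The knowledge insulation idea described in this section can be illustrated with a toy example. The article does not say how ScoutVLA implements it, so the sketch below assumes one common approach: a stop-gradient that blocks the action loss from updating the shared vision-language weights. The scalar "model", the parameter names, and the squared-error losses are all hypothetical, and the gradients are written out by hand so the block needs no libraries.

```python
def forward(params, x):
    """Tiny scalar model: a shared backbone feeding two heads (all hypothetical)."""
    h = params["w_backbone"] * x      # shared vision-language feature
    y_vl = params["w_vl"] * h         # reasoning (QA) head output
    y_act = params["w_act"] * h       # action head output
    return h, y_vl, y_act


def gradients(params, x, target_vl, target_act, insulate=True):
    """Hand-derived gradients of two squared-error losses.

    With insulate=True the action loss is treated as if the feature h were a
    constant (a stop-gradient), so it cannot change the backbone weight.
    """
    h, y_vl, y_act = forward(params, x)
    err_vl = y_vl - target_vl
    err_act = y_act - target_act

    # Each head always learns from its own loss.
    grad_vl_head = 2.0 * err_vl * h
    grad_act_head = 2.0 * err_act * h

    # Backbone gradient contributions from each loss (chain rule through h).
    backbone_from_vl = 2.0 * err_vl * params["w_vl"] * x
    backbone_from_act = 2.0 * err_act * params["w_act"] * x
    if insulate:
        backbone_from_act = 0.0  # assumed stop-gradient on the action path

    return {
        "w_backbone": backbone_from_vl + backbone_from_act,
        "w_vl": grad_vl_head,
        "w_act": grad_act_head,
    }


params = {"w_backbone": 0.5, "w_vl": 2.0, "w_act": -1.0}
insulated = gradients(params, x=1.0, target_vl=1.0, target_act=1.0, insulate=True)
coupled = gradients(params, x=1.0, target_vl=1.0, target_act=1.0, insulate=False)
```

In this toy setup the reasoning head already fits its target, so its own gradient is zero. Without insulation, the action loss alone would still push the shared weight (`coupled["w_backbone"]` is nonzero), while with insulation the backbone is left untouched and only the action head is updated.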

Real-World Validation

The authors also report a qualitative real-world field study. The paper does not provide specific field results, stating only that the study "verifies the superiority of ScoutVLA over the state-of-the-art baselines."

Implications for Autonomous Systems

ScoutVLA represents a step forward in UAV active perception, with potential applications in infrastructure inspection, search and rescue, and precision agriculture. For enterprise technology leaders, the model's ability to reason about missing evidence and adjust viewpoints autonomously could reduce the need for manual drone piloting and enable more sophisticated autonomous missions. The decoupled architecture also offers a blueprint for integrating vision-language reasoning with continuous control in other robotic domains.
