FineVLA Framework Raises Real-World Dual-Arm Manipulation Success From 49.9% to 62.7% With Fine-Grained Instructions

Researchers introduce FineVLA, an open framework for fine-grained instruction alignment in vision-language-action (VLA) robot policies. The framework includes a dataset of 47,159 human-verified trajectories, a benchmark with 500 videos and 11,631 atomic facts, and a steerable policy that improves real-world dual-arm manipulation success from 49.9% (raw-only) to 62.7%.

iGEN Editorial

June 16, 2026

Enterprise robotics deployments often struggle when robots must follow detailed execution instructions beyond simple goal-level commands. A new open framework called FineVLA, detailed in a paper on arXiv, addresses this gap by aligning robot actions with fine-grained human instructions about how tasks should be performed.

The framework, developed by researchers including Xintong Huang, Xuhong Zhang, Jinyu Yao, Yutong Sun, and Yuchong Wang, among others, targets a fundamental limitation in existing robot datasets: they typically pair trajectories with coarse goal-level language, leaving out execution-critical details such as active arm, approach direction, and contact region. This missing nuance limits steerable policy learning and robotic video understanding, according to the paper.

FineVLA Components and Dataset

FineVLA includes four main components: (1) a data construction tool that unifies 972,247 trajectories across 85,000 tasks from 10 open-source robot datasets; (2) a human-verified dataset called FineVLA-Data containing 47,159 fine-grained trajectories; (3) a held-out benchmark with 500 videos, 11,631 atomic facts, and 1,030 VQA questions; and (4) a robotics-specialized VLM annotator for scalable fine-grained annotation. The framework also includes a steerable VLA policy trained with controlled mixtures of fine-grained and raw goal-level instructions.

Experimental Results: Fine-Grained Supervision Boosts Success

The paper reports three key findings from experiments. First, fine-grained supervision does not sacrifice goal-level success: FineGrained-only outperforms Raw-only by +1.4 to +8.1 success-rate points across settings. Second, fine-grained and raw instructions are complementary, following a consistent inverted-U trend peaking at a FineGrained:Raw ratio of 1:2 to 1:1.

Setting	FineGrained-only	Raw-only	Best mixed (FineGrained:Raw)	Gain over Raw-only
RoboTwin simulation	not reported	not reported	86.8% and 82.5%	not reported
Real-world dual-arm manipulation	not reported	49.9%	62.7%	+12.8 points

In real-world dual-arm manipulation, the best mixed setting reached 62.7% success rate versus 49.9% for Raw-only, according to the paper. In RoboTwin simulation, the best mixed setting achieved 86.8% and 82.5% success rates.
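The real-world gain can be read either as an absolute difference in percentage points or as a relative increase over the baseline, and the two are easy to confuse. The short sketch below works through both readings using only the two success rates reported above; the function name `improvement` is illustrative and does not come from the paper.

```python
def improvement(baseline_pct: float, new_pct: float) -> tuple[float, float]:
    """Return (absolute gain in percentage points, relative gain in percent)."""
    abs_points = new_pct - baseline_pct
    rel_pct = abs_points / baseline_pct * 100
    # Round to one decimal place to match the precision of the reported rates.
    return round(abs_points, 1), round(rel_pct, 1)


# Raw-only baseline versus best mixed setting, real-world dual-arm manipulation.
points, relative = improvement(49.9, 62.7)
print(f"absolute gain: +{points} points")   # +12.8 points
print(f"relative gain: +{relative}%")       # +25.7%
```

In other words, 62.7% is the success rate that was reached, while the improvement over Raw-only is 12.8 percentage points, or roughly a quarter more successful trials in relative terms.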

Steerable Control Improvements

Third, fine-grained supervision improves steerable control. The largest real-world gains are observed on pose (+23 percentage points), color (+18 points), and approach direction (+18 points), all factors where goal-level instructions provide no guidance.

Overall, the paper's findings suggest that fine-grained language should augment goal-level instructions rather than replace them: specify how to execute a task alongside what to achieve.

Implications for Enterprise Robotics

For enterprise technology decision-makers evaluating robotic automation, FineVLA indicates that incorporating detailed execution instructions can yield meaningful performance gains without sacrificing goal-level success. Because the framework is open, including its dataset, benchmark, and policy infrastructure, organizations can test and adapt the approach for their own robotic systems, from warehouse picking to assembly line manipulation.

The researchers note that the framework uses controlled mixtures of fine-grained and raw instructions, with optimal results at a 1:2 to 1:1 ratio. This suggests that enterprises can augment existing goal-level command interfaces with more specific guidance to improve robot flexibility and task completion.
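To make the idea of a controlled mixture concrete, here is a minimal sketch of how a training set could be assembled at a chosen FineGrained:Raw ratio. It is an assumption-laden illustration, not the paper's procedure: the article does not describe how FineVLA mixes instructions, and the function `mix_instructions`, the field names `raw` and `fine`, and the sample instruction strings are all hypothetical.

```python
import random


def mix_instructions(trajectories, fg_to_raw=(1, 2), seed=0):
    """Assign one instruction per trajectory at a fixed FineGrained:Raw ratio.

    Hypothetical helper: each trajectory is a dict with "id", "raw", and
    "fine" keys. This is an illustrative sketch, not FineVLA's actual code.
    """
    fg, raw = fg_to_raw
    # Integer arithmetic keeps the fine-grained count exact for the ratio.
    n_fine = len(trajectories) * fg // (fg + raw)

    # Shuffle indices with a seeded generator so the split is reproducible.
    indices = list(range(len(trajectories)))
    random.Random(seed).shuffle(indices)
    fine_ids = set(indices[:n_fine])

    mixed = []
    for i, traj in enumerate(trajectories):
        kind = "fine" if i in fine_ids else "raw"
        mixed.append({"id": traj["id"], "kind": kind, "instruction": traj[kind]})
    return mixed


# Hypothetical example data: the same goal phrased two ways.
demo = [
    {
        "id": i,
        "raw": "pick up the cup",
        "fine": "use the left arm to grasp the cup by its handle from the side",
    }
    for i in range(300)
]

mixed = mix_instructions(demo, fg_to_raw=(1, 2))
n_fine = sum(1 for t in mixed if t["kind"] == "fine")
print(n_fine, len(mixed) - n_fine)  # 100 fine-grained, 200 raw
```

Changing `fg_to_raw` to `(1, 1)` gives an even split, which corresponds to the other end of the range the paper reports as performing best.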

Future work could extend FineVLA to more complex industrial scenarios, though the paper does not detail specific integration paths or commercial availability. The framework is available at the project page linked in the paper.
