
FineVLA Framework Raises Robot Instruction-Following Success to 62.7% in Real-World Dual-Arm Manipulation

Researchers introduce FineVLA, an open framework for fine-grained instruction alignment in vision-language-action (VLA) robot policies. The framework includes a dataset of 47,159 human-verified trajectories, a benchmark with 500 videos and 11,631 atomic facts, and a steerable policy that improves real-world dual-arm manipulation success from 49.9% (raw-only) to 62.7%.

iGEN Editorial
June 16, 2026

Enterprise robotics deployments often struggle when robots must follow detailed execution instructions beyond simple goal-level commands. A new open framework called FineVLA, detailed in a paper on arXiv, addresses this gap by aligning robot actions with fine-grained human instructions about how tasks should be performed.

The framework, developed by researchers including Xintong Huang, Xuhong Zhang, Jinyu Yao, Yutong Sun, and Yuchong Wang, among others, targets a fundamental limitation in existing robot datasets: they typically pair trajectories with coarse goal-level language, leaving out execution-critical details such as active arm, approach direction, and contact region. This missing nuance limits steerable policy learning and robotic video understanding, according to the paper.
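To make the distinction concrete, here is a minimal sketch of a record that carries both a goal-level instruction and the execution details the paper names (active arm, approach direction, contact region). The class name, field names, and phrasing template are illustrative assumptions, not the FineVLA-Data schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AnnotatedTrajectory:
    """Hypothetical record; not the actual FineVLA-Data format."""
    goal: str                                  # coarse goal-level instruction
    active_arm: Optional[str] = None           # e.g. "left" or "right"
    approach_direction: Optional[str] = None   # e.g. "from above"
    contact_region: Optional[str] = None       # e.g. "handle"

    def fine_grained_instruction(self) -> str:
        """Append whichever execution details are present to the goal."""
        details = []
        if self.active_arm:
            details.append(f"use the {self.active_arm} arm")
        if self.approach_direction:
            details.append(f"approach {self.approach_direction}")
        if self.contact_region:
            details.append(f"contact the {self.contact_region}")
        if not details:
            return self.goal
        return f"{self.goal}; " + ", ".join(details)


example = AnnotatedTrajectory(
    goal="pick up the mug",
    active_arm="left",
    approach_direction="from above",
    contact_region="handle",
)
print(example.fine_grained_instruction())
# pick up the mug; use the left arm, approach from above, contact the handle
```

A record with only `goal` set falls back to the raw goal-level instruction, which mirrors the coarse labels the paper says most existing datasets provide.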

FineVLA Components and Dataset

FineVLA includes four main components: (1) a data construction tool that unifies 972,247 trajectories across 85,000 tasks from 10 open-source robot datasets; (2) a human-verified dataset called FineVLA-Data containing 47,159 fine-grained trajectories; (3) a held-out benchmark with 500 videos, 11,631 atomic facts, and 1,030 VQA questions; and (4) a robotics-specialized VLM annotator for scalable fine-grained annotation. The framework also includes a steerable VLA policy trained with controlled mixtures of fine-grained and raw goal-level instructions.

Experimental Results: Fine-Grained Supervision Boosts Success

The paper reports three key findings from experiments. First, fine-grained supervision does not sacrifice goal-level success: FineGrained-only outperforms Raw-only by +1.4 to +8.1 success-rate points across settings. Second, fine-grained and raw instructions are complementary, following a consistent inverted-U trend peaking at a FineGrained:Raw ratio of 1:2 to 1:1.
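The article does not describe how the paper builds its instruction mixtures, so the following is only one plausible reading: for each training sample, keep the fine-grained instruction with a probability set by the FineGrained:Raw ratio. The function name `mix_instructions` and its parameters are hypothetical.

```python
import random
from typing import List, Tuple


def mix_instructions(
    pairs: List[Tuple[str, str]],
    fg_weight: float,
    raw_weight: float,
    seed: int = 0,
) -> List[str]:
    """For each (fine_grained, raw) pair, keep one instruction.

    The fine-grained version is chosen with probability
    fg_weight / (fg_weight + raw_weight), so weights 1 and 2
    correspond to a FineGrained:Raw ratio of 1:2.
    """
    if fg_weight < 0 or raw_weight < 0 or fg_weight + raw_weight == 0:
        raise ValueError("weights must be non-negative and not both zero")
    p_fg = fg_weight / (fg_weight + raw_weight)
    rng = random.Random(seed)  # seeded for reproducible mixtures
    return [fg if rng.random() < p_fg else raw for fg, raw in pairs]


# Placeholder strings stand in for real instruction text.
pairs = [("FG", "RAW")] * 3000
mixed = mix_instructions(pairs, fg_weight=1, raw_weight=2)
fraction_fg = mixed.count("FG") / len(mixed)
print(f"fine-grained share: {fraction_fg:.2f}")  # close to 1/3 for a 1:2 ratio
```

Sweeping `fg_weight` and `raw_weight` over several ratios is the kind of experiment that would expose the inverted-U trend the paper reports.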

Setting                            FineGrained-only   Raw-only     Best mixed (FG:Raw)   Improvement over Raw-only
RoboTwin simulation                not stated         not stated   86.8% / 82.5%         not stated
Real-world dual-arm manipulation   not stated         49.9%        62.7%                 +12.8 points

In real-world dual-arm manipulation, the best mixed setting reached 62.7% success rate versus 49.9% for Raw-only, according to the paper. In RoboTwin simulation, the best mixed setting achieved 86.8% and 82.5% success rates.
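The real-world figures can be read two ways, and the difference matters when comparing against other reports. The short calculation below uses only the two success rates quoted above; the variable names are illustrative.

```python
raw_only = 49.9     # real-world success rate, Raw-only (%)
best_mixed = 62.7   # real-world success rate, best mixed setting (%)

# Absolute gain, in percentage points
absolute_gain = round(best_mixed - raw_only, 1)

# Relative gain, as a percentage of the Raw-only baseline
relative_gain = round(100 * (best_mixed - raw_only) / raw_only, 1)

print(f"absolute: +{absolute_gain} points")   # absolute: +12.8 points
print(f"relative: +{relative_gain}%")         # relative: +25.7%
```

In other words, 62.7% is the final success rate, not the size of the improvement: the gain is 12.8 percentage points, or roughly a quarter over the Raw-only baseline.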

Steerable Control Improvements

Third, fine-grained supervision improves steerable control. The largest real-world gains are observed on pose (+23 percentage points), color (+18), and approach direction (+18)—factors where goal-level instructions provide no guidance.

Overall, fine-grained language should augment goal-level instructions: specifying how to execute alongside what to achieve.

Implications for Enterprise Robotics

For enterprise technology decision-makers evaluating robotic automation, FineVLA suggests that adding detailed execution instructions can yield meaningful performance gains without sacrificing goal-level success. Because the framework is open, including its dataset, benchmark, and policy infrastructure, organizations can test and adapt the approach for their own robotic systems, from warehouse picking to assembly line manipulation.

The researchers note that the framework uses controlled mixtures of fine-grained and raw instructions, with the best results at a FineGrained:Raw ratio of 1:2 to 1:1. This suggests that enterprises can augment existing goal-level command interfaces with more specific guidance to improve robot flexibility and task completion.

Future work could extend FineVLA to more complex industrial scenarios, though the paper does not detail specific integration paths or commercial availability. The framework is available at the project page linked in the paper.

