vision language models

9 stories

Artificial Intelligence #vision language models #computer vision

New Research Reveals How Visual Tokens Evolve Inside Vision-Language Models

A new computer vision paper on arXiv investigates how visual tokens are integrated into large language models (LLMs) under two paradigms: in-context prompting and layer-wise injection. The authors find that visual tokens enter the LLM as 'disguised visual context' that lacks linguistic structure, and that they then evolve differently depending on the integration architecture. They also show that attention allocation alone does not explain performance, which instead depends on the quality of the visual representations at each layer.

Jul 8, 2026 1 source

RTSGameBench Benchmark Tests Strategic Reasoning in Vision-Language Models

Technology

Artificial Intelligence #rtsgamebench #benchmark

A new benchmark called RTSGameBench evaluates strategic reasoning in vision-language models (VLMs) using the real-time strategy game Beyond All Reason. The benchmark includes diagnostic mini-games, diverse matchup structures, and a self-evolving generation framework. Initial tests show that state-of-the-art VLMs struggle as tasks demand tighter coordination, involve multiple agents, or grow in scale.

Jun 21, 2026 1 source

The Scaffold Effect: How Prompt Framing Skews AI Evaluation in Clinical Vision-Language Models

Technology

Artificial Intelligence #artificial intelligence #vision-language models

A study on arXiv that evaluated 12 open-weight vision-language models (VLMs) on clinical neuroimaging datasets found that up to 58% of apparent multimodal performance gains stem from prompt framing rather than genuine reasoning. The researchers identified a 'scaffold effect': merely mentioning in the task prompt that MRI is available accounts for 70-80% of the F1 improvement, even when no imaging data is supplied. Expert evaluation also revealed that models fabricated neuroimaging-grounded justifications, raising concerns about the reliability of VLM evaluations in clinical settings.

Jun 20, 2026 1 source
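
One way to read the '70-80% of the F1 improvement' figure is as a ratio of gains. The sketch below assumes the share is the F1 gain from a prompt that only mentions MRI (with no images attached) divided by the full gain from actually supplying MRI data; the function name `scaffold_share` and all F1 values are hypothetical, and the study's exact decomposition may differ.

```python
def scaffold_share(f1_baseline: float, f1_mri_mentioned: float, f1_multimodal: float) -> float:
    """Fraction of the multimodal F1 gain already reached by only mentioning MRI.

    Hypothetical decomposition: (mention-only gain) / (full multimodal gain).
    """
    total_gain = f1_multimodal - f1_baseline
    if total_gain <= 0:
        raise ValueError("the multimodal setup must improve on the baseline")
    return (f1_mri_mentioned - f1_baseline) / total_gain


# Hypothetical F1 scores: text-only baseline, a prompt that merely mentions MRI
# (no image attached), and the full setup with MRI data supplied.
share = scaffold_share(f1_baseline=0.60, f1_mri_mentioned=0.68, f1_multimodal=0.70)
print(f"{share:.0%} of the gain comes from prompt framing")
```

With these made-up scores, the mention alone recovers 80% of the gain, the upper end of the reported range.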

Neuro-Inspired Vision-Language Models Show Resilience to Membership Inference Privacy Leakage

Technology

Artificial Intelligence #ai #privacy

A new study explores whether neuro-inspired multimodal vision-language models (VLMs) are resilient to membership inference attacks (MIAs), which try to determine whether a given sample was part of a model's training data. The authors found that NEURO VLMs, which use topological regularization, reduce MIA success by up to 24% without sacrificing model utility, offering a promising path toward secure AI deployment.

Jun 17, 2026 1 source
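
For readers unfamiliar with MIAs, a common baseline attack guesses that a sample was in the training set when the model's loss on it is unusually low. The sketch below implements that loss-threshold attack on hypothetical per-sample losses and reports its balanced accuracy; the threshold, the loss values, and the choice of metric are assumptions for illustration, not details of the study (the summary does not say whether its 24% is a relative or absolute reduction).

```python
from typing import Sequence


def loss_threshold_mia(member_losses: Sequence[float],
                       nonmember_losses: Sequence[float],
                       threshold: float) -> float:
    """Balanced accuracy of a loss-threshold membership inference attack.

    A sample is predicted to be a training member when its loss is below
    `threshold`. A value of 0.5 means the attacker does no better than chance.
    """
    true_pos_rate = sum(loss < threshold for loss in member_losses) / len(member_losses)
    true_neg_rate = sum(loss >= threshold for loss in nonmember_losses) / len(nonmember_losses)
    return (true_pos_rate + true_neg_rate) / 2


# Hypothetical per-sample losses: training members tend to have lower loss.
members = [0.1, 0.2, 0.3, 0.9]
nonmembers = [0.4, 0.8, 1.1, 1.5]
print(loss_threshold_mia(members, nonmembers, threshold=0.35))
```

A model that leaks less membership information pushes this score toward 0.5, because member and non-member losses become harder to tell apart.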

New Method Detects 'Mirage' Answers in Vision-Language Models Before Generation

Technology

Artificial Intelligence #vision language models #ai

A new study introduces Text-Conditioned Layer-wise Internal Alignment (TC-LIA), a method for detecting 'mirage' answers in vision-language models (VLMs) before they are generated. Tested across twelve VLM backbones, the approach achieves up to 94.7% accuracy and reduces mirage rates to as low as 2.8%. Such early detection is especially important for medical and document visual question answering (VQA) applications.

Jun 17, 2026 1 source

GEASS: Gated Evidence-Adaptive Selective Caption Trust Tackles VLM Hallucination

Technology

Artificial Intelligence #geass #gated evidence-adaptive

Vision-language models often hallucinate objects, and feeding them their own captions can actually worsen accuracy. Researchers propose GEASS, a gated evidence-adaptive module that decides per query how much of the caption to trust, improving accuracy across four VLMs on two benchmarks without training or additional parameters.

Jun 16, 2026 1 source
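
The summary does not specify how GEASS computes its gate, so the sketch below is only a toy illustration of per-query caption trust, not the published method. It assumes a hypothetical `evidence_score` that measures how well the image supports the caption for a given query, and uses a sigmoid gate with no learned parameters to blend the answer distribution obtained with the caption and the one obtained from the image alone.

```python
import math
from typing import List, Sequence


def gated_answer_distribution(p_with_caption: Sequence[float],
                              p_image_only: Sequence[float],
                              evidence_score: float) -> List[float]:
    """Blend two answer distributions with a per-query trust gate.

    Hypothetical: a positive `evidence_score` means the image supports the
    caption, so the caption-conditioned distribution gets more weight.
    """
    if len(p_with_caption) != len(p_image_only):
        raise ValueError("both distributions must cover the same answers")
    gate = 1.0 / (1.0 + math.exp(-evidence_score))  # sigmoid, strictly in (0, 1)
    # A convex combination of two distributions is itself a distribution.
    return [gate * a + (1.0 - gate) * b for a, b in zip(p_with_caption, p_image_only)]


# A query where the image contradicts the caption, so the caption is trusted less.
print(gated_answer_distribution([0.9, 0.1], [0.3, 0.7], evidence_score=-2.0))
```

Because the gate is recomputed for every query, a caption can dominate where the image backs it up and be largely ignored where it does not.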

Prompt-Driven AI Models Enable On-Orbit Spacecraft Inspection Without Retraining

Technology

Artificial Intelligence #vision-language models #spacecraft inspection

Researchers demonstrate that prompt-driven vision-language models can perform zero-shot instance segmentation of spacecraft components on orbit without modifying onboard weights, allowing new semantic categories to be added after launch. The approach achieves 0.385 mAP@0.5 on a test set of 129 images of unseen satellites, performing well on large structures but struggling with fine-scale appendages. Structured prompts improve accuracy by up to 82% over simple category names.

Jun 16, 2026 1 source
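
mAP@0.5 counts a predicted instance as correct only when its mask overlaps a ground-truth mask with an intersection-over-union (IoU) of at least 0.5; precision is then averaged over recall levels and categories. The sketch below covers only that matching step, using hypothetical square masks, and shows one plausible reason fine-scale appendages are harder: the same one-pixel misalignment costs a small mask far more IoU than a large one. It is not the authors' evaluation code.

```python
from __future__ import annotations


def mask_iou(pred: set[tuple[int, int]], truth: set[tuple[int, int]]) -> float:
    """Intersection-over-union of two masks given as sets of (row, col) pixels."""
    union = pred | truth
    return len(pred & truth) / len(union) if union else 0.0


def square(top: int, left: int, size: int) -> set[tuple[int, int]]:
    """A size x size square mask whose top-left pixel is (top, left)."""
    return {(r, c) for r in range(top, top + size) for c in range(left, left + size)}


IOU_THRESHOLD = 0.5  # the "@0.5" in mAP@0.5

# Apply the same one-pixel horizontal offset to a small and a large structure.
for size in (2, 10):
    iou = mask_iou(square(0, 1, size), square(0, 0, size))
    print(f"{size}x{size} mask: IoU={iou:.3f}, counted as a match: {iou >= IOU_THRESHOLD}")
```

The 2x2 mask drops to an IoU of about 0.333 and is rejected, while the 10x10 mask keeps about 0.818 and still counts as a match.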

TimeVista: Researchers Use Vision-Language Models as Judges for Time Series Forecasting Evaluation

Technology

Artificial Intelligence #time series forecasting #vision-language models

Researchers propose using vision-language models (VLMs) as judges for evaluating time series forecasts, addressing the limitations of traditional point-wise metrics. They introduce TimeVista, a benchmark of 5,563 samples, show that VLM judges agree with human preferences significantly more consistently than conventional metrics do, and also assess Time Series Foundation Models.

Jun 16, 2026 1 source
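
A well-known weakness of point-wise metrics such as mean absolute error (MAE) is the 'double penalty': a forecast that predicts a spike one step late is charged twice, once for missing the real spike and once for its own misplaced one, so it can score worse than a forecast that ignores the spike altogether. The sketch below demonstrates this on made-up series; it illustrates a general limitation and is not taken from TimeVista.

```python
from __future__ import annotations


def mae(actual: list[float], forecast: list[float]) -> float:
    """Mean absolute error, a typical point-wise forecasting metric."""
    if len(actual) != len(forecast) or not actual:
        raise ValueError("series must be non-empty and of equal length")
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)


actual = [0, 0, 10, 0, 0]       # a single spike at step 2
flat = [0, 0, 0, 0, 0]          # ignores the spike entirely
late_spike = [0, 0, 0, 10, 0]   # right shape, one step late

print("flat forecast MAE:", mae(actual, flat))
print("late-spike forecast MAE:", mae(actual, late_spike))
```

Here the flat forecast gets an MAE of 2.0 and the late spike 4.0, even though a human reviewer would likely prefer the forecast that captures the spike's shape.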

Open-Source Binary Tracking Boosts Robot Navigation Accuracy by 22.8% Without Cloud Dependence

Technology

Artificial Intelligence #binary tracking #spatial qa

BinTrack, a fully open-source spatial-localization agent, enables robots to answer spatial queries without relying on closed-source cloud models. It improves accuracy by up to 22.8% over other open-source implementations and matches GPT-4o on the challenging SpaceLocQA benchmark, with a 1.5x inference speedup. The research also introduces GangnamLoop, a real-world multi-trip dataset collected with a quadruped robot on public streets.

Jun 16, 2026 1 source