interpretability

7 stories

Artificial Intelligence #benchmark #text-to-video

BRITE Benchmark Reveals Critical Gaps in Text-to-Video Models' Object-Action Binding and Audio-Visual Sync

A new benchmark called BRITE provides the first unified framework for evaluating text-to-video (T2V) models on implausible prompts, audio-visual consistency, and interpretable QA-based assessment. Testing five state-of-the-art models including Sora 2 and Veo 3.1, BRITE reveals that while models excel at static object composition, they show significant degradation in object-action binding and audio-visual synchronization.

Jun 16, 2026 1 source

Cascaded Sparse Autoencoders Enable Hierarchical Visual Concept Learning in Multimodal LLMs

Technology

Artificial Intelligence #cascaded sparse autoencoders #multimodal llms

Researchers introduce cascaded sparse autoencoders (CSAEs) that learn hierarchical visual concepts in multimodal large language models. By training a second-level SAE on the decoder weights of the first, CSAEs achieve 'concepts of concepts' without nesting or stacking bottlenecks. Experiments on Qwen3-VL, Gemma-3, and LLaVA show improved interpretability and effective group-level steering.

Jun 16, 2026 1 source

New Research Demystifies Variance in Circuit Discovery of Large Language Models

Technology

Artificial Intelligence #llms #circuit discovery

A new research paper examines variance in circuit discovery for large language models and identifies three sources: resampling, rephrasing, and sample-wise variance. The authors propose CEAP, a method that improves on EAP-IG and comes with theoretical guarantees. They argue that rephrasing variance makes comprehensive circuits hard to find, which suggests LLMs may be inherently difficult to steer.

Jun 16, 2026 1 source

DifFRACT Brings Circuit Tracing to Diffusion Transformers for Better AI Interpretability

Technology

Artificial Intelligence #diffusion models #ai

Researchers introduce DifFRACT, a method for mechanistic interpretability of multimodal diffusion transformers. By training timestep-conditioned transcoders on FLUX.1[schnell], they achieve exact feature-to-feature attribution and recover compact circuits, outperforming sparse autoencoders in precision.

Jun 16, 2026 1 source

New DAG-SHAP Method Improves Feature Attribution Using Edge Intervention in Directed Acyclic Graphs

Technology

Artificial Intelligence #feature attribution #directed acyclic graphs

Researchers introduce DAG-SHAP, a feature attribution method for directed acyclic graphs that uses edge intervention to address limitations of node-centric Shapley value approaches. The method captures both externality and exogenous influence and is validated on real and synthetic datasets.

Jun 16, 2026 1 source

New Orthogonal Projection Method Reduces Hallucinations in Vision-Language AI Explanations

Technology

Artificial Intelligence #hallucinations #ai

Researchers propose Orthogonal Semantic Projection (OSP), a geometric intervention that reduces semantic hallucination in Vision-Language Model explanations. The method orthogonalizes query vectors against distractor concepts, improving attribution fidelity for safety-critical AI applications.

Jun 16, 2026 1 source
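
For readers unfamiliar with the geometry, the idea of orthogonalizing a query vector against distractor concepts can be sketched with plain linear algebra. This is a generic illustration of orthogonal projection, not the OSP authors' implementation: the function names (`project_out`, `orthonormal_basis`) and the toy vectors are assumptions made for the example, and the summary above does not specify how the paper builds its query or distractor vectors.

```python
import math

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def orthonormal_basis(vectors, eps=1e-12):
    """Gram-Schmidt: orthonormal basis for the span of `vectors`."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        norm = math.sqrt(dot(w, w))
        if norm > eps:  # skip vectors already in the span
            basis.append([wi / norm for wi in w])
    return basis

def project_out(query, distractors):
    """Remove from `query` its component inside the distractor span.

    Hypothetical helper: the result is orthogonal to every distractor.
    """
    q = list(query)
    for b in orthonormal_basis(distractors):
        c = dot(q, b)
        q = [qi - c * bi for qi, bi in zip(q, b)]
    return q

# Toy example (made-up vectors, 3 dimensions).
query = [1.0, 2.0, 3.0]
distractors = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
cleaned = project_out(query, distractors)
print(cleaned)  # [0.0, 0.0, 3.0]
```

In the toy example the two distractors span the first two axes, so only the third component of the query survives the projection.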

New Definition of Good Explanations Highlights Challenges in Explaining LLM Outputs

Technology

Artificial Intelligence #llm #explanation

A recent arXiv paper by Louis Mahon, Elliot Ford, and Callum Hackett proposes a definition of good explanations that is inspired by counterfactual explanations but also incorporates the interlocutor's prior beliefs. The authors explore the ramifications for AI explainability, particularly why LLM outputs are difficult to explain well.

Jun 16, 2026 1 source