AnchorEdit: Autoregressive Diffusion Tackles Identity Drift in Multi-Turn Image Editing

Researchers propose AnchorEdit, the first autoregressive diffusion-based framework for multi-turn image editing, addressing identity drift and error accumulation via a three-stage training curriculum and a causal memory mechanism. The method achieves state-of-the-art subject fidelity and instruction following over extended editing trajectories.

iGEN Editorial

June 16, 2026

Multi-turn image editing is essential for iterative design workflows, but existing models suffer from identity drift and error accumulation as edits are applied sequentially. According to an arXiv paper dated June 10, 2026, by Xu, Hang; Ma, Xiaoxiao; Zhang, Guohui; Yu, Fu; Siming, Huang; Jie, Lin; Haoyang, Song; Nan, Duan; and Feng, Zhao, current approaches that leverage video priors rely on bidirectional attention, in which every frame can attend to every other frame, including later ones. That is fundamentally misaligned with interactive editing, where each edit can depend only on what came before it. To address this, the researchers introduce AnchorEdit, which they describe as the first autoregressive (AR) diffusion-based framework designed for high-resolution, long-term multi-turn editing.

The Challenge of Temporal Consistency

In iterative design, whether for product prototyping, marketing visuals, or architectural renderings, users often need to make multiple successive edits to an image while preserving the identity of key subjects. Existing models, including those based on video priors, gradually lose fidelity as changes accumulate. The paper notes that bidirectional attention does not respect the sequential order of edits, which leads to inconsistencies. AnchorEdit aims to close this gap between video priors and causal inference with a three-stage training curriculum and an inference-time memory mechanism.
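The difference between bidirectional and causal attention can be illustrated at the level of editing turns. The sketch below is only an illustration of the general idea, not the paper's architecture: the function name attention_mask and the turn-level granularity are assumptions made for this example.

```python
def attention_mask(num_turns, causal):
    """Return a num_turns x num_turns matrix of booleans.

    mask[i][j] is True when turn i is allowed to attend to turn j.
    Hypothetical helper for illustration; real models mask at token level.
    """
    return [
        [(j <= i) if causal else True for j in range(num_turns)]
        for i in range(num_turns)
    ]


# Bidirectional: every turn sees every other turn, including future ones.
bidirectional_mask = attention_mask(3, causal=False)

# Causal: turn i sees only turns 0..i, matching how edits actually arrive.
causal_mask = attention_mask(3, causal=True)

for row in causal_mask:
    print(["x" if allowed else "." for allowed in row])
```

In the causal mask the upper triangle is blocked, so an edit made in round two can never be influenced by an instruction that the user has not yet given.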

AnchorEdit: A Novel Autoregressive Diffusion Framework

Instead of processing all edits simultaneously, AnchorEdit treats each editing step as a causal, sequential update that conditions only on earlier turns. The core innovation lies in its training recipe and inference strategy. During inference, a memory mechanism anchors the initial subject identity so that the model can extrapolate stably across extended editing trajectories, which the paper puts at 10 or more interaction rounds.
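One way to picture a sequential editing loop with an identity anchor is shown below. This is a minimal sketch under stated assumptions: the article does not say how the memory is implemented, so keeping the original frame plus a sliding window of recent frames is a guess, and build_context, edit_session, and the toy edit function are all hypothetical names.

```python
def build_context(history, window=2):
    """Pick the frames the next edit may condition on.

    history: frames in order, oldest first; history[0] is the original
    image and is always kept as the identity anchor (an assumption here).
    window: how many of the most recent edited frames to keep.
    """
    if not history:
        raise ValueError("history must contain at least the source image")
    anchor = history[0]
    recent = history[1:][-window:] if window > 0 else []
    return [anchor] + recent


def edit_session(source, instructions, edit_fn, window=2):
    """Apply instructions one at a time; each step sees only past frames."""
    history = [source]
    for instruction in instructions:
        context = build_context(history, window)
        history.append(edit_fn(context, instruction))
    return history


def toy_edit(context, instruction):
    # Stand-in for a diffusion model: tags the latest frame with the edit.
    return f"{context[-1]}+{instruction}"


frames = edit_session("img0", ["hat", "scarf", "night"], toy_edit)
print(build_context(frames, window=1))
```

However long the session runs, the first element of the context is always the original image, which is the property the anchoring idea relies on.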

Three-Stage Training Curriculum

The training process consists of three distinct stages, each designed to prepare the model for long-horizon consistency:

Stage	Purpose	Method
1. Identity-preserving single-turn pretraining	Enable the model to learn high-fidelity single-turn edits without drift	Standard diffusion training on single editing steps
2. Causal AR forcing fine-tuning	Teach the model to handle sequential dependencies and mitigate exposure bias	Novel self-rollout strategy where the model generates its own context during training
3. Consistency distillation	Speed up inference to four steps while preserving quality	Distillation into a student model that generates high-quality outputs in fewer steps

According to the paper, this curriculum ensures that AnchorEdit can maintain subject fidelity and follow complex instructions even over 10+ interaction rounds.
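The exposure bias that Stage 2 targets can be shown with a toy example. The sketch below assumes nothing about the paper's actual training code: images are stood in for by integers, toy_model is a hypothetical model that makes a constant error of 1 per step, and training_contexts is an illustrative helper.

```python
def training_contexts(model, source, instructions, targets, self_rollout):
    """Return the context frame the model sees before each turn.

    self_rollout=False: teacher forcing, the context is the ground truth.
    self_rollout=True: the context is the model's own previous output.
    """
    contexts = []
    previous = source
    for instruction, target in zip(instructions, targets):
        contexts.append(previous)
        prediction = model(previous, instruction)
        previous = prediction if self_rollout else target
    return contexts


def toy_model(previous, instruction):
    # Hypothetical imperfect model: applies the edit but overshoots by 1.
    return previous + instruction + 1


instructions = [10, 10, 10]
targets = [10, 20, 30]

teacher_forced = training_contexts(toy_model, 0, instructions, targets, False)
rolled_out = training_contexts(toy_model, 0, instructions, targets, True)

print(teacher_forced)
print(rolled_out)
```

With teacher forcing the model only ever sees clean contexts, while at inference time it must work from its own slightly wrong outputs. Training on rolled-out contexts exposes it to that accumulated error, which is the general motivation behind self-rollout strategies.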

Inference Memory Mechanism and Benchmarking

During inference, AnchorEdit employs a memory mechanism that persistently retains the initial subject identity from the first editing round. This prevents the model from 'forgetting' the original appearance as later transformations are applied. To evaluate performance, the authors introduce a new high-resolution multi-turn editing benchmark designed specifically to stress-test long-horizon stability. According to the paper, extensive experiments show AnchorEdit achieving state-of-the-art subject fidelity and instruction adherence across prolonged editing sequences.
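Identity drift over a long session can be quantified in many ways, and the article does not say which metrics the benchmark uses. As one hedged illustration, the sketch below scores each edited frame by the cosine similarity between its subject embedding and that of the original image. The embeddings, the identity_scores helper, and the choice of cosine similarity are all assumptions for this example.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        raise ValueError("embeddings must be non-zero")
    return dot / norm


def identity_scores(embeddings):
    """Compare every later turn against the first frame's embedding.

    embeddings[0] is the original subject; a falling score across turns
    would indicate identity drift.
    """
    anchor = embeddings[0]
    return [cosine_similarity(anchor, e) for e in embeddings[1:]]


# Made-up 2D embeddings: turn 1 keeps the subject, turn 2 loses it entirely.
scores = identity_scores([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
print(scores)
```

In practice the embeddings would come from a subject or face encoder, and a stable method would keep the score close to its first-turn value even after many rounds.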

The work represents a significant step forward for applications requiring iterative, high-quality image manipulation, such as product design, advertising, and content creation. By aligning the model's architecture with the causal nature of editing, AnchorEdit opens the door to more reliable and practical multi-turn editing tools.
