Multi-turn image editing is essential for iterative design workflows, but existing models suffer from identity drift and error accumulation as edits are applied sequentially. According to a recent arXiv paper by Xu, Hang; Ma, Xiaoxiao; Zhang, Guohui; Yu, Fu; Siming, Huang; Jie, Lin; Haoyang, Song; Nan, Duan; and Feng, Zhao, dated June 10, 2026, current approaches that leverage video priors rely on bidirectional attention, which is fundamentally misaligned with the causal, sequential nature of interactive editing. To address this, the researchers introduce AnchorEdit, the first autoregressive (AR) diffusion-based framework designed for high-resolution, long-term multi-turn editing.
The Challenge of Temporal Consistency
In iterative design, whether for product prototyping, marketing visuals, or architectural renderings, users often need to make multiple successive edits to an image while preserving the identity of key subjects. Existing models, including those based on video priors, gradually lose fidelity as changes accumulate. The paper notes that bidirectional attention does not account for the sequential order of edits, which leads to inconsistencies. AnchorEdit bridges the gap between video priors and causal inference through a novel three-stage training curriculum and an inference-time memory mechanism.
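To make the attention mismatch concrete, the sketch below builds turn-level attention masks for the bidirectional and causal cases. It is an illustration only: the function name `turn_attention_mask` and the turn-level granularity are assumptions made here, not details from the paper, and real models apply masks over tokens or latent patches rather than whole turns.

```python
from typing import List


def turn_attention_mask(num_turns: int, causal: bool) -> List[List[bool]]:
    """Return mask[i][j], which is True when turn i may attend to turn j.

    Hypothetical helper for illustration; not taken from the paper.
    """
    return [
        [(j <= i) if causal else True for j in range(num_turns)]
        for i in range(num_turns)
    ]


def show(mask: List[List[bool]]) -> str:
    """Render a mask as rows of '1' (visible) and '.' (hidden)."""
    return "\n".join(" ".join("1" if v else "." for v in row) for row in mask)


if __name__ == "__main__":
    print("bidirectional: every turn sees every other turn, including future edits")
    print(show(turn_attention_mask(4, causal=False)))
    print()
    print("causal: each turn sees only itself and earlier turns")
    print(show(turn_attention_mask(4, causal=True)))
```

In the bidirectional mask, the first turn can attend to edits that have not been requested yet, which cannot happen in an interactive session. The causal mask matches what is actually available at each turn.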
AnchorEdit: A Novel Autoregressive Diffusion Framework
Instead of processing all edits simultaneously, AnchorEdit treats each editing step as a causal, sequential update conditioned on what came before. The core innovation lies in its training recipe and inference strategy. During inference, a memory mechanism anchors the initial subject identity, ensuring stable extrapolation across extended editing trajectories, so the model can stay consistent over long editing sessions.
Three-Stage Training Curriculum
The training process consists of three distinct stages, each designed to prepare the model for long-horizon consistency:
| Stage | Purpose | Method |
|---|---|---|
| 1. Identity-preserving single-turn pretraining | Enable the model to learn high-fidelity single-turn edits without drift | Standard diffusion training on single editing steps |
| 2. Causal AR forcing fine-tuning | Teach the model to handle sequential dependencies and mitigate exposure bias | Novel self-rollout strategy where the model generates its own context during training |
| 3. Consistency distillation | Speed up inference to four steps while preserving quality | Distillation into a student model that generates high-quality outputs in fewer steps |
According to the paper, this curriculum ensures that AnchorEdit can maintain subject fidelity and follow complex instructions even over 10+ interaction rounds.
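The exposure-bias problem that stage 2 targets can be illustrated with a toy example. The sketch below contrasts teacher forcing, where the training context is the ground-truth history, with self-rollout, where the context is whatever the model itself produced. Everything here is a simplifying assumption: images are stood in for by integers, `biased_editor` is a hypothetical model that overshoots by one unit per edit, and the paper's actual rollout procedure is not described in this summary.

```python
from typing import Callable, List, Sequence

# A toy "editor": takes the history so far and an instruction, returns the next state.
EditFn = Callable[[Sequence[int], int], int]


def biased_editor(history: Sequence[int], instruction: int) -> int:
    """Hypothetical imperfect model: applies the edit but overshoots by 1."""
    return history[-1] + instruction + 1


def teacher_forced_contexts(ground_truth: Sequence[int]) -> List[List[int]]:
    """Context for each turn is the clean ground-truth history before it."""
    return [list(ground_truth[:t]) for t in range(1, len(ground_truth))]


def self_rollout_contexts(
    source: int, instructions: Sequence[int], model: EditFn
) -> List[List[int]]:
    """Context for each turn is the model's own earlier outputs, errors included."""
    history = [source]
    contexts = []
    for instruction in instructions:
        contexts.append(list(history))  # what the model conditions on at this turn
        history.append(model(history, instruction))
    return contexts


if __name__ == "__main__":
    instructions = [10, 10, 10]
    ground_truth = [0, 10, 20, 30]
    print(teacher_forced_contexts(ground_truth))
    print(self_rollout_contexts(0, instructions, biased_editor))
```

With teacher forcing, the model only ever sees clean histories during training, yet at inference it must condition on its own drifting outputs. Training on self-generated contexts exposes it to that drift, which is the general idea behind the self-rollout strategy named in the table.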
Inference Memory Mechanism and Benchmarking
During inference, AnchorEdit employs a memory mechanism that persistently retains the initial subject identity from the first editing round. This prevents the model from 'forgetting' the original appearance as later transformations are applied. To evaluate performance, the authors introduce a new high-resolution multi-turn editing benchmark designed specifically to stress-test long-horizon stability. Extensive experiments, as reported in the paper, demonstrate that AnchorEdit achieves state-of-the-art results, with exceptional subject fidelity and instruction adherence across prolonged editing sequences.
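One simple way to picture such a memory is a buffer that pins the first round and keeps only a short window of recent rounds. The sketch below is a minimal illustration under that assumption: the class name `AnchoredMemory`, the sliding window of recent turns, and the `recent_capacity` parameter are all hypothetical, and the paper's actual mechanism may differ.

```python
from collections import deque
from typing import Deque, Generic, List, TypeVar

T = TypeVar("T")


class AnchoredMemory(Generic[T]):
    """Keeps the first item forever plus a bounded window of the latest items.

    Hypothetical structure for illustration; not the paper's implementation.
    """

    def __init__(self, recent_capacity: int) -> None:
        if recent_capacity < 1:
            raise ValueError("recent_capacity must be at least 1")
        self._has_anchor = False
        self._anchor: List[T] = []  # holds at most one item
        self._recent: Deque[T] = deque(maxlen=recent_capacity)

    def add(self, item: T) -> None:
        """Pin the first item as the anchor; later items go to the window."""
        if not self._has_anchor:
            self._anchor.append(item)
            self._has_anchor = True
        else:
            self._recent.append(item)  # oldest window entry is evicted when full

    def context(self) -> List[T]:
        """Return the anchor followed by the most recent items, oldest first."""
        return [*self._anchor, *self._recent]


if __name__ == "__main__":
    memory: AnchoredMemory[str] = AnchoredMemory(recent_capacity=2)
    for turn in range(6):
        memory.add(f"turn{turn}")
    print(memory.context())
```

However many rounds are added, the first entry is never evicted, so the conditioning context always includes the original subject alongside the latest edits while its size stays bounded.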
The work is a step toward applications that require iterative, high-quality image manipulation, such as product design, advertising, and content creation. By aligning the model's architecture with the causal nature of editing, AnchorEdit aims to make multi-turn editing tools more reliable and practical.