One challenge in code generation is customizing existing programs that produce visual output, such as those written in TikZ, a graphics language for LaTeX. Unlike generating code from scratch, editing requires localized, semantics-preserving changes. A recent empirical study posted to arXiv investigates whether iterative refinement remains effective when the verifier providing feedback is itself unreliable.
Researchers evaluated multiple LLM-based and tool-augmented visual verifiers within iterative refinement pipelines on TikZ code customization tasks. They defined visual code customization as an iterative editing problem with an imperfect oracle and manually annotated refinement trajectories to assess verifier behavior and feedback quality.
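The refinement setup described above can be sketched as a small loop. This is a minimal illustration under stated assumptions, not the study's pipeline: `refine`, `Verdict`, `toy_edit`, and `toy_verify` are hypothetical names, and the toy functions stand in for an LLM editor and a visual verifier that would normally compile the TikZ code and inspect the rendered image.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Verdict:
    accepted: bool  # verifier believes the instruction is satisfied
    feedback: str   # guidance passed to the next editing round


def refine(
    code: str,
    instruction: str,
    edit_fn: Callable[[str, str, str], str],
    verify_fn: Callable[[str, str], Verdict],
    max_rounds: int = 3,
) -> Tuple[str, List[Verdict]]:
    """Edit `code` until the (possibly wrong) verifier accepts or rounds run out."""
    history: List[Verdict] = []
    feedback = ""
    for _ in range(max_rounds):
        code = edit_fn(code, instruction, feedback)
        verdict = verify_fn(code, instruction)
        history.append(verdict)
        if verdict.accepted:
            break
        feedback = verdict.feedback
    return code, history


# Hypothetical stand-ins so the sketch runs without a model or LaTeX toolchain.
def toy_edit(code: str, instruction: str, feedback: str) -> str:
    # Without feedback only the first red shape is recolored; with feedback, all are.
    count = -1 if feedback else 1
    return code.replace("red", "blue", count)


def toy_verify(code: str, instruction: str) -> Verdict:
    if "red" in code:
        return Verdict(False, "Some shapes are still red.")
    return Verdict(True, "")


source = r"\draw[red] (0,0) circle (1); \draw[red] (2,0) circle (1);"
final_code, trace = refine(source, "Make every circle blue", toy_edit, toy_verify)
```

Here the toy verifier happens to be exact. An imperfect verifier could return `accepted=True` while a shape is still red, which would end the loop early with an incomplete edit.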
Key findings include:

| Finding | Result |
|---|---|
| Verifier F1-score | Up to 0.815 |
| Improvement for Qwen3-vl-30b-a3b-Instruct | +11 to +20 perfect customizations |
| Improvement for Gemini-3 | +5 perfect customizations |
| Benefit of accurate verification for strong models | Prevents premature acceptance of incomplete edits |
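For readers unfamiliar with the F1-score, the sketch below shows how such a figure could be computed by comparing verifier verdicts with human annotations. It assumes F1 is taken over the "instruction applied" class; the study's exact evaluation protocol is not described here, and the labels are made up purely for illustration.

```python
from typing import Sequence


def f1_score(predicted: Sequence[bool], actual: Sequence[bool]) -> float:
    """F1 of the positive class ("instruction applied")."""
    pairs = list(zip(predicted, actual))
    tp = sum(p and a for p, a in pairs)      # verifier and human both say applied
    fp = sum(p and not a for p, a in pairs)  # verifier accepts, human disagrees
    fn = sum(a and not p for p, a in pairs)  # verifier rejects a correct edit
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up labels for illustration only; these are not data from the study.
verifier_says_applied = [True, True, False, True, False, True]
human_says_applied = [True, False, False, True, True, True]

score = f1_score(verifier_says_applied, human_says_applied)
print(f"F1 = {score:.3f}")
```

In this toy case the false positive at the second position is a premature acceptance, and the false negative at the fifth position would trigger an unnecessary extra refinement round.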
The study used TikZ as a case study because it isolates core difficulties: weak code structure, fine-grained visual semantics, and difficult feature localization. The researchers found that feedback is effective only when it precisely identifies image issues, provides actionable guidance, addresses all relevant problems, and remains grounded in the original instruction.
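The four conditions for effective feedback can be written down as a simple checklist. The class and field names below are hypothetical and only restate the criteria from the paragraph above; how each condition would be judged in practice (for example, by a human annotator) is left open.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass(frozen=True)
class FeedbackCheck:
    identifies_issue_precisely: bool
    gives_actionable_guidance: bool
    covers_all_issues: bool
    grounded_in_instruction: bool

    def is_effective(self) -> bool:
        # All four conditions must hold at once.
        return all(getattr(self, f.name) for f in fields(self))

    def missing(self) -> List[str]:
        # Names of the conditions this piece of feedback fails.
        return [f.name for f in fields(self) if not getattr(self, f.name)]


# Example: feedback that is precise and actionable but overlooks one issue.
check = FeedbackCheck(
    identifies_issue_precisely=True,
    gives_actionable_guidance=True,
    covers_all_issues=False,
    grounded_in_instruction=True,
)
print(check.is_effective(), check.missing())
```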
Stronger models such as Gemini-3 saw smaller absolute gains (+5 perfect customizations) than weaker models, but they benefited more from accurate verification, which prevented premature acceptance of incomplete edits. For the weaker model Qwen3-vl-30b-a3b-Instruct, refinement with imperfect verifiers added between 11 and 20 perfect customizations.
The study's authors are Charly Reux, Mathieu Acher, Djamel Eddine Khelladi, Clément Quinton, and Olivier Barais. They emphasize that even imperfect verifiers can determine with moderate accuracy whether a visual instruction has been applied to the code.
For enterprise technology leaders dealing with automated documentation or graphics generation—such as supply chain diagrams or product illustrations—this research suggests that imperfect verification can still be a practical tool. Instead of requiring perfect automated checks, organizations can leverage iterative refinement with fallible verifiers to improve code customization outcomes, especially when using less capable models.
The paper "Imperfect Visual Verification for Code Edition: A Case Study on TikZ" is available on arXiv. Its findings indicate that verifiers, even when fallible, can significantly boost the effectiveness of LLM-based code editing for visual programs.