VisionDirector: Vision-Language Guided Closed-Loop Refinement for Generative Image Synthesis

Meng Chu1, Senqiao Yang2, Haoxuan Che3*†, Suiyun Zhang3, Xichen Zhang1, Shaozuo Yu2, Haokun Gui1, Zhefan Rao1, Dandan Tu3, Rui Liu3*, Jiaya Jia1
1HKUST, 2CUHK, 3Huawei Research
*Corresponding authors
VisionDirector Teaser

VisionDirector enables automatic step-by-step refinement for both Image Editing and Image Generation tasks, achieving superior multi-goal alignment compared to closed-source models.

Abstract

The field of visual content creation has progressed rapidly with the rise of diffusion models. However, professional image creation often relies on long, multi-goal instructions specifying global composition, local object placement, typography, and stylistic constraints. While modern diffusion models achieve strong visual fidelity, they frequently fail to satisfy such tightly coupled objectives.

To systematically expose this gap, we introduce LongGoalBench (LGBench), a dual-modality benchmark comprising 2,000 tasks (1,000 T2I + 1,000 I2I) with more than 29,000 annotated goals and automated goal-level verification. LGBench stresses multi-attribute alignment rather than simple prompt fidelity, with each task requiring satisfaction of 10–23 quantitative goals.

We further propose VisionDirector, a training-free, vision-language guided closed-loop framework that decomposes long instructions into structured goals, dynamically plans generation or editing actions, and verifies goal satisfaction after each step. It achieves a 30% improvement on GenEval and a 60% improvement on ComplexBench over state-of-the-art methods.
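
A minimal sketch of that closed loop is shown below, assuming the decomposer, planner, executor, and verifier are passed in as callables. The function names and signatures here are illustrative assumptions, not the paper's API.

# Minimal sketch of the decompose -> plan -> act -> verify loop.
# `decompose`, `plan_step`, `apply_action`, and `verify` stand in for
# VLM and diffusion-model calls; they are assumptions, not the paper's API.
def closed_loop_refine(instruction, decompose, plan_step, apply_action, verify,
                       image=None, max_steps=8):
    goals = decompose(instruction)                 # long prompt -> structured goals
    best_image, best_score = image, -1.0
    for _ in range(max_steps):
        action = plan_step(goals, best_image)      # choose a generation or editing action
        candidate = apply_action(action, best_image)
        verdicts = verify(goals, candidate)        # one pass/fail verdict per goal
        score = sum(verdicts) / len(verdicts)
        if score > best_score:                     # accept the step ...
            best_image, best_score = candidate, score
        # ... otherwise discard it, i.e. roll back to the previous best result
        if all(verdicts):
            break                                  # every goal satisfied: stop early
    return best_image

Because each step is accepted only when the goal-level score improves, the loop degrades gracefully: in the worst case it returns the best intermediate result rather than a regressed one.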

DIRECTOR: Describe, Inspect, Revise, Edit, Converge Toward Optimal Rendering

LGBench: LongGoal Benchmark

A director-style stress test for multi-goal image generation and editing

  • 2,000 total tasks (1K T2I + 1K I2I)
  • 29,000+ annotated goals
  • 18.0 average goals per T2I prompt
  • 418 T2I subcategories

LGBench Construction

Goal Type Distribution:

  • T2I: Additive objects (31.8%), Textual overlays (16.8%), Visual effects (16.6%), Color constraints (11.8%), Lighting (11.5%)
  • I2I: Effects (34.4%), Color grading (21.5%), Typography (17.5%), Lighting refinement (15.8%)
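
For concreteness, the record below sketches what a single LGBench task could look like once flattened into a verification-ready form. The field names, goal entries, and checker are assumptions made for illustration; they are not the released LGBench schema.

# Hypothetical sketch of one LGBench task record and its goal-level scoring.
# Field names and goal specs are invented for illustration only.
task = {
    "task_id": "t2i-000123",
    "modality": "T2I",                      # "I2I" tasks would also carry a source image
    "instruction": "A rain-soaked neon street at dusk; add a red umbrella ...",
    "goals": [
        {"type": "additive_object",  "spec": "red umbrella in the lower-left third"},
        {"type": "textual_overlay",  "spec": "storefront sign reading 'OPEN 24H'"},
        {"type": "color_constraint", "spec": "dominant teal-and-magenta palette"},
        {"type": "lighting",         "spec": "dusk key light with wet reflections"},
        # ... each LGBench task carries 10-23 such goals
    ],
}

def goal_level_score(verdicts):
    """Fraction of goals judged satisfied (verdicts: list of booleans)."""
    return sum(verdicts) / len(verdicts)

print(goal_level_score([True, True, False, True]))  # 0.75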

Comparison with Existing Benchmarks

Benchmark             | Modalities | Prompt Complexity | Goals per Task | Scale
LGBench (Ours)        | T2I + I2I  | Long-chain        | 10–23          | 2,000
DrawBench / VBench    | T2I        | Single sentence   | 1              | 200–300
TIFA / GenEval        | T2I        | Short + QA        | 1–2            | 500–1,000
MagicBrush / EditEval | I2I        | Short directive   | Few            | <500

Framework Overview

VisionDirector introduces a director-style vision–language agent that decomposes long, multi-goal instructions into structured goals and supervises generation and editing through a closed-loop process with verification and rollback. The framework operates without retraining diffusion backbones, leveraging the strong perceptual capabilities of VLMs.

Adaptive Planning Behavior

VLM Decision Pattern
VisionDirector adaptively switches between one-shot generation and staged refinement as instruction complexity increases, demonstrating rational planning behavior. The VLM dynamically decides when to iterate based on goal verification feedback.
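
A toy illustration of that behavior, assuming a simple goal-count heuristic and a fixed step budget in place of the VLM's actual judgment:

# Toy illustration only: in VisionDirector the VLM itself decides, from the
# goal list and verification feedback, whether one shot suffices or staged
# refinement is needed; the threshold and budget below are invented for clarity.
def choose_mode(goals, one_shot_limit=6):
    """Few goals: try a single generation pass. Many goals: plan staged edits."""
    return "one-shot" if len(goals) <= one_shot_limit else "staged"

def should_iterate(verdicts, step, max_steps=8):
    """Keep refining while any goal still fails and the step budget remains."""
    return not all(verdicts) and step < max_steps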

Qualitative Results

VisionDirector consistently improves multi-goal adherence on complex generation and editing tasks

Image-to-Image Editing

I2I Demo

Multi-stage editing with comparison against Qwen-Image, Seeddream, GPT-Image, and KONTEXT-MAX

Text-to-Image Generation

T2I Demo

High-fidelity generation with multi-goal alignment

Complex Scene Generation

Complex Demo

Handling extremely detailed instructions: Multidimensional Ballroom, Quantum Archaeology Laboratory, Forest Cathedral

Step-by-Step Refinement

Horse Demo

Artistic Style Transfer

Artifact Demo

BibTeX

@misc{chu2025visiondirector,
  title        = {VisionDirector: Vision-Language Guided Closed-Loop Refinement for Generative Image Synthesis},
  author       = {Chu, Meng and Yang, Senqiao and Che, Haoxuan and Zhang, Suiyun and Zhang, Xichen and Yu, Shaozuo and Gui, Haokun and Rao, Zhefan and Tu, Dandan and Liu, Rui and Jia, Jiaya},
  year         = {2025},
  eprint       = {2512.19243},
  archivePrefix= {arXiv},
  primaryClass = {cs.CV}
}