arXiv:2509.06945

Interleaving Reasoning for Better Text-to-Image Generation

Published on Sep 8 · Submitted by Osilly on Sep 9
Abstract

AI-generated summary: The Interleaving Reasoning Generation (IRG) framework alternates between text-based thinking and image synthesis to improve Text-to-Image generation, achieving state-of-the-art performance and enhanced visual quality.

Unified multimodal understanding and generation models have recently achieved significant improvements in image generation capability, yet a large gap remains in instruction following and detail preservation compared to systems that tightly couple comprehension with generation, such as GPT-4o. Motivated by recent advances in interleaving reasoning, we explore whether such reasoning can further improve Text-to-Image (T2I) generation. We introduce Interleaving Reasoning Generation (IRG), a framework that alternates between text-based thinking and image synthesis: the model first produces text-based thinking to guide an initial image, then reflects on the result to refine fine-grained details, visual quality, and aesthetics while preserving semantics. To train IRG effectively, we propose Interleaving Reasoning Generation Learning (IRGL), which targets two sub-goals: (1) strengthening the initial think-and-generate stage to establish core content and base quality, and (2) enabling high-quality textual reflection and faithful implementation of those refinements in a subsequent image. We curate IRGL-300K, a dataset organized into six decomposed learning modes that jointly cover learning text-based thinking and full thinking-image trajectories. Starting from a unified foundation model that natively emits interleaved text-image outputs, our two-stage training first builds robust thinking and reflection, then efficiently tunes the IRG pipeline on the full thinking-image trajectory data. Extensive experiments show state-of-the-art (SoTA) performance, yielding absolute gains of 5-10 points on GenEval, WISE, TIIF, GenAI-Bench, and OneIG-EN, alongside substantial improvements in visual quality and fine-grained fidelity. The code, model weights, and datasets will be released at https://github.com/Osilly/Interleaving-Reasoning-Generation.
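For concreteness, the think-generate-reflect-refine loop described in the abstract can be sketched as below. This is a minimal illustration only: the `model` object and its `generate_text`/`generate_image` methods are hypothetical stand-ins assumed for this sketch, not the API of the released code.

```python
# Minimal sketch of the IRG inference loop (illustrative only).
# `model` is a hypothetical unified multimodal model; generate_text and
# generate_image are assumed method names, not the paper's released API.

def interleaving_reasoning_generation(model, prompt, num_refinements=1):
    # Stage 1: text-based thinking guides the initial image.
    thought = model.generate_text(
        f"Plan the content and layout of an image for: {prompt}"
    )
    image = model.generate_image(prompt=prompt, reasoning=thought)
    trajectory = [(thought, image)]

    # Stage 2: reflect on the result, then refine fine-grained details,
    # visual quality, and aesthetics while preserving semantics.
    for _ in range(num_refinements):
        reflection = model.generate_text(
            f"Critique this image against the prompt: {prompt}",
            image=image,
        )
        image = model.generate_image(
            prompt=prompt, reasoning=reflection, reference=image
        )
        trajectory.append((reflection, image))

    return image, trajectory
```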

Community

Paper author and submitter:

This paper introduces Interleaving Reasoning Generation (IRG), a framework that alternates between text-based thinking and image synthesis. Extensive experiments show SoTA performance, yielding absolute gains of 5-10 points across benchmarks (GenEval 0.85, WISE 0.77, TIIF-short/long 76.00/73.77 with Qwen2.5-VL-72B as the evaluator, GenAI-Bench 0.84, OneIG-EN 0.415), alongside substantial improvements in visual quality and fine-grained fidelity. As an early exploration, our results demonstrate that interleaving reasoning is a powerful paradigm for advancing T2I generation.

