Interleaving Reasoning for Better Text-to-Image Generation
Abstract
The Interleaving Reasoning Generation (IRG) framework alternates between text-based thinking and image synthesis to improve text-to-image generation, achieving state-of-the-art performance and enhanced visual quality.
Unified multimodal understanding and generation models have recently achieved significant improvements in image generation capability, yet a large gap remains in instruction following and detail preservation compared to systems that tightly couple comprehension with generation, such as GPT-4o. Motivated by recent advances in interleaving reasoning, we explore whether such reasoning can further improve Text-to-Image (T2I) generation. We introduce Interleaving Reasoning Generation (IRG), a framework that alternates between text-based thinking and image synthesis: the model first produces text-based thinking to guide an initial image, then reflects on the result to refine fine-grained details, visual quality, and aesthetics while preserving semantics. To train IRG effectively, we propose Interleaving Reasoning Generation Learning (IRGL), which targets two sub-goals: (1) strengthening the initial think-and-generate stage to establish core content and base quality, and (2) enabling high-quality textual reflection and faithful implementation of those refinements in a subsequent image. We curate IRGL-300K, a dataset organized into six decomposed learning modes that jointly cover learning text-based thinking and full thinking-image trajectories. Starting from a unified foundation model that natively emits interleaved text-image outputs, our two-stage training first builds robust thinking and reflection, then efficiently tunes the IRG pipeline on the full thinking-image trajectory data. Extensive experiments show state-of-the-art (SoTA) performance, yielding absolute gains of 5-10 points on GenEval, WISE, TIIF, GenAI-Bench, and OneIG-EN, alongside substantial improvements in visual quality and fine-grained fidelity. The code, model weights, and datasets will be released at: https://github.com/Osilly/Interleaving-Reasoning-Generation .
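To make the described pipeline concrete, below is a minimal sketch of what an IRG inference loop might look like. This is an illustration only: the `UnifiedModel` interface, its `generate_text`/`generate_image` methods, the context tuples, and the single reflection round are assumptions for exposition, not the released API or the paper's exact procedure.

```python
# Hypothetical sketch of the IRG think -> generate -> reflect -> regenerate loop.
# All names (UnifiedModel, IRGTrace, generate_text, generate_image) are illustrative
# stand-ins and do not correspond to the released code.
from dataclasses import dataclass, field
from typing import Any, List


@dataclass
class IRGTrace:
    """Interleaved text-image trajectory produced by one IRG run."""
    prompt: str
    thinking: str = ""
    reflections: List[str] = field(default_factory=list)
    images: List[Any] = field(default_factory=list)


class UnifiedModel:
    """Placeholder for a unified model that natively emits interleaved text/image outputs."""

    def generate_text(self, context: list) -> str:
        raise NotImplementedError

    def generate_image(self, context: list) -> Any:
        raise NotImplementedError


def interleaving_reasoning_generation(model: UnifiedModel, prompt: str, rounds: int = 1) -> IRGTrace:
    trace = IRGTrace(prompt=prompt)

    # Stage 1: text-based thinking that plans the core content of the image.
    trace.thinking = model.generate_text([("user", prompt), ("instruction", "think")])

    # Initial image conditioned on the prompt and the thinking.
    image = model.generate_image([("user", prompt), ("thinking", trace.thinking)])
    trace.images.append(image)

    # Stage 2: reflect on the current image, then regenerate with refinements that
    # improve fine-grained details, visual quality, and aesthetics while preserving semantics.
    for _ in range(rounds):
        reflection = model.generate_text(
            [("user", prompt), ("thinking", trace.thinking), ("image", image), ("instruction", "reflect")]
        )
        trace.reflections.append(reflection)
        image = model.generate_image(
            [("user", prompt), ("thinking", trace.thinking), ("image", image), ("reflection", reflection)]
        )
        trace.images.append(image)

    return trace
```

Under these assumptions, the final element of `trace.images` is the refined output, while the full trace mirrors the thinking-image trajectories used in IRGL training.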
Community
This paper introduces Interleaving Reasoning Generation (IRG), a framework that alternates between text-based thinking and image synthesis. Extensive experiments show SoTA performance, yielding absolute gains of 5-10 points on various benchmarks (GenEval 0.85, WISE 0.77, TIIF-short/long 76.00/73.77 with QwenVL2.5-72B as the evaluator, GenAI-Bench 0.84, OneIG-EN 0.415), alongside substantial improvements in visual quality and fine-grained fidelity. As an early exploration, our results demonstrate that interleaving reasoning is a powerful paradigm for advancing T2I.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Visual-CoG: Stage-Aware Reinforcement Learning with Chain of Guidance for Text-to-Image Generation (2025)
- MENTOR: Efficient Multimodal-Conditioned Tuning for Autoregressive Vision Generation Models (2025)
- Draw-In-Mind: Learning Precise Image Editing via Chain-of-Thought Imagination (2025)
- MM-R1: Unleashing the Power of Unified Multimodal Large Language Models for Personalized Image Generation (2025)
- UniLiP: Adapting CLIP for Unified Multimodal Understanding, Generation and Editing (2025)
- Easier Painting Than Thinking: Can Text-to-Image Models Set the Stage, but Not Direct the Play? (2025)
- DeCoT: Decomposing Complex Instructions for Enhanced Text-to-Image Generation with Large Language Models (2025)