Interleaving Reasoning for Better Text-to-Image Generation
Abstract
The Interleaving Reasoning Generation (IRG) framework alternates between text-based thinking and image synthesis to improve text-to-image generation, achieving state-of-the-art performance and enhanced visual quality.
Unified multimodal understanding and generation models have recently achieved significant improvements in image generation capability, yet a large gap remains in instruction following and detail preservation compared to systems that tightly couple comprehension with generation, such as GPT-4o. Motivated by recent advances in interleaving reasoning, we explore whether such reasoning can further improve Text-to-Image (T2I) generation. We introduce Interleaving Reasoning Generation (IRG), a framework that alternates between text-based thinking and image synthesis: the model first produces text-based thinking to guide an initial image, then reflects on the result to refine fine-grained details, visual quality, and aesthetics while preserving semantics. To train IRG effectively, we propose Interleaving Reasoning Generation Learning (IRGL), which targets two sub-goals: (1) strengthening the initial think-and-generate stage to establish core content and base quality, and (2) enabling high-quality textual reflection and faithful implementation of those refinements in a subsequent image. We curate IRGL-300K, a dataset organized into six decomposed learning modes that jointly cover learning text-based thinking and full thinking-image trajectories. Starting from a unified foundation model that natively emits interleaved text-image outputs, our two-stage training first builds robust thinking and reflection, then efficiently tunes the IRG pipeline on the full thinking-image trajectory data. Extensive experiments show state-of-the-art (SoTA) performance, yielding absolute gains of 5-10 points on GenEval, WISE, TIIF, GenAI-Bench, and OneIG-EN, alongside substantial improvements in visual quality and fine-grained fidelity. The code, model weights, and datasets will be released at: https://github.com/Osilly/Interleaving-Reasoning-Generation .
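To make the described pipeline concrete, below is a minimal sketch of what an IRG inference loop might look like. This is an illustration only: the `UnifiedModel` interface, its `generate_text`/`generate_image` methods, the context tuples, and the single reflection round are assumptions for exposition, not the released API or the paper's exact procedure.

```python
# Hypothetical sketch of the IRG think -> generate -> reflect -> regenerate loop.
# All names (UnifiedModel, IRGTrace, generate_text, generate_image) are illustrative
# stand-ins and do not correspond to the released code.
from dataclasses import dataclass, field
from typing import Any, List


@dataclass
class IRGTrace:
    """Interleaved text-image trajectory produced by one IRG run."""
    prompt: str
    thinking: str = ""
    reflections: List[str] = field(default_factory=list)
    images: List[Any] = field(default_factory=list)


class UnifiedModel:
    """Placeholder for a unified model that natively emits interleaved text/image outputs."""

    def generate_text(self, context: list) -> str:
        raise NotImplementedError

    def generate_image(self, context: list) -> Any:
        raise NotImplementedError


def interleaving_reasoning_generation(model: UnifiedModel, prompt: str, rounds: int = 1) -> IRGTrace:
    trace = IRGTrace(prompt=prompt)

    # Stage 1: text-based thinking that plans the core content of the image.
    trace.thinking = model.generate_text([("user", prompt), ("instruction", "think")])

    # Initial image conditioned on the prompt and the thinking.
    image = model.generate_image([("user", prompt), ("thinking", trace.thinking)])
    trace.images.append(image)

    # Stage 2: reflect on the current image, then regenerate with refinements that
    # improve fine-grained details, visual quality, and aesthetics while preserving semantics.
    for _ in range(rounds):
        reflection = model.generate_text(
            [("user", prompt), ("thinking", trace.thinking), ("image", image), ("instruction", "reflect")]
        )
        trace.reflections.append(reflection)
        image = model.generate_image(
            [("user", prompt), ("thinking", trace.thinking), ("image", image), ("reflection", reflection)]
        )
        trace.images.append(image)

    return trace
```

Under these assumptions, the final element of `trace.images` is the refined output, while the full trace mirrors the thinking-image trajectories used in IRGL training.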
Community
This paper introduces Interleaving Reasoning Generation (IRG), a framework that alternates between text-based thinking and image synthesis. Extensive experiments show SoTA performance, yielding absolute gains of 5-10 points on various benchmarks (GenEval 0.85, WISE 0.77, TIIF-short/long 76.00/73.77 with QwenVL2.5-72B as the evaluator, GenAI-Bench 0.84, OneIG-EN 0.415), alongside substantial improvements in visual quality and fine-grained fidelity. As an early exploration, our results demonstrate that interleaving reasoning is a powerful paradigm for advancing T2I.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Visual-CoG: Stage-Aware Reinforcement Learning with Chain of Guidance for Text-to-Image Generation (2025)
- MENTOR: Efficient Multimodal-Conditioned Tuning for Autoregressive Vision Generation Models (2025)
- Draw-In-Mind: Learning Precise Image Editing via Chain-of-Thought Imagination (2025)
- MM-R1: Unleashing the Power of Unified Multimodal Large Language Models for Personalized Image Generation (2025)
- UniLiP: Adapting CLIP for Unified Multimodal Understanding, Generation and Editing (2025)
- Easier Painting Than Thinking: Can Text-to-Image Models Set the Stage, but Not Direct the Play? (2025)
- DeCoT: Decomposing Complex Instructions for Enhanced Text-to-Image Generation with Large Language Models (2025)