arxiv:2509.03516

Easier Painting Than Thinking: Can Text-to-Image Models Set the Stage, but Not Direct the Play?

Published on Sep 3 · Submitted by lioooox on Sep 9
Authors: Ouxiang Li, Yuan Wang, Xinting Hu, Huijuan Huang, Rui Chen, Jiarong Ou, Xin Tao, Pengfei Wan, Fuli Feng
Abstract

T2I-CoReBench is a benchmark that evaluates the composition and reasoning capabilities of text-to-image models using a comprehensive and complex set of prompts and checklist questions.

AI-generated summary

Text-to-image (T2I) generation aims to synthesize images from textual prompts, which jointly specify what must be shown and imply what can be inferred, corresponding to two core capabilities: composition and reasoning. However, as T2I models advance in reasoning beyond composition, existing benchmarks show clear limitations in providing comprehensive evaluation across and within these capabilities. Meanwhile, these advances also enable models to handle more complex prompts, whereas current benchmarks remain limited to low scene density and simplified one-to-one reasoning. To address these limitations, we propose T2I-CoReBench, a comprehensive and complex benchmark that evaluates both the composition and reasoning capabilities of T2I models. To ensure comprehensiveness, we structure composition around scene-graph elements (instance, attribute, and relation) and reasoning around the philosophical framework of inference (deductive, inductive, and abductive), yielding a 12-dimensional evaluation taxonomy. To increase complexity, motivated by the inherent complexity of real-world scenarios, we curate each prompt with high compositional density for composition and multi-step inference for reasoning. We also pair each prompt with a checklist of individual yes/no questions, each assessing one intended element independently, to enable fine-grained and reliable evaluation. In total, the benchmark comprises 1,080 challenging prompts and around 13,500 checklist questions. Experiments across 27 current T2I models reveal that composition capability remains limited in complex, high-density scenarios, while reasoning lags even further behind as a critical bottleneck, with all models struggling to infer implicit elements from prompts. Project page: https://t2i-corebench.github.io/.
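For intuition, here is a minimal sketch of the checklist-style protocol the abstract describes: each prompt is paired with independent yes/no questions, and a sample's score is the fraction answered "yes". This is an illustrative reading, not the paper's released evaluator; the `answer_yes_no` judge and all other names are hypothetical.

```python
# Minimal sketch of checklist-based scoring (assumed reading of the
# abstract). Every name here is hypothetical, not T2I-CoReBench's API.
from dataclasses import dataclass


@dataclass
class Sample:
    prompt: str           # T2I prompt (benchmark average: ~170 tokens)
    dimension: str        # one of the 12 taxonomy dimensions
    checklist: list[str]  # independent yes/no questions for this prompt


def answer_yes_no(image, question: str) -> bool:
    """Hypothetical judge (e.g., a vision-language model) returning True
    iff the generated image satisfies the questioned element."""
    raise NotImplementedError


def score_sample(image, sample: Sample) -> float:
    """Fraction of checklist questions answered 'yes' for one image."""
    answers = [answer_yes_no(image, q) for q in sample.checklist]
    return sum(answers) / len(answers)


def score_dimension(results: list[tuple[object, Sample]]) -> float:
    """Mean per-sample score within one taxonomy dimension."""
    return sum(score_sample(img, s) for img, s in results) / len(results)
```

Because each question targets a single element, a failure localizes to the exact instance, attribute, relation, or inferred element the model missed, which is what makes the evaluation fine-grained.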

Community


Ouxiang Li¹*, Yuan Wang¹, Xinting Hu²†, Huijuan Huang³‡, Rui Chen³, Jiarong Ou³, Xin Tao³†, Pengfei Wan³, Fuli Feng¹

¹University of Science and Technology of China, ²Nanyang Technological University, ³Kuaishou Technology

*Work done during internship at KwaiVGI, Kuaishou Technology. †Corresponding authors. ‡Project leader.

Figure: Statistics of T2I-CoReBench. Left: our T2I evaluation taxonomy spanning two fundamental generative capabilities (composition and reasoning), further refined into 12 dimensions. Right: distributions of prompt-token lengths and checklist-question counts. The benchmark is highly complex, with an average prompt length of 170 tokens and an average of 12.5 checklist questions per prompt. Note: reasoning prompts carry fewer questions, since each question demands more challenging inference.
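To make the taxonomy concrete and sanity-check the headline statistics, here is a small sketch. It uses only the six category names given in the abstract (the paper's 12 leaf dimensions are not reproduced here), and the numbers are those reported above.

```python
# Top-level structure of the evaluation taxonomy, using only the six
# category names from the abstract; the paper refines these into 12
# leaf dimensions, which are not listed here.
TAXONOMY = {
    "composition": ["instance", "attribute", "relation"],  # scene-graph elements
    "reasoning": ["deductive", "inductive", "abductive"],  # inference types
}

# Headline statistics reported for the benchmark: ~13,500 checklist
# questions over 1,080 prompts gives the stated 12.5-question average.
NUM_PROMPTS = 1_080
NUM_QUESTIONS = 13_500
print(NUM_QUESTIONS / NUM_PROMPTS)  # -> 12.5
```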

