Does Data Scaling Lead to Visual Compositional Generalization? Paper • 2507.07102 • Published Jul 9 • 1
OmniPart: Part-Aware 3D Generation with Semantic Decoupling and Structural Cohesion Paper • 2507.06165 • Published Jul 8 • 56
High-Resolution Visual Reasoning via Multi-Turn Grounding-Based Reinforcement Learning Paper • 2507.05920 • Published Jul 8 • 11
Is Diversity All You Need for Scalable Robotic Manipulation? Paper • 2507.06219 • Published Jul 8 • 20
AssetOpsBench: Benchmarking AI Agents for Task Automation in Industrial Asset Operations and Maintenance Paper • 2506.03828 • Published Jun 4 • 13
Exploring the Latent Capacity of LLMs for One-Step Text Generation Paper • 2505.21189 • Published May 27 • 61
OmniConsistency: Learning Style-Agnostic Consistency from Paired Stylization Data Paper • 2505.18445 • Published May 24 • 64
MME-Reasoning: A Comprehensive Benchmark for Logical Reasoning in MLLMs Paper • 2505.21327 • Published May 27 • 82
Paper2Poster: Towards Multimodal Poster Automation from Scientific Papers Paper • 2505.21497 • Published May 27 • 107
G1: Bootstrapping Perception and Reasoning Abilities of Vision-Language Model via Reinforcement Learning Paper • 2505.13426 • Published May 19 • 13
SpatialScore: Towards Unified Evaluation for Multimodal Spatial Understanding Paper • 2505.17012 • Published May 22 • 12
Diffusion Classifiers Understand Compositionality, but Conditions Apply Paper • 2505.17955 • Published May 23 • 22
Vision Language Models (Better, Faster, Stronger) Article • By merve and 4 others • May 12 • 524
CLIP Behaves like a Bag-of-Words Model Cross-modally but not Uni-modally Paper • 2502.03566 • Published Feb 5 • 2