Models
Datasets
Spaces
Docs
Enterprise
Pricing
Log In
Sign Up

Collections

Discover the best community collections!

Collections including paper arxiv:2506.03147

EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters

Paper • 2402.04252 • Published Feb 6, 2024 • 28
Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models

Paper • 2402.03749 • Published Feb 6, 2024 • 14
ScreenAI: A Vision-Language Model for UI and Infographics Understanding

Paper • 2402.04615 • Published Feb 7, 2024 • 44
EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss

Paper • 2402.05008 • Published Feb 7, 2024 • 23

Multimodal Model

A curated list for Multimodal Model Generation papers.

OmniGen2: Exploration to Advanced Multimodal Generation

Paper • 2506.18871 • Published Jun 23 • 75
OmniGen: Unified Image Generation

Paper • 2409.11340 • Published Sep 17, 2024 • 115
Show-o Turbo: Towards Accelerated Unified Multimodal Understanding and Generation

Paper • 2502.05415 • Published Feb 8 • 22
Show-o: One Single Transformer to Unify Multimodal Understanding and Generation

Paper • 2408.12528 • Published Aug 22, 2024 • 51

(Urban) World Model

CityGaussianV2: Efficient and Geometrically Accurate Reconstruction for Large-Scale Scenes

Paper • 2411.00771 • Published Nov 1, 2024 • 9
SynCity: Training-Free Generation of 3D Worlds

Paper • 2503.16420 • Published Mar 20 • 27
CityDreamer4D: Compositional Generative Model of Unbounded 4D Cities

Paper • 2501.08983 • Published Jan 15 • 20
Streetscapes: Large-scale Consistent Street View Generation Using Autoregressive Video Diffusion

Paper • 2407.13759 • Published Jul 18, 2024 • 18

CoLLM: A Large Language Model for Composed Image Retrieval

Paper • 2503.19910 • Published Mar 25 • 15
LOCATEdit: Graph Laplacian Optimized Cross Attention for Localized Text-Guided Image Editing

Paper • 2503.21541 • Published Mar 27 • 1
HumanDreamer-X: Photorealistic Single-image Human Avatars Reconstruction via Gaussian Restoration

Paper • 2504.03536 • Published Apr 4 • 12
FantasyTalking: Realistic Talking Portrait Generation via Coherent Motion Synthesis

Paper • 2504.04842 • Published Apr 7 • 35

iVideoGPT: Interactive VideoGPTs are Scalable World Models

Paper • 2405.15223 • Published May 24, 2024 • 17
Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models

Paper • 2405.15574 • Published May 24, 2024 • 55
An Introduction to Vision-Language Modeling

Paper • 2405.17247 • Published May 27, 2024 • 90
Matryoshka Multimodal Models

Paper • 2405.17430 • Published May 27, 2024 • 34

Robust Multimodal Large Language Models Against Modality Conflict

Paper • 2507.07151 • Published Jul 9 • 5
One Token to Fool LLM-as-a-Judge

Paper • 2507.08794 • Published Jul 11 • 31
Test-Time Scaling with Reflective Generative Model

Paper • 2507.01951 • Published Jul 2 • 107
KV Cache Steering for Inducing Reasoning in Small Language Models

Paper • 2507.08799 • Published Jul 11 • 40

about 2 hours ago

Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models

Paper • 2506.05176 • Published Jun 5 • 72
Advancing Multimodal Reasoning: From Optimized Cold Start to Staged Reinforcement Learning

Paper • 2506.04207 • Published Jun 4 • 47
MiMo-VL Technical Report

Paper • 2506.03569 • Published Jun 4 • 80
UniWorld: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation

Paper • 2506.03147 • Published Jun 3 • 58

Interesting Papers

ReZero: Enhancing LLM search ability by trying one-more-time

Paper • 2504.11001 • Published Apr 15 • 15
FonTS: Text Rendering with Typography and Style Controls

Paper • 2412.00136 • Published Nov 28, 2024 • 1
GenEx: Generating an Explorable World

Paper • 2412.09624 • Published Dec 12, 2024 • 97
Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference

Paper • 2412.13663 • Published Dec 18, 2024 • 155

about 5 hours ago

LinFusion: 1 GPU, 1 Minute, 16K Image

Paper • 2409.02097 • Published Sep 3, 2024 • 34
Phidias: A Generative Model for Creating 3D Content from Text, Image, and 3D Conditions with Reference-Augmented Diffusion

Paper • 2409.11406 • Published Sep 17, 2024 • 27
Diffusion Models Are Real-Time Game Engines

Paper • 2408.14837 • Published Aug 27, 2024 • 126
Segment Anything with Multiple Modalities

Paper • 2408.09085 • Published Aug 17, 2024 • 22

EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters

Paper • 2402.04252 • Published Feb 6, 2024 • 28
Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models

Paper • 2402.03749 • Published Feb 6, 2024 • 14
ScreenAI: A Vision-Language Model for UI and Infographics Understanding

Paper • 2402.04615 • Published Feb 7, 2024 • 44
EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss

Paper • 2402.05008 • Published Feb 7, 2024 • 23

Robust Multimodal Large Language Models Against Modality Conflict

Paper • 2507.07151 • Published Jul 9 • 5
One Token to Fool LLM-as-a-Judge

Paper • 2507.08794 • Published Jul 11 • 31
Test-Time Scaling with Reflective Generative Model

Paper • 2507.01951 • Published Jul 2 • 107
KV Cache Steering for Inducing Reasoning in Small Language Models

Paper • 2507.08799 • Published Jul 11 • 40

Multimodal Model

A curated list for Multimodal Model Generation papers.

OmniGen2: Exploration to Advanced Multimodal Generation

Paper • 2506.18871 • Published Jun 23 • 75
OmniGen: Unified Image Generation

Paper • 2409.11340 • Published Sep 17, 2024 • 115
Show-o Turbo: Towards Accelerated Unified Multimodal Understanding and Generation

Paper • 2502.05415 • Published Feb 8 • 22
Show-o: One Single Transformer to Unify Multimodal Understanding and Generation

Paper • 2408.12528 • Published Aug 22, 2024 • 51

about 2 hours ago

Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models

Paper • 2506.05176 • Published Jun 5 • 72
Advancing Multimodal Reasoning: From Optimized Cold Start to Staged Reinforcement Learning

Paper • 2506.04207 • Published Jun 4 • 47
MiMo-VL Technical Report

Paper • 2506.03569 • Published Jun 4 • 80
UniWorld: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation

Paper • 2506.03147 • Published Jun 3 • 58

(Urban) World Model

CityGaussianV2: Efficient and Geometrically Accurate Reconstruction for Large-Scale Scenes

Paper • 2411.00771 • Published Nov 1, 2024 • 9
SynCity: Training-Free Generation of 3D Worlds

Paper • 2503.16420 • Published Mar 20 • 27
CityDreamer4D: Compositional Generative Model of Unbounded 4D Cities

Paper • 2501.08983 • Published Jan 15 • 20
Streetscapes: Large-scale Consistent Street View Generation Using Autoregressive Video Diffusion

Paper • 2407.13759 • Published Jul 18, 2024 • 18

Interesting Papers

ReZero: Enhancing LLM search ability by trying one-more-time

Paper • 2504.11001 • Published Apr 15 • 15
FonTS: Text Rendering with Typography and Style Controls

Paper • 2412.00136 • Published Nov 28, 2024 • 1
GenEx: Generating an Explorable World

Paper • 2412.09624 • Published Dec 12, 2024 • 97
Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference

Paper • 2412.13663 • Published Dec 18, 2024 • 155

CoLLM: A Large Language Model for Composed Image Retrieval

Paper • 2503.19910 • Published Mar 25 • 15
LOCATEdit: Graph Laplacian Optimized Cross Attention for Localized Text-Guided Image Editing

Paper • 2503.21541 • Published Mar 27 • 1
HumanDreamer-X: Photorealistic Single-image Human Avatars Reconstruction via Gaussian Restoration

Paper • 2504.03536 • Published Apr 4 • 12
FantasyTalking: Realistic Talking Portrait Generation via Coherent Motion Synthesis

Paper • 2504.04842 • Published Apr 7 • 35

about 5 hours ago

LinFusion: 1 GPU, 1 Minute, 16K Image

Paper • 2409.02097 • Published Sep 3, 2024 • 34
Phidias: A Generative Model for Creating 3D Content from Text, Image, and 3D Conditions with Reference-Augmented Diffusion

Paper • 2409.11406 • Published Sep 17, 2024 • 27
Diffusion Models Are Real-Time Game Engines

Paper • 2408.14837 • Published Aug 27, 2024 • 126
Segment Anything with Multiple Modalities

Paper • 2408.09085 • Published Aug 17, 2024 • 22

iVideoGPT: Interactive VideoGPTs are Scalable World Models

Paper • 2405.15223 • Published May 24, 2024 • 17
Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models

Paper • 2405.15574 • Published May 24, 2024 • 55
An Introduction to Vision-Language Modeling

Paper • 2405.17247 • Published May 27, 2024 • 90
Matryoshka Multimodal Models

Paper • 2405.17430 • Published May 27, 2024 • 34

Company

TOS Privacy About Jobs

Website

Models Datasets Spaces Pricing Docs