Collections
Collections including paper arxiv:2509.06951

- Describe What You See with Multimodal Large Language Models to Enhance Video Recommendations
  Paper • 2508.09789 • Published • 5
- MM-BrowseComp: A Comprehensive Benchmark for Multimodal Browsing Agents
  Paper • 2508.13186 • Published • 17
- ZARA: Zero-shot Motion Time-Series Analysis via Knowledge and Retrieval Driven LLM Agents
  Paper • 2508.04038 • Published • 1
- Prompt Orchestration Markup Language
  Paper • 2508.13948 • Published • 48

- A Survey on Vision-Language-Action Models: An Action Tokenization Perspective
  Paper • 2507.01925 • Published • 37
- DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge
  Paper • 2507.04447 • Published • 43
- A Survey on Vision-Language-Action Models for Autonomous Driving
  Paper • 2506.24044 • Published • 14
- EmbRACE-3K: Embodied Reasoning and Action in Complex Environments
  Paper • 2507.10548 • Published • 36

- Unified Vision-Language-Action Model
  Paper • 2506.19850 • Published • 27
- SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics
  Paper • 2506.01844 • Published • 131
- 3D-VLA: A 3D Vision-Language-Action Generative World Model
  Paper • 2403.09631 • Published • 10
- QUAR-VLA: Vision-Language-Action Model for Quadruped Robots
  Paper • 2312.14457 • Published • 1

- Personalize Anything for Free with Diffusion Transformer
  Paper • 2503.12590 • Published • 43
- R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization
  Paper • 2503.12937 • Published • 30
- Exploring the Vulnerabilities of Federated Learning: A Deep Dive into Gradient Inversion Attacks
  Paper • 2503.11514 • Published • 18
- Agentic Reward Modeling: Integrating Human Preferences with Verifiable Correctness Signals for Reliable Reward Systems
  Paper • 2502.19328 • Published • 22

- Reconstruction Alignment Improves Unified Multimodal Models
  Paper • 2509.07295 • Published • 37
- F1: A Vision-Language-Action Model Bridging Understanding and Generation to Actions
  Paper • 2509.06951 • Published • 26
- UMO: Scaling Multi-Identity Consistency for Image Customization via Matching Reward
  Paper • 2509.06818 • Published • 27
- UniVerse-1: Unified Audio-Video Generation via Stitching of Experts
  Paper • 2509.06155 • Published • 13

- Reinforcement Learning in Vision: A Survey
  Paper • 2508.08189 • Published • 27
- Pixie: Fast and Generalizable Supervised Learning of 3D Physics from Pixels
  Paper • 2508.17437 • Published • 35
- Mixture of Global and Local Experts with Diffusion Transformer for Controllable Face Generation
  Paper • 2509.00428 • Published • 17
- F1: A Vision-Language-Action Model Bridging Understanding and Generation to Actions
  Paper • 2509.06951 • Published • 26

- A Survey on Vision-Language-Action Models: An Action Tokenization Perspective
  Paper • 2507.01925 • Published • 37
- Zebra-CoT: A Dataset for Interleaved Vision Language Reasoning
  Paper • 2507.16746 • Published • 34
- MolmoAct: Action Reasoning Models that can Reason in Space
  Paper • 2508.07917 • Published • 42
- Discrete Diffusion VLA: Bringing Discrete Diffusion to Action Decoding in Vision-Language-Action Policies
  Paper • 2508.20072 • Published • 29

- SuperWriter: Reflection-Driven Long-Form Generation with Large Language Models
  Paper • 2506.04180 • Published • 33
- AniMaker: Automated Multi-Agent Animated Storytelling with MCTS-Driven Clip Generation
  Paper • 2506.10540 • Published • 37
- AutoMind: Adaptive Knowledgeable Agent for Automated Data Science
  Paper • 2506.10974 • Published • 18
- SPAR: Scholar Paper Retrieval with LLM-based Agents for Enhanced Academic Search
  Paper • 2507.15245 • Published • 11

- facebook/w2v-bert-2.0
  Feature Extraction • 0.6B • Updated • 207k • 180
- facebook/metaclip-h14-fullcc2.5b
  Zero-Shot Image Classification • 1.0B • Updated • 43.1k • 44
- openai/clip-vit-large-patch14
  Zero-Shot Image Classification • 0.4B • Updated • 8.36M • 1.86k
- Salesforce/blip-image-captioning-large
  Image-to-Text • 0.5B • Updated • 1.22M • 1.4k