btjhjeon's Collections
Multimodal Agent
Gemini Robotics: Bringing AI into the Physical World
Paper • arXiv:2503.20020 • 28 upvotes

Magma: A Foundation Model for Multimodal AI Agents
Paper • arXiv:2502.13130 • 58 upvotes

LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents
Paper • arXiv:2311.05437 • 51 upvotes

OS-ATLAS: A Foundation Action Model for Generalist GUI Agents
Paper • arXiv:2410.23218 • 51 upvotes

ShowUI: One Vision-Language-Action Model for GUI Visual Agent
Paper • arXiv:2411.17465 • 89 upvotes

Mobile-Agent-E: Self-Evolving Mobile Assistant for Complex Tasks
Paper • arXiv:2501.11733 • 29 upvotes

Being-0: A Humanoid Robotic Agent with Vision-Language Models and Modular Skills
Paper • arXiv:2503.12533 • 69 upvotes

Embodied-Reasoner: Synergizing Visual Search, Reasoning, and Action for Embodied Interactive Tasks
Paper • arXiv:2503.21696 • 23 upvotes

UI-R1: Enhancing Action Prediction of GUI Agents by Reinforcement Learning
Paper • arXiv:2503.21620 • 63 upvotes

OmniParser for Pure Vision Based GUI Agent
Paper • arXiv:2408.00203 • 26 upvotes

UniVLA: Learning to Act Anywhere with Task-centric Latent Actions
Paper • arXiv:2505.06111 • 25 upvotes

Visual Agentic Reinforcement Fine-Tuning
Paper • arXiv:2505.14246 • 32 upvotes

Robo2VLM: Visual Question Answering from Large-Scale In-the-Wild Robot Manipulation Datasets
Paper • arXiv:2505.15517 • 4 upvotes

Interactive Post-Training for Vision-Language-Action Models
Paper • arXiv:2505.17016 • 6 upvotes

InfantAgent-Next: A Multimodal Generalist Agent for Automated Computer Interaction
Paper • arXiv:2505.10887 • 10 upvotes

Paper2Poster: Towards Multimodal Poster Automation from Scientific Papers
Paper • arXiv:2505.21497 • 108 upvotes

VisualToolAgent (VisTA): A Reinforcement Learning Framework for Visual Tool Selection
Paper • arXiv:2505.20289 • 10 upvotes

Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence
Paper • arXiv:2505.23747 • 68 upvotes

Open CaptchaWorld: A Comprehensive Web-based Platform for Testing and Benchmarking Multimodal LLM Agents
Paper • arXiv:2505.24878 • 23 upvotes

SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics
Paper • arXiv:2506.01844 • 131 upvotes

LoHoVLA: A Unified Vision-Language-Action Model for Long-Horizon Embodied Tasks
Paper • arXiv:2506.00411 • 31 upvotes

VS-Bench: Evaluating VLMs for Strategic Reasoning and Decision-Making in Multi-Agent Environments
Paper • arXiv:2506.02387 • 58 upvotes

GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents
Paper • arXiv:2506.03143 • 52 upvotes

Multimodal DeepResearcher: Generating Text-Chart Interleaved Reports From Scratch with Agentic Framework
Paper • arXiv:2506.02454 • 6 upvotes

SAFE: Multitask Failure Detection for Vision-Language-Action Models
Paper • arXiv:2506.09937 • 9 upvotes

Optimus-3: Towards Generalist Multimodal Minecraft Agents with Scalable Task Experts
Paper • arXiv:2506.10357 • 22 upvotes

VideoDeepResearch: Long Video Understanding With Agentic Tool Using
Paper • arXiv:2506.10821 • 20 upvotes

BridgeVLA: Input-Output Alignment for Efficient 3D Manipulation Learning with Vision-Language Models
Paper • arXiv:2506.07961 • 12 upvotes

EfficientVLA: Training-Free Acceleration and Compression for Vision-Language-Action Models
Paper • arXiv:2506.10100 • 10 upvotes

From Intention to Execution: Probing the Generalization Boundaries of Vision-Language-Action Models
Paper • arXiv:2506.09930 • 8 upvotes

Unified Vision-Language-Action Model
Paper • arXiv:2506.19850 • 27 upvotes

WorldVLA: Towards Autoregressive Action World Model
Paper • arXiv:2506.21539 • 39 upvotes

A Survey on Vision-Language-Action Models: An Action Tokenization Perspective
Paper • arXiv:2507.01925 • 37 upvotes

RoboBrain 2.0 Technical Report
Paper • arXiv:2507.02029 • 30 upvotes

PresentAgent: Multimodal Agent for Presentation Video Generation
Paper • arXiv:2507.04036 • 10 upvotes

A Survey on Vision-Language-Action Models for Autonomous Driving
Paper • arXiv:2506.24044 • 14 upvotes

PyVision: Agentic Vision with Dynamic Tooling
Paper • arXiv:2507.07998 • 31 upvotes

ThinkAct: Vision-Language-Action Reasoning via Reinforced Visual Latent Planning
Paper • arXiv:2507.16815 • 38 upvotes

ScreenCoder: Advancing Visual-to-Code Generation for Front-End Automation via Modular Multimodal Agents
Paper • arXiv:2507.22827 • 98 upvotes

villa-X: Enhancing Latent Action Modeling in Vision-Language-Action Models
Paper • arXiv:2507.23682 • 23 upvotes

Cosmos-Reason1: From Physical Common Sense To Embodied Reasoning
Paper • arXiv:2503.15558 • 51 upvotes

InstructVLA: Vision-Language-Action Instruction Tuning from Understanding to Manipulation
Paper • arXiv:2507.17520 • 14 upvotes

RoboMemory: A Brain-inspired Multi-memory Agentic Framework for Lifelong Learning in Physical Embodied Systems
Paper • arXiv:2508.01415 • 7 upvotes

OS Agents: A Survey on MLLM-based Agents for General Computing Devices Use
Paper • arXiv:2508.04482 • 9 upvotes

MolmoAct: Action Reasoning Models that can Reason in Space
Paper • arXiv:2508.07917 • 41 upvotes

MM-BrowseComp: A Comprehensive Benchmark for Multimodal Browsing Agents
Paper • arXiv:2508.13186 • 17 upvotes

UI-Venus Technical Report: Building High-performance UI Agents with RFT
Paper • arXiv:2508.10833 • 41 upvotes

Seeing, Listening, Remembering, and Reasoning: A Multimodal Agent with Long-Term Memory
Paper • arXiv:2508.09736 • 54 upvotes

WebWatcher: Breaking New Frontier of Vision-Language Deep Research Agent
Paper • arXiv:2508.05748 • 122 upvotes

Mobile-Agent-v3: Foundamental Agents for GUI Automation
Paper • arXiv:2508.15144 • 58 upvotes

Do What? Teaching Vision-Language-Action Models to Reject the Impossible
Paper • arXiv:2508.16292 • 9 upvotes

CogVLA: Cognition-Aligned Vision-Language-Action Model via Instruction-Driven Routing & Sparsification
Paper • arXiv:2508.21046 • 8 upvotes

Mind the Third Eye! Benchmarking Privacy Awareness in MLLM-powered Smartphone Agents
Paper • arXiv:2508.19493 • 11 upvotes

EmbodiedOneVision: Interleaved Vision-Text-Action Pretraining for General Robot Control
Paper • arXiv:2508.21112 • 72 upvotes

UItron: Foundational GUI Agent with Advanced Perception and Planning
Paper • arXiv:2508.21767 • 12 upvotes

Robix: A Unified Model for Robot Interaction, Reasoning and Planning
Paper • arXiv:2509.01106 • 39 upvotes

UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning
Paper • arXiv:2509.02544 • 104 upvotes