muzammal's Collections
Papers to Read
MLLM-as-a-Judge for Image Safety without Human Labeling • arXiv:2501.00192 • 31 upvotes
2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining • arXiv:2501.00958 • 108 upvotes
Xmodel-2 Technical Report • arXiv:2412.19638 • 27 upvotes
HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs • arXiv:2412.18925 • 102 upvotes
CodeElo: Benchmarking Competition-level Code Generation of LLMs with Human-comparable Elo Ratings • arXiv:2501.01257 • 53 upvotes
MiniMax-01: Scaling Foundation Models with Lightning Attention • arXiv:2501.08313 • 289 upvotes
Towards Large Reasoning Models: A Survey of Reinforced Reasoning with Large Language Models • arXiv:2501.09686 • 41 upvotes
PaSa: An LLM Agent for Comprehensive Academic Paper Search • arXiv:2501.10120 • 49 upvotes
GuardReasoner: Towards Reasoning-based LLM Safeguards • arXiv:2501.18492 • 87 upvotes
WILDCHAT-50M: A Deep Dive Into the Role of Synthetic Data in Post-Training • arXiv:2501.18511 • 20 upvotes
LIMO: Less is More for Reasoning • arXiv:2502.03387 • 61 upvotes
Can 1B LLM Surpass 405B LLM? Rethinking Compute-Optimal Test-Time Scaling • arXiv:2502.06703 • 151 upvotes
Expect the Unexpected: FailSafe Long Context QA for Finance • arXiv:2502.06329 • 131 upvotes
TextAtlas5M: A Large-scale Dataset for Dense Text Image Generation • arXiv:2502.07870 • 45 upvotes
LLMs Can Easily Learn to Reason from Demonstrations: Structure, not content, is what matters! • arXiv:2502.07374 • 39 upvotes
Fino1: On the Transferability of Reasoning Enhanced LLMs to Finance • arXiv:2502.08127 • 56 upvotes
BenchMAX: A Comprehensive Multilingual Evaluation Suite for Large Language Models • arXiv:2502.07346 • 54 upvotes
TransMLA: Multi-head Latent Attention Is All You Need • arXiv:2502.07864 • 49 upvotes
Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model • arXiv:2502.10248 • 56 upvotes
SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering? • arXiv:2502.12115 • 45 upvotes
Magma: A Foundation Model for Multimodal AI Agents • arXiv:2502.13130 • 58 upvotes
Qwen2.5-VL Technical Report • arXiv:2502.13923 • 183 upvotes
MLGym: A New Framework and Benchmark for Advancing AI Research Agents • arXiv:2502.14499 • 192 upvotes
SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features • arXiv:2502.14786 • 143 upvotes
S*: Test Time Scaling for Code Generation • arXiv:2502.14382 • 63 upvotes
SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines • arXiv:2502.14739 • 103 upvotes
MM-Eureka: Exploring Visual Aha Moment with Rule-based Large-scale Reinforcement Learning • arXiv:2503.07365 • 60 upvotes
Token-Efficient Long Video Understanding for Multimodal LLMs • arXiv:2503.04130 • 94 upvotes
R1-Zero's "Aha Moment" in Visual Reasoning on a 2B Non-SFT Model • arXiv:2503.05132 • 58 upvotes
Visual-RFT: Visual Reinforcement Fine-Tuning • arXiv:2503.01785 • 78 upvotes
Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs • arXiv:2503.01743 • 86 upvotes
LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through Two-Stage Rule-Based RL • arXiv:2503.07536 • 85 upvotes
Crowdsource, Crawl, or Generate? Creating SEA-VL, a Multicultural Vision-Language Dataset for Southeast Asia • arXiv:2503.07920 • 98 upvotes
Unified Reward Model for Multimodal Understanding and Generation • arXiv:2503.05236 • 123 upvotes
Vamba: Understanding Hour-Long Videos with Hybrid Mamba-Transformers • arXiv:2503.11579 • 20 upvotes
GoT: Unleashing Reasoning Capability of Multimodal Large Language Model for Visual Generation and Editing • arXiv:2503.10639 • 50 upvotes
R1-Onevision: Advancing Generalized Multimodal Reasoning through Cross-Modal Formalization • arXiv:2503.10615 • 17 upvotes
VisualPRM: An Effective Process Reward Model for Multimodal Reasoning • arXiv:2503.10291 • 36 upvotes
MicroVQA: A Multimodal Reasoning Benchmark for Microscopy-Based Scientific Research • arXiv:2503.13399 • 21 upvotes
V-STaR: Benchmarking Video-LLMs on Video Spatio-Temporal Reasoning • arXiv:2503.11495 • 12 upvotes
VideoMind: A Chain-of-LoRA Agent for Long Video Reasoning • arXiv:2503.13444 • 16 upvotes
Creation-MMBench: Assessing Context-Aware Creative Intelligence in MLLM • arXiv:2503.14478 • 47 upvotes
DeepPerception: Advancing R1-like Cognitive Visual Perception in MLLMs for Knowledge-Intensive Visual Grounding • arXiv:2503.12797 • 30 upvotes
DropletVideo: A Dataset and Approach to Explore Integral Spatio-Temporal Consistent Video Generation • arXiv:2503.06053 • 138 upvotes
TULIP: Towards Unified Language-Image Pretraining • arXiv:2503.15485 • 48 upvotes
Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models • arXiv:2503.16419 • 73 upvotes
Video-T1: Test-Time Scaling for Video Generation • arXiv:2503.18942 • 88 upvotes
Video SimpleQA: Towards Factuality Evaluation in Large Video Language Models • arXiv:2503.18923 • 12 upvotes
Reasoning to Learn from Latent Thoughts • arXiv:2503.18866 • 13 upvotes
LEGO-Puzzles: How Good Are MLLMs at Multi-Step Spatial Reasoning? • arXiv:2503.19990 • 34 upvotes
Qwen2.5-Omni Technical Report • arXiv:2503.20215 • 147 upvotes
Scaling Vision Pre-Training to 4K Resolution • arXiv:2503.19903 • 40 upvotes
CoLLM: A Large Language Model for Composed Image Retrieval • arXiv:2503.19910 • 14 upvotes
Exploring Hallucination of Large Multimodal Models in Video Understanding: Benchmark, Analysis and Mitigation • arXiv:2503.19622 • 31 upvotes
Long-Context Autoregressive Video Modeling with Next-Frame Prediction • arXiv:2503.19325 • 72 upvotes
MDocAgent: A Multi-Modal Multi-Agent Framework for Document Understanding • arXiv:2503.13964 • 19 upvotes
Think Twice: Enhancing LLM Reasoning by Scaling Multi-round Test-time Thinking • arXiv:2503.19855 • 26 upvotes
CoMP: Continual Multimodal Pre-training for Vision Foundation Models • arXiv:2503.18931 • 30 upvotes
Defeating Prompt Injections by Design • arXiv:2503.18813 • 20 upvotes
Wan: Open and Advanced Large-Scale Video Generative Models • arXiv:2503.20314 • 50 upvotes
Gemini Robotics: Bringing AI into the Physical World • arXiv:2503.20020 • 25 upvotes
Video-R1: Reinforcing Video Reasoning in MLLMs • arXiv:2503.21776 • 78 upvotes
Large Language Model Agent: A Survey on Methodology, Applications and Challenges • arXiv:2503.21460 • 77 upvotes
ResearchBench: Benchmarking LLMs in Scientific Discovery via Inspiration-Based Task Decomposition • arXiv:2503.21248 • 20 upvotes
Embodied-Reasoner: Synergizing Visual Search, Reasoning, and Action for Embodied Interactive Tasks • arXiv:2503.21696 • 22 upvotes
A Survey of Efficient Reasoning for Large Reasoning Models: Language, Multimodality, and Beyond • arXiv:2503.21614 • 39 upvotes
Your ViT is Secretly an Image Segmentation Model • arXiv:2503.19108 • 21 upvotes
What, How, Where, and How Well? A Survey on Test-Time Scaling in Large Language Models • arXiv:2503.24235 • 53 upvotes
Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model • arXiv:2503.24290 • 63 upvotes
RIG: Synergizing Reasoning and Imagination in End-to-End Generalist Policy • arXiv:2503.24388 • 30 upvotes
Any2Caption: Interpreting Any Condition to Caption for Controllable Video Generation • arXiv:2503.24379 • 75 upvotes
JudgeLRM: Large Reasoning Models as a Judge • arXiv:2504.00050 • 60 upvotes
Exploring the Effect of Reinforcement Learning on Video Understanding: Insights from SEED-Bench-R1 • arXiv:2503.24376 • 38 upvotes
Open-Qwen2VL: Compute-Efficient Pre-Training of Fully-Open Multimodal LLMs on Academic Resources • arXiv:2504.00595 • 35 upvotes
Z1: Efficient Test-time Scaling with Code • arXiv:2504.00810 • 26 upvotes
Harnessing the Reasoning Economy: A Survey of Efficient Reasoning for Large Language Models • arXiv:2503.24377 • 17 upvotes
Improved Visual-Spatial Reasoning via R1-Zero-Like Training • arXiv:2504.00883 • 62 upvotes
Understanding R1-Zero-Like Training: A Critical Perspective • arXiv:2503.20783 • 46 upvotes
PaperBench: Evaluating AI's Ability to Replicate AI Research • arXiv:2504.01848 • 36 upvotes
Advances and Challenges in Foundation Agents: From Brain-Inspired Intelligence to Evolutionary, Collaborative, and Safe Systems • arXiv:2504.01990 • 268 upvotes
GPT-ImgEval: A Comprehensive Benchmark for Diagnosing GPT4o in Image Generation • arXiv:2504.02782 • 56 upvotes
Rethinking RL Scaling for Vision Language Models: A Transparent, From-Scratch Framework and Comprehensive Evaluation Scheme • arXiv:2504.02587 • 30 upvotes
MedSAM2: Segment Anything in 3D Medical Images and Videos • arXiv:2504.03600 • 8 upvotes
SmolVLM: Redefining small and efficient multimodal models • arXiv:2504.05299 • 176 upvotes
One-Minute Video Generation with Test-Time Training • arXiv:2504.05298 • 99 upvotes
Rethinking Reflection in Pre-Training • arXiv:2504.04022 • 77 upvotes
Why Reasoning Matters? A Survey of Advancements in Multimodal Reasoning (v1) • arXiv:2504.03151 • 13 upvotes
Skywork R1V: Pioneering Multimodal Reasoning with Chain-of-Thought • arXiv:2504.05599 • 81 upvotes
DDT: Decoupled Diffusion Transformer • arXiv:2504.05741 • 73 upvotes
OmniCaptioner: One Captioner to Rule Them All • arXiv:2504.07089 • 20 upvotes
VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning • arXiv:2504.06958 • 10 upvotes
Are We Done with Object-Centric Learning? • arXiv:2504.07092 • 6 upvotes
Kimi-VL Technical Report • arXiv:2504.07491 • 122 upvotes
DeepSeek-R1 Thoughtology: Let's <think> about LLM Reasoning • arXiv:2504.07128 • 82 upvotes
VCR-Bench: A Comprehensive Evaluation Framework for Video Chain-of-Thought Reasoning • arXiv:2504.07956 • 45 upvotes
MM-IFEngine: Towards Multimodal Instruction Following • arXiv:2504.07957 • 34 upvotes
Seaweed-7B: Cost-Effective Training of Video Generation Foundation Model • arXiv:2504.08685 • 122 upvotes
GigaTok: Scaling Visual Tokenizers to 3 Billion Parameters for Autoregressive Image Generation • arXiv:2504.08736 • 47 upvotes
FUSION: Fully Integration of Vision-Language Representations for Deep Cross-Modal Understanding • arXiv:2504.09925 • 38 upvotes
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models • arXiv:2504.10479 • 249 upvotes
VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning • arXiv:2504.08837 • 42 upvotes
TinyLLaVA-Video-R1: Towards Smaller LMMs for Video Reasoning • arXiv:2504.09641 • 16 upvotes
xVerify: Efficient Answer Verifier for Reasoning Model Evaluations • arXiv:2504.10481 • 84 upvotes
Genius: A Generalizable and Purely Unsupervised Self-Training Framework For Advanced Reasoning • arXiv:2504.08672 • 54 upvotes
Pixel-SAIL: Single Transformer For Pixel-Grounded Understanding • arXiv:2504.10465 • 28 upvotes
Efficient Reasoning Models: A Survey • arXiv:2504.10903 • 18 upvotes
CLIMB: CLustering-based Iterative Data Mixture Bootstrapping for Language Model Pre-training • arXiv:2504.13161 • 87 upvotes
VistaDPO: Video Hierarchical Spatial-Temporal Direct Preference Optimization for Large Video Models • arXiv:2504.13122 • 21 upvotes
ReTool: Reinforcement Learning for Strategic Tool Use in LLMs • arXiv:2504.11536 • 58 upvotes
ToolRL: Reward is All Tool Learning Needs • arXiv:2504.13958 • 40 upvotes
VisuLogic: A Benchmark for Evaluating Visual Reasoning in Multi-modal Large Language Models • arXiv:2504.15279 • 67 upvotes
TTRL: Test-Time Reinforcement Learning • arXiv:2504.16084 • 91 upvotes
Describe Anything: Detailed Localized Image and Video Captioning • arXiv:2504.16072 • 53 upvotes
Eagle 2.5: Boosting Long-Context Post-Training for Frontier Vision-Language Models • arXiv:2504.15271 • 63 upvotes
Paper2Code: Automating Code Generation from Scientific Papers in Machine Learning • arXiv:2504.17192 • 92 upvotes
IV-Bench: A Benchmark for Image-Grounded Video Perception and Reasoning in Multimodal LLMs • arXiv:2504.15415 • 21 upvotes
The Bitter Lesson Learned from 2,000+ Multilingual Benchmarks • arXiv:2504.15521 • 60 upvotes