MOOSE-Star: Unlocking Tractable Training for Scientific Discovery by Breaking the Complexity Barrier Paper • 2603.03756 • Published 10 days ago • 86
Penguin-VL: Exploring the Efficiency Limits of VLM with LLM-based Vision Encoders Paper • 2603.06569 • Published 7 days ago • 103
VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding Paper • 2501.13106 • Published Jan 22, 2025 • 90
What Is a Good Caption? A Comprehensive Visual Caption Benchmark for Evaluating Both Correctness and Thoroughness Paper • 2502.14914 • Published Feb 19, 2025
MMR1: Enhancing Multimodal Reasoning with Variance-Aware Sampling and Open Resources Paper • 2509.21268 • Published Sep 25, 2025 • 104
N3D-VLM: Native 3D Grounding Enables Accurate Spatial Reasoning in Vision-Language Models Paper • 2512.16561 • Published Dec 18, 2025 • 20
Chain of Ideas: Revolutionizing Research in Novel Idea Development with LLM Agents Paper • 2410.13185 • Published Oct 17, 2024 • 5
Focus on the Whole Character: Discriminative Character Modeling for Scene Text Recognition Paper • 2407.05562 • Published Jul 8, 2024
VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM Paper • 2501.00599 • Published Dec 31, 2024 • 46
AQE: Argument Quadruplet Extraction via a Quad-Tagging Augmented Generative Approach Paper • 2305.19902 • Published May 31, 2023
GlobalWoZ: Globalizing MultiWoZ to Develop Multilingual Task-Oriented Dialogue Systems Paper • 2110.07679 • Published Oct 14, 2021
Multi-Agent Tool-Integrated Policy Optimization Paper • 2510.04678 • Published Oct 6, 2025 • 31
UniME-V2: MLLM-as-a-Judge for Universal Multimodal Embedding Learning Paper • 2510.13515 • Published Oct 15, 2025 • 12
MiroThinker: Pushing the Performance Boundaries of Open-Source Research Agents via Model, Context, and Interactive Scaling Paper • 2511.11793 • Published Nov 14, 2025 • 191
OpenMMReasoner: Pushing the Frontiers for Multimodal Reasoning with an Open and General Recipe Paper • 2511.16334 • Published Nov 20, 2025 • 93
LongVT: Incentivizing "Thinking with Long Videos" via Native Tool Calling Paper • 2511.20785 • Published Nov 25, 2025 • 187