HaluMem: Evaluating Hallucinations in Memory Systems of Agents • Paper • 2511.03506 • Published Nov 5, 2025 • 93
The Tool Decathlon: Benchmarking Language Agents for Diverse, Realistic, and Long-Horizon Task Execution • Paper • 2510.25726 • Published Oct 29, 2025 • 45
DeepAgent: A General Reasoning Agent with Scalable Toolsets • Paper • 2510.21618 • Published Oct 24, 2025 • 99
The Illusion of Diminishing Returns: Measuring Long Horizon Execution in LLMs • Paper • 2509.09677 • Published Sep 11, 2025 • 34
Glyph: Scaling Context Windows via Visual-Text Compression • Paper • 2510.17800 • Published Oct 20, 2025 • 67
A Theoretical Study on Bridging Internal Probability and Self-Consistency for LLM Reasoning • Paper • 2510.15444 • Published Oct 17, 2025 • 147
Beyond Ten Turns: Unlocking Long-Horizon Agentic Search with Large-Scale Asynchronous RL • Paper • 2508.07976 • Published Aug 11, 2025 • 51
In-the-Flow Agentic System Optimization for Effective Planning and Tool Use • Paper • 2510.05592 • Published Oct 7, 2025 • 106
Fathom-DeepResearch: Unlocking Long Horizon Information Retrieval and Synthesis for SLMs • Paper • 2509.24107 • Published Sep 28, 2025 • 78
BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents • Paper • 2504.12516 • Published Apr 16, 2025 • 1
BrowseComp-ZH: Benchmarking Web Browsing Ability of Large Language Models in Chinese • Paper • 2504.19314 • Published Apr 27, 2025 • 7