Terminal Wrench: A Dataset of 331 Reward-Hackable Environments and 3,632 Exploit Trajectories Paper • 2604.17596 • Published 5 days ago • 1
MessIRve: A Large-Scale Spanish Information Retrieval Dataset Paper • 2409.05994 • Published Sep 9, 2024
HardTests: Synthesizing High-Quality Test Cases for LLM Coding Paper • 2505.24098 • Published May 30, 2025 • 43