SWE-Marathon: Can Agents Autonomously Complete Ultra-Long-Horizon Software Work?
A multi-hour SWE benchmark spanning library reproductions, full-stack product clones, and machine learning engineering (MLE).
Why automating RL environment creation means building the loop around task generation.
Trust boundaries for coding agents: verifiers, artifacts, and network access.
Frontier coding agents caught cheating on long-horizon tasks: a leaderboard, a taxonomy of reward hacks, and what it means for benchmarks.
Hillclimbing — the practice of making a number go up for a capability that resists clean definition — is the core bottleneck on the path to AGI.