Blog — Abundant

SWE-Marathon: Can Agents Autonomously Complete Ultra-Long-Horizon Software Work?

A multi-hour software engineering benchmark spanning library reproductions, full-stack product clones, and machine learning engineering (MLE) tasks.

Why automating RL environment creation means building the loop around task generation.

Trust boundaries for coding agents: verifiers, artifacts, and network access.

Frontier coding agents caught cheating on long-horizon tasks: a leaderboard, a taxonomy of reward hacks, and what it means for benchmarks.

Hillclimbing — the practice of making a number go up for a capability that resists clean definition — is the core bottleneck on the path to AGI.