RL environment creation is becoming continuous QA
There has been a lot of discussion about automating RL environment creation. Most of it focuses on task generation: can an agent write the instructions, verifier, and reference solution?
That is only the first step. A generated environment is not useful until it has survived contact with frontier agents. A pass is not automatically evidence that the task is good; it may reveal a shortcut, ambiguity, or weak verifier. A failure is not automatically evidence that the task is hard; it may reveal missing instructions, brittle tests, or an impossible setup.
The core loop is:
Generate a task, run agents against it, read the traces, revise the environment, and rerun until the failures mean what you think they mean.
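As a rough illustration, the loop can be written as a small driver function. This is a minimal sketch, not Oddish's API: `Trial`, `Environment`, `refine`, and the three callbacks are hypothetical names, and a real system would run trials in sandboxes rather than in-process.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Trial:
    agent: str
    passed: bool
    trace: str  # raw log of what the agent did


@dataclass
class Environment:
    instructions: str
    revision: int = 0


def refine(
    env: Environment,
    run_trials: Callable[[Environment], List[Trial]],
    review: Callable[[Environment, List[Trial]], List[str]],
    revise: Callable[[Environment, List[str]], Environment],
    max_rounds: int = 5,
) -> Environment:
    """Run trials, review traces, revise, and rerun until review finds no issues."""
    for _ in range(max_rounds):
        trials = run_trials(env)
        issues = review(env, trials)  # e.g. ["shortcut: wraps gcc"]
        if not issues:
            return env  # outcomes now mean what we think they mean
        env = revise(env, issues)
    return env  # round budget exhausted; a human should look at this one
```

The callbacks are where agents plug in: `run_trials` fans out to models, `review` reads traces, and `revise` edits the task.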
The important shift is not just that agents will write environments. It is that agents will operate more of this validation loop: run trials, inspect traces, find shortcuts, propose edits, rerun, and repeat. Human experts still choose what matters, but the tedious middle becomes automated.
That is the direction I have been pushing with my project Oddish: run trials, keep traces, schedule follow-up work, and make environment revision tractable.
Environment Creation Loop
This makes model behavior part of environment creation. You are no longer just asking whether a generated task exists, or whether an agent can solve it once. You are asking what many trajectories reveal about the task's incentives, ambiguity, and test coverage.
That changes what "good" means:
- If every frontier agent fails, the task may be hard, underspecified, flaky, or impossible.
- If every frontier agent passes, the task is probably saturated.
- The useful region is in between: some strong attempts solve it, many miss it, and the failures reflect capability.
For hard coding environments, a pass@1 of roughly 10-30% from frontier agents is a good target. The score alone is not the useful artifact, though: the trace is, because it shows whether an outcome came from capability, ambiguity, brittleness, or an accidental shortcut. Oddish gives the loop a substrate for keeping those traces and rerunning the right follow-up trials.
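A first-pass triage over trial outcomes might look like the sketch below. The 10-30% band comes from the text above; the function name `triage` and the labels it returns are hypothetical, and the label only says where to look next, not what the traces will show.

```python
from typing import Sequence


def pass_rate(outcomes: Sequence[bool]) -> float:
    """Fraction of trials that passed (an empirical pass@1 estimate)."""
    if not outcomes:
        raise ValueError("need at least one trial outcome")
    return sum(outcomes) / len(outcomes)


def triage(outcomes: Sequence[bool], low: float = 0.10, high: float = 0.30) -> str:
    """Map a pass rate to a coarse label that suggests the next check."""
    rate = pass_rate(outcomes)
    if rate == 0.0:
        return "all_fail"      # hard, underspecified, flaky, or impossible
    if rate == 1.0:
        return "all_pass"      # probably saturated, or there is a shortcut
    if rate < low:
        return "below_band"
    if rate > high:
        return "above_band"
    return "target_band"       # still read the traces before trusting it
```

For example, `triage([True, False, False, False, False])` lands in the target band, but only the traces can say whether the four failures reflect capability.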
Automating QA
Once trials are represented as traces with verdicts, environment revision becomes schedulable work:
- A suspicious pass triggers a rerun with a modified harness.
- Brittle failures trigger a job that proposes instruction edits.
- An easy-looking task goes to an adversarial agent whose only job is to find shortcuts.
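One way to make these rules schedulable is a plain lookup from a finding to a follow-up job. This is a sketch under assumptions: the `Finding` values, the job kind strings, and `schedule_follow_ups` are hypothetical names chosen to mirror the three bullets, not Oddish identifiers.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable, List


class Finding(Enum):
    SUSPICIOUS_PASS = "suspicious_pass"
    BRITTLE_FAILURE = "brittle_failure"
    LOOKS_EASY = "looks_easy"


@dataclass(frozen=True)
class FollowUp:
    kind: str
    task_id: str


# Each finding maps to exactly one kind of follow-up job.
FOLLOW_UPS = {
    Finding.SUSPICIOUS_PASS: "rerun_with_modified_harness",
    Finding.BRITTLE_FAILURE: "propose_instruction_edits",
    Finding.LOOKS_EASY: "adversarial_shortcut_search",
}


def schedule_follow_ups(task_id: str, findings: Iterable[Finding]) -> List[FollowUp]:
    """Turn findings into follow-up jobs, at most one job per finding type."""
    jobs: List[FollowUp] = []
    seen = set()
    for finding in findings:
        if finding in seen:
            continue  # ten suspicious passes still need only one harness rerun
        seen.add(finding)
        jobs.append(FollowUp(FOLLOW_UPS[finding], task_id))
    return jobs
```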
The C compiler task is the canonical example. Ask an agent to build a C compiler in Rust, and the first round of trials can show a model "solving" it by wrapping gcc, hardcoding outputs, or exploiting a weak test harness. That is not just a bad score to record; it is a concrete revision request. Add tests that compile unseen programs, run the submission under strace to catch calls out to an existing compiler, constrain network egress, and rerun the same model sweep to see whether the shortcut is gone.
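The strace check can be as simple as scanning the log for `execve` calls to a known compiler. The sketch below assumes the log was captured with something like `strace -f -e trace=execve -o trace.log ./mycc prog.c`, where `./mycc` is a placeholder for the agent's compiler. The function name and the denylist are illustrative assumptions; a real verifier would need a broader list (for example, target-prefixed gcc binaries).

```python
import re
from typing import List

# Matches the path argument of an execve call in strace output, e.g.
#   4243 execve("/usr/bin/gcc", ["gcc", "a.c"], 0x7ffc... /* 10 vars */) = 0
EXECVE = re.compile(r'execve\("([^"]+)"')

# Illustrative denylist of existing C compilers (an assumption, not exhaustive).
FORBIDDEN = {"gcc", "cc", "clang", "tcc"}


def forbidden_execs(strace_log: str) -> List[str]:
    """Return every executed path whose basename is a forbidden compiler.

    Failed attempts (execve returning -1) are reported too, since even
    trying to invoke gcc is evidence of the shortcut.
    """
    hits: List[str] = []
    for line in strace_log.splitlines():
        match = EXECVE.search(line)
        if match and match.group(1).rsplit("/", 1)[-1] in FORBIDDEN:
            hits.append(match.group(1))
    return hits
```

A non-empty result turns a passing trial into a suspicious pass, which then feeds the next revision.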
These probes mean different things, but they share the same interface: assign work to a model and sandbox, produce a verdict, and feed the verdict into the next revision.
Pass@k scores are only one component. Automated environment creation needs the next step too: explain why the score happened and decide what should change before the next run.
Scheduling Thousands of Agent Trials
The loop only matters if it can run often. Once agents handle trials, trace analysis, and revision, scheduling becomes part of environment creation.
A single task may need 10-30 trials across 3 agents before you trust its difficulty. A batch of task drafts can turn into thousands of long-running jobs, followed by more jobs that summarize traces, classify failures, probe suspicious passes, and propose revisions.
At that scale, a single global queue is too blunt. The useful primitive is one queue per model, because capacity and failure modes are model-specific:
- Opus might be slow and scarce.
- Sonnet might be faster and plentiful.
- Gemini might be backed off while OpenAI is healthy.
Oddish handles this with per-model queues. Each model queue has its own concurrency limit, retry policy, and backoff state. If Gemini is waiting on a rate limit, Gemini jobs stay parked while Claude and OpenAI jobs keep moving.
That makes the queue more than a buffer. It becomes the control plane for a heterogeneous evaluation workload:
- Opus running 4 jobs and GPT running 32 should not be averaged into a fake global limit.
- Rate-limited jobs should wait until their retry time, not occupy sleeping workers.
- Healthy model queues should keep running even when another queue is blocked.
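The sketch below shows one way these properties can fall out of per-model state. It is an in-memory illustration under assumptions, not Oddish's implementation: `ModelQueue`, `claim_next`, and their fields are hypothetical names, and time is passed in explicitly so the behavior is easy to follow.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, Dict, Optional, Tuple


@dataclass
class ModelQueue:
    model: str
    max_concurrency: int
    pending: Deque[str] = field(default_factory=deque)
    running: int = 0
    backoff_until: float = 0.0  # nothing is claimed before this time

    def claim(self, now: float) -> Optional[str]:
        """Hand out a job only if there is work, capacity, and no backoff."""
        if not self.pending or self.running >= self.max_concurrency:
            return None
        if now < self.backoff_until:
            return None  # parked: no worker sleeps on this queue
        self.running += 1
        return self.pending.popleft()

    def finish(self) -> None:
        self.running -= 1

    def rate_limited(self, job: str, now: float, retry_after: float) -> None:
        """Requeue the job and park this queue until its retry time."""
        self.running -= 1
        self.pending.appendleft(job)
        self.backoff_until = max(self.backoff_until, now + retry_after)


def claim_next(queues: Dict[str, ModelQueue], now: float) -> Optional[Tuple[str, str]]:
    """Return (model, job) from the first queue that can run something."""
    for queue in queues.values():
        job = queue.claim(now)
        if job is not None:
            return queue.model, job
    return None
```

Because a parked queue simply returns `None`, a worker moves on to the next model instead of waiting. This version always scans queues in insertion order; a fairer scheduler would rotate the starting point.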
Users or agents can submit a sweep once. Oddish breaks the sweep into model-specific trials, puts each trial into the right model queue, and lets workers claim from queues that have capacity.
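Breaking a sweep into model-specific trials is essentially a cross product grouped by model. The following is a minimal sketch with hypothetical names (`TrialSpec`, `expand_sweep`); the real sweep format may carry more fields, such as sandbox images or timeouts.

```python
from dataclasses import dataclass
from itertools import product
from typing import Dict, List, Sequence


@dataclass(frozen=True)
class TrialSpec:
    task_id: str
    model: str
    attempt: int


def expand_sweep(
    task_ids: Sequence[str], models: Sequence[str], trials_per_pair: int
) -> Dict[str, List[TrialSpec]]:
    """Expand a sweep into one list of trial specs per model queue."""
    by_model: Dict[str, List[TrialSpec]] = {model: [] for model in models}
    for task_id, model, attempt in product(task_ids, models, range(trials_per_pair)):
        by_model[model].append(TrialSpec(task_id, model, attempt))
    return by_model
```

Each list can then be enqueued on its own model queue, so a slow or rate-limited model delays only its share of the sweep.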
The result is automated environment creation that is interactive enough to use: submit drafts, run agents, analyze logs, synthesize verdicts, and retry transient failures without manually tracking which provider is unhappy.
Takeaways
Automating RL environment creation means turning task development into continuous QA.
Human experts still seed the task and decide what matters, but the expensive middle can increasingly be delegated: run trials, read traces, detect broken incentives, propose edits, and rerun. Frontier models are now good enough to invalidate shallow task review. If an environment has an ambiguity, leaked shortcut, brittle assertion, or accidental exploit path, agents will find it.
Oddish is my attempt to build the tooling this loop needs: reliable trials, trace capture, provider-aware scheduling, and enough structure for agents to keep improving environments before human review. The full code is open source here under the Apache License 2.0.