Search Discipline for Long-Horizon Research Agents

Motivation

We started by asking whether an autoresearch agent could be made more creative by prompting it to step back, draw structural analogies, and look for mechanisms outside its local search frame. The test bed was a burned-area submodel inside the Ecosystem Demography pipeline. The agent began from a strong baseline, Model C, and tried to improve a single global formula under a fixed input contract. The contract ruled out routing by region, memorized cells, new external inputs, and residual patches. Different regimes had to fall out of the same global formula reacting to allowed input-derived state.

That setup made the project a real test of long-horizon agentic research. The agent could run real experiments, propose real candidates, and produce a final report on its own. The question was whether the report would be the right one.

What we found

Creativity was not the bottleneck. Discipline was. The agent could find genuine candidates and still pick the wrong one because the aggregate verifier was too coarse.

The clearest case came from a monitor-governed run with two candidates from the same mechanism family, tuned at different strengths. On the global score they sat within 0.0007 of each other, well inside the evaluation noise band. In the two protected boreal forest regions, however, the higher-scoring candidate dropped fidelity by about 0.10 and 0.07, while the lower-scoring candidate stayed within about 0.006 and 0.003 of the baseline. A selector reading only the aggregate would ship the candidate that breaks the boreal forest. The disaggregated regional evidence said the lower-scoring one was the defensible decision.
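
The comparison above can be sketched as a selector that treats aggregate scores within the noise band as ties and breaks them on protected-region damage. This is a minimal illustration and not the project's actual selector. The names Candidate, score_winner, and defended_choice, the region labels, and the NOISE_BAND and MAX_DROP values are all assumptions. Only the 0.0007 gap and the four regional figures come from the run described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    aggregate: float       # global score, higher is better
    protected_drops: dict  # region -> fidelity lost relative to the baseline


def worst_drop(c):
    """Largest fidelity loss across the protected regions."""
    return max(c.protected_drops.values(), default=0.0)


def score_winner(candidates):
    """What a selector reading only the aggregate would ship."""
    return max(candidates, key=lambda c: c.aggregate)


def defended_choice(candidates, noise_band, max_drop):
    """Among candidates tied within noise_band of the best aggregate,
    prefer those within max_drop, then pick the smallest worst-case drop."""
    best = score_winner(candidates).aggregate
    tied = [c for c in candidates if best - c.aggregate <= noise_band]
    safe = [c for c in tied if worst_drop(c) <= max_drop]
    return min(safe or tied, key=worst_drop)


# Absolute aggregate values are placeholders; only the 0.0007 gap is reported.
# Region labels are hypothetical. The lower-scoring candidate's deviations
# (0.006, 0.003) are treated here as drops for simplicity.
high = Candidate("higher_scoring", 0.0007,
                 {"protected_a": 0.10, "protected_b": 0.07})
low = Candidate("lower_scoring", 0.0,
                {"protected_a": 0.006, "protected_b": 0.003})

NOISE_BAND = 0.002  # assumed; the text says only that 0.0007 is well inside it
MAX_DROP = 0.01     # assumed tolerance for protected-region fidelity loss

winner = score_winner([high, low])
defended = defended_choice([high, low], NOISE_BAND, MAX_DROP)
inverted = winner.name != defended.name
print(winner.name, defended.name, inverted)
```

With these inputs the aggregate-only selector and the region-aware selector disagree, which is the situation the run exposed.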

We call this aggregate-verifier inversion. The aggregate ranks one candidate first while the disaggregated scientific evidence says another candidate is the better choice. The fix is a search-discipline protocol built around a candidate-effect audit. At every serious candidate boundary, the agent records helped regimes, harmed regimes, unchanged regimes, which score components moved, what allowed input-derived state separates helped from harmed, and which of five roles the candidate fills: score winner, defended candidate, tradeoff, informative failure, or rejected shortcut. An external monitor enforces the audit and is allowed to demote score winners and reopen runs the agent has already declared finished.
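
One way to picture the audit is as a record the agent must fill in and a gate the monitor applies before acceptance. The sketch below is hypothetical: the field names, the monitor_review function, and the rule that a defended candidate may not harm a protected regime are assumptions made for illustration, not the protocol as implemented in the paper. The five roles are the ones listed above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional, Tuple


class Role(Enum):
    SCORE_WINNER = "score winner"
    DEFENDED_CANDIDATE = "defended candidate"
    TRADEOFF = "tradeoff"
    INFORMATIVE_FAILURE = "informative failure"
    REJECTED_SHORTCUT = "rejected shortcut"


@dataclass
class CandidateEffectAudit:
    candidate: str
    helped: List[str]            # regimes that improved
    harmed: List[str]            # regimes that got worse
    unchanged: List[str]         # regimes that did not move
    moved_components: List[str]  # score components that changed
    separating_state: str        # allowed input-derived state separating helped from harmed
    role: Optional[Role] = None


def monitor_review(audit: CandidateEffectAudit,
                   protected: List[str]) -> Tuple[bool, List[str]]:
    """Return (accepted, reasons). An incomplete audit is never accepted."""
    reasons = []
    if audit.role is None:
        reasons.append("role not assigned")
    if not (audit.helped or audit.harmed or audit.unchanged):
        reasons.append("no regime effects recorded")
    if not audit.moved_components:
        reasons.append("no score components recorded")
    if not audit.separating_state.strip():
        reasons.append("separating state not described")
    # Assumed rule: a candidate that harms a protected regime cannot be defended.
    harmed_protected = sorted(set(audit.harmed) & set(protected))
    if audit.role is Role.DEFENDED_CANDIDATE and harmed_protected:
        reasons.append("harms protected regimes: " + ", ".join(harmed_protected))
    return (not reasons, reasons)


# Hypothetical regime and component names, for illustration only.
audit = CandidateEffectAudit(
    candidate="higher_scoring",
    helped=["weak_region"],
    harmed=["protected_a", "protected_b"],
    unchanged=[],
    moved_components=["global_score"],
    separating_state="not yet identified",
    role=Role.DEFENDED_CANDIDATE,
)
accepted, reasons = monitor_review(audit, protected=["protected_a", "protected_b"])
print(accepted, reasons)
```

The point of the design is that the gate checks whether the effects were recorded and justified. It does not try to decide whether the formula is scientifically true.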

What was surprising

Three things stood out.

The static reframe prompt did change what the agent tried. It made release windows, guarded mechanisms, corridors, and handoffs more available as hypotheses. But it did not reliably govern stopping. After a few loops the agent could still collapse back into local search.

Weak-region repair was easy to induce. Clean separation from protected true-fire regimes was hard. The recurring tradeoff was sharper than a simple scoring problem. Repairing one regime often damaged another, and the aggregate score hid the swap.

The most valuable role for the monitor was not inventing formulas. It was forcing the main agent to justify candidate effects before acceptance or stopping. The monitor does not need to score truth. It needs to enforce a protocol.

Next steps

The first paper is intentionally narrow. It documents a failure mode and a protocol. Several questions remain open.

Whether the current input contract is rich enough to separate weak false-positive regimes from protected true-fire regimes without damaging one side is unresolved. A proper ceiling study needs a different setup from the paper runs.

Whether the monitor improves the search frontier or only the selection step is also unresolved. The current evidence supports the governance claim more strongly than a frontier-expansion claim.

A stronger follow-up would separate three questions cleanly. Does structural prompting change hypothesis generation? Does external monitoring change stopping and candidate acceptance? Does the candidate-effect audit move the score frontier? Those are different experiments and should be run as such.

The full argument, the experiment log, and the artifact discipline live in the preprint on arXiv.