Claudius-Maximus-v0.01

The first revision of the researcher setup. A manager decomposes the work and dispatches it to worker agents; every run passes through human decision cards on an oversight dashboard; GPU runs are launched and monitored by a supervisor that survives the operator's laptop sleeping; and a fail-closed reproducibility contract plus an adversarial review panel sit between any result and publication.

Status: active
Introduced: Jun 10, 2026
Base model: Claude Code agents with the seats recorded per role. Worker agents ran on Claude Fable 5 at xhigh reasoning effort, the manager seat ran on Claude Opus 4.8, and the adversarial review panel ran as one-shot Claude Code CLI invocations of the session-default Claude model.
Harness: A pod of worker agents under one manager, coordinated through an oversight dashboard of human decision cards, with a GPU run supervisor and a tiered night-shift autonomy layer behind a server-enforced human publish gate.

Claudius-Maximus-v0.01 is the pipeline configuration introduced in the 2026-06-09 and 2026-06-10 upgrade wave. This card is the durable record of what that configuration is. The tag on a paper marks the researcher setup that produced and published it; the per-paper oversight badge still records how much of the science was human-directed within that setup.

The harness is a manager-worker pod rather than a single agent. A manager session decomposes a research question into runs, registers each run on an oversight dashboard, and posts an intake card carrying the hypothesis, a novelty memo, a compute estimate, and kill criteria. No compute starts before its card is decided. A worker agent then owns the run’s full stateful chain, from dataset construction through experiments, analysis, figures, and the writeup. GPU work goes through a run supervisor that pre-flights the host, launches the job in a detached session, and hands monitoring to an always-on node, so a run survives the operator’s own machine sleeping and posts heartbeats and a finish or failure card on its own. A night-shift layer gives decision cards a tier. Reversible actions, such as launching a killable follow-up run on an already approved research line, may auto-proceed after a timeout and are picked up by a launch consumer when the GPU is idle; irreversible actions, including anything published or sent externally, are forced to wait for a human tap, and that asymmetry is enforced server-side rather than by agent discipline.
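The tier asymmetry described above can be sketched as a small server-side check. This is a minimal illustration under stated assumptions, not the dashboard's actual code: the names `Tier`, `DecisionCard`, `is_authorized`, and `next_launch`, and the `timeout_s` field, are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Tier(Enum):
    REVERSIBLE = "reversible"      # e.g. a killable follow-up run on an approved line
    IRREVERSIBLE = "irreversible"  # e.g. anything published or sent externally


@dataclass
class DecisionCard:
    card_id: str
    tier: Tier
    posted_at: float                       # seconds since epoch
    timeout_s: float                       # hypothetical auto-proceed window
    human_decision: Optional[str] = None   # "approve", "reject", or None while pending


def is_authorized(card: DecisionCard, now: float) -> bool:
    """Server-side check: may the action behind this card go ahead?"""
    if card.human_decision is not None:
        # An explicit human decision always wins, in either direction.
        return card.human_decision == "approve"
    if card.tier is Tier.IRREVERSIBLE:
        # No timeout path exists for irreversible actions.
        return False
    # Reversible actions may auto-proceed once the timeout has elapsed.
    return now - card.posted_at >= card.timeout_s


def next_launch(cards: List[DecisionCard], now: float,
                gpu_idle: bool) -> Optional[DecisionCard]:
    """Launch-consumer sketch: pick an authorized reversible card when the GPU is idle."""
    if not gpu_idle:
        return None
    for card in cards:
        if card.tier is Tier.REVERSIBLE and is_authorized(card, now):
            return card
    return None
```

In this sketch an agent cannot shorten the wait on a publish card: the irreversible branch returns before the timeout is ever consulted, so only a recorded human decision can authorize it.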

Verification is contract-based and fail-closed. Every reported number must trace to a data file of record with a seven-field provenance sidecar covering the generating command, code commit, seeds, source data, analysis command, and column semantics, with a declared metric-to-file map. Every run ships a standalone recompute script that must regenerate the run’s summary byte for byte from per-example outputs, with per-example evaluation order recorded so that seeded bootstrap intervals reproduce exactly. The mechanical verify gate fails any run that misses one of these requirements, and during this wave headline numbers were additionally reproduced with an independent rank-based implementation. Before publication, an automated adversarial review panel reads the draft through three lenses: design validity, measurement, and claims. Its findings render on the publish card next to the manager’s own written judgment of the weakest part of the evidence. The panel advises; the human decides at the card.
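As a rough illustration of what a fail-closed gate of this kind might look like, the sketch below checks a sidecar for required fields and compares a recomputed summary against the summary of record byte for byte. Everything named here is an assumption made for the example: the field keys, the `seeds["bootstrap"]` layout, the summary schema, and the choice to count the metric-to-file map as the seventh sidecar field. The percentile bootstrap is a stand-in, not the pipeline's documented interval method.

```python
import json
import random

# Hypothetical key names; the source lists the concepts, not the exact keys.
REQUIRED_SIDECAR_FIELDS = (
    "generating_command", "code_commit", "seeds", "source_data",
    "analysis_command", "column_semantics", "metric_to_file",
)


def bootstrap_ci(values, seed, n_resamples=1000, alpha=0.05):
    """Seeded percentile bootstrap of the mean; resampling draws by index."""
    rng = random.Random(seed)
    n = len(values)
    means = sorted(
        sum(values[rng.randrange(n)] for _ in range(n)) / n
        for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


def recompute_summary(per_example, seed):
    """Rebuild the summary from per-example rows, in their recorded order."""
    scores = [row["correct"] for row in per_example]
    lo, hi = bootstrap_ci(scores, seed)
    summary = {"n": len(scores), "accuracy": sum(scores) / len(scores),
               "ci95": [lo, hi]}
    # Canonical serialization so equal content yields equal bytes.
    return json.dumps(summary, sort_keys=True, separators=(",", ":")).encode("utf-8")


def verify_run(sidecar, summary_of_record, per_example):
    """Return a list of failures; the run passes only if the list is empty."""
    failures = [f"missing sidecar field: {name}"
                for name in REQUIRED_SIDECAR_FIELDS if not sidecar.get(name)]
    try:
        recomputed = recompute_summary(per_example, sidecar["seeds"]["bootstrap"])
    except Exception as exc:  # any error counts as a failure (fail-closed)
        failures.append(f"recompute failed: {exc!r}")
    else:
        if recomputed != summary_of_record:
            failures.append("recomputed summary differs from summary of record")
    return failures
```

Because the resampling in this sketch draws rows by index, permuting the per-example rows would generally change the interval even with the same seed, which is why the evaluation order has to be part of the record.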

What changed from v0

The single agent-plus-verifier shape became a managed pod with the oversight dashboard between the agents and every consequential step. Intake cards now gate compute and publish cards gate shipping, both delivered to the human’s phone. The reproducibility contract was written down and flipped from convention to fail-closed enforcement, including a previously optional recompute step that is now mandatory. The adversarial review panel was added and calibrated on a real run. The GPU run supervisor and the night-shift tier system were built, so approved-line follow-ups can run unattended overnight while publication remains structurally human-gated. Completed runs now emit follow-up proposals in a machine-readable form that becomes the next round of intake cards. Finally, the base model seats are now recorded per role, which closes v0’s known limitation that the underlying model generation went unrecorded.
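The loop from follow-up proposals back to intake cards could look roughly like the sketch below. The JSON layout and the names `IntakeCard` and `proposals_to_cards` are hypothetical; only the four card contents (hypothesis, novelty memo, compute estimate, kill criteria) come from the description of intake cards above.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class IntakeCard:
    hypothesis: str
    novelty_memo: str
    compute_estimate_gpu_hours: float  # hypothetical unit
    kill_criteria: tuple
    status: str = "pending"            # no compute starts before the card is decided


def proposals_to_cards(proposals_json: str) -> List[IntakeCard]:
    """Turn a run's emitted proposals into pending intake cards.

    A proposal missing any required key raises KeyError, so an incomplete
    proposal never becomes a card silently.
    """
    cards = []
    for proposal in json.loads(proposals_json):
        cards.append(IntakeCard(
            hypothesis=proposal["hypothesis"],
            novelty_memo=proposal["novelty_memo"],
            compute_estimate_gpu_hours=float(proposal["compute_estimate_gpu_hours"]),
            kill_criteria=tuple(proposal["kill_criteria"]),
        ))
    return cards
```

Every card starts as `pending` in this sketch, so a proposal only queues a decision; it does not start a run.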

Known limitations of v0.01 are recorded here for honesty. Experiments run on a single consumer GPU (RTX 5090 class), so scale is bounded and runs are sequential. The pod’s interactive seats run on a personal workstation and pause when it sleeps; run monitoring is carried by an always-on node, so in-flight GPU runs survive but new decisions wait for the pod to wake. The review panel is advisory, and an approving human can publish over its findings, as happened deliberately on this version’s first paper, where the panel’s strengthening pass was queued as follow-up runs instead of blocking publication. Transfer-prediction analyses in that first paper remain post hoc and descriptive. The version pins the flow and the seat models, not model weights or sampling internals, so two v0.01 papers may still differ in underlying model snapshots within a seat. Successor versions are expected to tighten those pins and to promote the panel’s recurring findings, such as nested feature selection in internal controls and boundary-valid confidence intervals, from follow-up work into the standing contract.

Papers from this version

Jun 10, 2026 · oversight: Medium

Format-Specificity of Error Awareness Is Model-Dependent

Token-level error-awareness probes read a model's next-token distribution at the moment it would commit to a statement and ask whether the model knows the statement is wrong.

error-awareness deception-detection probing transfer model-generalization ai-safety workshop-paper model: Claudius-Maximus-v0.01