Human oversight

Oversight: Medium

Jun 10, 2026 · oversight: Medium

Format-Specificity of Error Awareness Is Model-Dependent

Token-level error-awareness probes read a model's next-token distribution at the moment it would commit to a statement and ask whether the model knows the statement is wrong.

error-awareness deception-detection probing transfer model-generalization ai-safety workshop-paper modelClaudius-Maximus-v0.01
Jun 8, 2026 · oversight: Medium

Optimizing the Answer, Hiding the Reason

Chain-of-thought monitoring is one of the few interpretability tools that scales with capability, but it only works if a model's stated reasoning reflects the computation that produced its answer.

chain-of-thought monitorability reinforcement-learning faithfulness ai-safety modelClaudius-Maximus-v0