Research Areas

Generalisation of Chain-of-Thought Obfuscation

We are investigating whether chain-of-thought (CoT) obfuscation learned in one domain generalizes to safety-critical contexts. When models are trained against CoT monitors to improve outputs, they can learn to hide problematic reasoning while preserving undesirable behaviors—a phenomenon called obfuscation. Our research examines whether this learned obfuscation transfers across domains, particularly from benign training settings (like content moderation) to high-stakes scenarios (like deceptive planning). We will employ a two-stage experimental approach: first, we replicate and extend existing results showing that reward hacking behaviors and their obfuscation can occur under CoT monitoring pressure; second, we test whether obfuscation learned in low-stakes domains compromises the monitorability of CoT in safety-critical applications. This builds on our previous work on lateral generalization of steganography behaviors. If obfuscation consistently generalizes across domains, it would indicate that any optimization pressure on CoT, even for benign objectives, could undermine a key safety mechanism for detecting misaligned model behavior. Our findings will inform best practices for maintaining CoT transparency in AI control protocols.
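As a rough sketch of the two-stage design, the Python below illustrates how optimization pressure might be applied to the CoT in a benign domain (stage one) and how monitorability could then be measured in a held-out safety-critical domain (stage two). All names here are illustrative assumptions rather than our actual code: the monitor, the sampler, and the task objects are stand-ins.

```python
from typing import Callable, Iterable, Tuple

def cot_penalized_reward(task_reward: float, cot: str,
                         monitor: Callable[[str], bool],
                         penalty: float = 1.0) -> float:
    """Stage 1: training reward = task success minus a penalty whenever the
    CoT monitor flags the reasoning. Optimizing against this signal is the
    pressure under which obfuscation can emerge."""
    return task_reward - penalty * float(monitor(cot))

def monitorability(sample: Callable[[object], Tuple[str, str]],
                   monitor: Callable[[str], bool],
                   safety_tasks: Iterable) -> float:
    """Stage 2: fraction of misaligned completions the CoT monitor still
    catches on a held-out safety-critical domain (e.g. deceptive planning)."""
    caught = misaligned = 0
    for task in safety_tasks:
        cot, answer = sample(task)      # untrusted model rollout: (CoT, answer)
        if task.is_misaligned(answer):  # ground-truth label available in the eval
            misaligned += 1
            caught += int(monitor(cot))
    return caught / max(misaligned, 1)

# Obfuscation generalizes if monitorability drops for the model trained with
# cot_penalized_reward in the benign domain, relative to an untrained baseline.
```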

People: Cameron Tice, Puria Radmard

For more on generalisation of chain-of-thought obfuscation, read our paper here.

Interested in CoT obfuscation? We’re hiring!

Chain-of-Thought Health Metrics (MARS 3.0)

In current reasoning models, the model outputs a chain of thought (CoT) of reasoning steps before producing its final answer. The CoT is typically human readable, and could provide a good way to monitor for model misbehaviour. However, there are several potential failure modes that could cause the CoT to become “unfaithful” (i.e., not representative of what the model actually thinks internally). We identify four types of failures, or CoT “pathologies”: post-hoc and internalized reasoning, and syntactic and semantic steganography. We design a set of metrics that respond to increases in these pathologies, which indirectly measure how pathological or faithful the CoTs are. We then train models to exhibit each of these pathologies, and evaluate their scores on our metrics compared to typical CoTs. If our metrics prove useful, they could greatly increase visibility into the safety behaviors of frontier models’ chains of thought, which would be good news for mitigating loss of control risks. We will release our system as an open source library, enabling practitioners to integrate the measurements into large training runs.
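To make the intended shape of the library concrete, here is a minimal sketch of what a metric interface could look like. The names (CoTSample, HealthMetric, post_hoc_score, score_batch) are hypothetical placeholders rather than the released API, and the post-hoc probe is just one simple example of the kind of signal a metric might use.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class CoTSample:
    prompt: str
    cot: str
    answer: str

# A health metric maps a sample to a score that is expected to shift as one of
# the four pathologies (post-hoc reasoning, internalized reasoning, syntactic
# or semantic steganography) becomes more pronounced.
HealthMetric = Callable[[CoTSample], float]

def post_hoc_score(sample: CoTSample, answer_fn: Callable[[str], str]) -> float:
    """Example probe for post-hoc reasoning: if the model returns the same
    answer when its CoT is withheld, the CoT may not be load-bearing."""
    return float(answer_fn(sample.prompt) == sample.answer)

def score_batch(samples: List[CoTSample],
                metrics: Dict[str, HealthMetric]) -> Dict[str, float]:
    """Average each metric over a batch, e.g. logged alongside other
    statistics during a large training run."""
    return {name: sum(metric(s) for s in samples) / len(samples)
            for name, metric in metrics.items()}

# Usage: score_batch(samples, {"post_hoc": lambda s: post_hoc_score(s, no_cot_answer)})
```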

People: David Williams-King, Linh Le, Ida Caspary, Manqing Liu

Externalisation (MARS 3.0)

Our research develops early exit mechanisms to increase transparency in language model reasoning and motivate more complete chain of thought externalization. We are training models to externalize their internal reasoning processes by exiting early during their forward pass, making their decision-making more interpretable and auditable. We use a two-phase approach: first, supervised fine-tuning teaches models to identify where they can exit early while maintaining output quality; second, reinforcement learning incentivizes even earlier exits, forcing models to output their reasoning externally rather than processing it across hidden layers. This research targets the deceptive alignment problem, where opaque reasoning in advanced models results in misaligned behavior that is difficult to detect. By encouraging earlier externalization of reasoning, we provide greater visibility into how models reach their conclusions. The work builds on established research showing that language models perform sequential reasoning across internal layers, with this reasoning being recoverable through intermediate layer probes. If successful, this work will produce an open-source framework for training models that generate human-legible reasoning chains corresponding to their internal computational steps, enabling more effective interpretability and monitoring tools.
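As a small illustration of the intermediate-layer recoverability this work builds on, the logit-lens-style snippet below decodes the residual stream of a small open model at each layer; the model choice is arbitrary and the snippet is a demonstration of the underlying phenomenon, not our training pipeline.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Decode what a model "currently predicts" at each intermediate layer by
# projecting that layer's residual stream through the final layer norm and
# the unembedding matrix (the "logit lens").
name = "gpt2"  # illustrative; any similar decoder-only model would do
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, output_hidden_states=True)

inputs = tok("The capital of France is", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)

# out.hidden_states holds the embedding output plus every block's output;
# the last entry is the model's usual output path, so we look at the rest.
for layer, hidden in enumerate(out.hidden_states[:-1]):
    logits = model.lm_head(model.transformer.ln_f(hidden[:, -1]))
    print(layer, tok.decode(logits.argmax(dim=-1)))
```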

People: Liza Pavlova, Karthik Viswanathan, Mariia Koroliuk

Chain-of-Thought Injection for AI Control (MARS 3.0)

We investigate whether LLMs with reasoning capabilities can be effectively steered using chain-of-thought (CoT) injections. Building on existing work showing LLMs' susceptibility to CoT injections for jailbreaking, we explore what components are necessary for steering untrusted models away from misaligned behavior. Our experimental setup uses two models: an untrusted model and a trusted monitor model. The untrusted model outputs CoT reasoning traces before producing an answer. When answers or reasoning traces are suspicious, the trusted monitor model identifies a point within the CoT to inject steering text. The monitor then crafts injection text intended to steer the untrusted model away from undesired behaviors. Following injection, the untrusted model produces a new answer, ideally free of malicious content. We will be performing experiments using ControlArena's safety-oriented tasks, focusing initially on the BashBench benchmark. If successful, this research could establish CoT injections as an additional AI control method, providing a scalable approach for steering potentially misaligned models toward safer outputs through targeted reasoning interventions.
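A schematic version of the protocol is sketched below; the callables standing in for the untrusted model and the trusted monitor are assumptions for illustration only, and do not reflect ControlArena's actual interfaces.

```python
from typing import Callable, Tuple

def cot_injection_protocol(
    untrusted_generate: Callable[[str, str], Tuple[str, str]],  # (task, CoT prefix) -> (CoT, answer)
    monitor_suspicion: Callable[[str, str], float],             # (CoT, answer) -> score in [0, 1]
    monitor_intervene: Callable[[str], Tuple[int, str]],        # CoT -> (injection offset, steering text)
    task: str,
    threshold: float = 0.5,
) -> str:
    # First pass: the untrusted model reasons and answers freely.
    cot, answer = untrusted_generate(task, "")
    if monitor_suspicion(cot, answer) < threshold:
        return answer

    # The trusted monitor picks a point in the CoT and crafts steering text.
    offset, injection = monitor_intervene(cot)
    steered_prefix = cot[:offset] + injection

    # Second pass: resample the answer from the injected reasoning prefix.
    _, new_answer = untrusted_generate(task, steered_prefix)
    return new_answer
```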

People: Lindsay Smith, Ananya Malik

Self-Generated Text Recognition Across Operational Contexts (MARS 3.0)

We are investigating how large language models (LLMs) recognize their own outputs across different definitions and contexts of "self". While self-recognition has been narrowly studied as models identifying their outputs from candidate sets, we hypothesize that LLMs possess a unified representation of self that governs the production of self-generated text (SGT) and spans multiple operationalizations: different model instances, fine-tuned variants, and text appearing in various contextual locations (user messages, assistant messages, chain-of-thought). Our research examines recognition accuracy through experiments across varying priming methods (explicit prompting versus implicit behavioral measures), contextual locations and notions of “self”. Preliminary results reveal some interesting effects, such as models continuing in the syntactic style of injected content while ultimately rejecting it as foreign, suggesting recognition capabilities that fluctuate even within the local contexts of single completions. By establishing whether self-recognition shares common mechanisms across these variations, we aim to determine if SGT recognition can be jointly studied and mitigated rather than requiring separate defenses for each operational context. Our findings will directly inform the effectiveness of monitoring techniques in the presence of SGT, such as untrusted monitoring, and the robustness of control protocols, like trusted (CoT) editing.
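For the explicit-prompting operationalization, a single recognition trial can be sketched as below; the prompt wording, helper names, and loose answer parsing are assumptions made for illustration, and the implicit behavioral measures are not shown.

```python
import random
from typing import Callable, List, Tuple

def recognition_trial(ask: Callable[[str], str],
                      own_text: str,
                      foil_texts: List[str]) -> bool:
    """Ask the model to pick its own completion out of a candidate set."""
    candidates = foil_texts + [own_text]
    random.shuffle(candidates)
    listing = "\n".join(f"({i + 1}) {c}" for i, c in enumerate(candidates))
    prompt = ("One of the following texts was written by you; the others were not.\n"
              f"{listing}\nAnswer with the number of the text you wrote.")
    reply = ask(prompt)
    return str(candidates.index(own_text) + 1) in reply  # loose parsing for the sketch

def recognition_accuracy(ask: Callable[[str], str],
                         trials: List[Tuple[str, List[str]]]) -> float:
    """Accuracy over (own_text, foil_texts) pairs drawn from one 'self' condition."""
    return sum(recognition_trial(ask, own, foils) for own, foils in trials) / len(trials)
```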

People: Sohaib Imran, Jesse St. Amand