
GEODESIC
The shortest path to impact.
Our Research Areas
Generalisation of Chain-of-Thought Obfuscation
We are investigating whether chain-of-thought (CoT) obfuscation learned in one domain generalizes to safety-critical contexts. When models are trained against CoT monitors to improve outputs, they can learn to hide problematic reasoning while preserving undesirable behaviors—a phenomenon called obfuscation. Our research examines whether this learned obfuscation transfers across domains, particularly from benign training settings (like content moderation) to high-stakes scenarios (like deceptive planning). We will employ a two-stage experimental approach: first, we replicate and extend existing results showing that reward hacking behaviors and their obfuscation can occur under CoT monitoring pressure; second, we test whether obfuscation learned in low-stakes domains compromises the monitorability of CoT in safety-critical applications. This builds on our previous work on lateral generalization of steganography behaviors. If obfuscation consistently generalizes across domains, it would indicate that any optimization pressure on CoT, even for benign objectives, could undermine a key safety mechanism for detecting misaligned model behavior. Our findings will inform best practices for maintaining CoT transparency in AI control protocols.
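As a concrete illustration of the second stage, the sketch below shows how monitor recall could be compared between a benign training domain and a held-out safety-critical domain. The data structures, the keyword-based monitor, and the example domains are illustrative placeholders, not our actual experimental code.

```python
# Illustrative sketch only: the monitor, labels, and domains are placeholders.
from dataclasses import dataclass


@dataclass
class Trace:
    domain: str       # e.g. "content_moderation" (training) or "deceptive_planning" (held out)
    cot: str          # chain-of-thought text from a rollout
    misbehaved: bool  # ground-truth label from the task environment


def keyword_monitor(cot: str) -> bool:
    """Toy stand-in for a trusted CoT monitor that flags suspicious reasoning."""
    suspicious = ("bypass the check", "hide this", "the grader won't notice")
    return any(phrase in cot.lower() for phrase in suspicious)


def monitor_recall(traces: list[Trace], domain: str) -> float:
    """Fraction of genuinely misbehaving episodes in `domain` that the monitor flags."""
    bad = [t for t in traces if t.domain == domain and t.misbehaved]
    if not bad:
        return float("nan")
    return sum(keyword_monitor(t.cot) for t in bad) / len(bad)


# Obfuscation generalises if, for the model trained against the monitor, recall drops
# in the held-out safety domain relative to a baseline trained without monitor pressure.
```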
People: Cameron Tice, Puria Radmard
For more on generalisation of chain-of-thought obfuscation, read our paper here.
Interested in CoT obfuscation? We’re hiring!
Chain-of-Thought Health Metrics (MARS 3.0)
Current reasoning models output a chain of thought (CoT), a sequence of intermediate reasoning steps, before producing the final answer. The CoT is typically human readable and could provide a good way to monitor for model misbehaviour. However, there are several potential failure modes that could cause the CoT to become “unfaithful” (i.e., not representative of what the model actually thinks internally). We identify four types of failures, or CoT “pathologies”: post-hoc and internalized reasoning, and syntactic and semantic steganography. We design a set of metrics that respond to increases in these pathologies, indirectly measuring how pathological or faithful the CoTs are. We then train models to exhibit each of these pathologies and compare their scores on our metrics against those of typical CoTs. If our metrics prove useful, they could greatly increase visibility into the safety behaviors of frontier models’ chains of thought, which would be good news for mitigating loss-of-control risks. We will release our system as an open-source library, enabling practitioners to integrate the measurements into large training runs.
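To give a sense of how such measurements might plug into a training pipeline, here is a minimal sketch of a possible metric interface. The class names and the toy token-entropy proxy are illustrative placeholders, not the released library API.

```python
# Illustrative sketch of a possible metric interface; not the released library.
import math
from collections import Counter
from typing import Protocol


class CoTMetric(Protocol):
    name: str

    def score(self, prompt: str, cot: str, answer: str) -> float: ...


class TokenEntropyMetric:
    """Toy proxy for one pathology: unusually low token diversity in the CoT
    can indicate rigid, encoded phrasing rather than free-form reasoning."""

    name = "token_entropy"

    def score(self, prompt: str, cot: str, answer: str) -> float:
        tokens = cot.split()
        if not tokens:
            return 0.0
        counts = Counter(tokens)
        total = len(tokens)
        return -sum(c / total * math.log2(c / total) for c in counts.values())


def run_metrics(metrics: list[CoTMetric], samples: list[tuple[str, str, str]]) -> dict[str, float]:
    """Average each metric over (prompt, cot, answer) samples drawn from a training run."""
    return {m.name: sum(m.score(*s) for s in samples) / len(samples) for m in metrics}
```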
People: David Williams-King, Linh Le, Ida Caspary, Manqing Liu
Externalisation (MARS 3.0)
Our research develops early exit mechanisms to increase transparency in language model reasoning and motivate more complete chain of thought externalization. We are training models to externalize their internal reasoning processes by exiting early during their forward pass, making their decision-making more interpretable and auditable. We use a two-phase approach: first, supervised fine-tuning teaches models to identify where they can exit early while maintaining output quality; second, reinforcement learning incentivizes even earlier exits, forcing models to output their reasoning externally rather than processing it across hidden layers. This research targets the deceptive alignment problem, where opaque reasoning in advanced models results in misaligned behavior that is difficult to detect. By encouraging earlier externalization of reasoning, we provide greater visibility into how models reach their conclusions. The work builds on established research showing that language models perform sequential reasoning across internal layers, with this reasoning being recoverable through intermediate layer probes. If successful, this work will produce an open-source framework for training models that generate human-legible reasoning chains corresponding to their internal computational steps, enabling more effective interpretability and monitoring tools.
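The intermediate-layer readout this builds on can be illustrated with a short logit-lens style probe. The sketch below assumes GPT-2 via Hugging Face transformers and is not our training code.

```python
# Illustrative probe assuming GPT-2 via Hugging Face transformers; not the project's training code.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

prompt = "The capital of France is"
inputs = tok(prompt, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# Decode the last-token representation at every layer through the final layer norm
# and the unembedding ("logit lens"): intermediate layers often already encode the
# answer, which is the kind of hidden reasoning early exits aim to externalize.
for layer_idx, hidden in enumerate(out.hidden_states):
    h = model.transformer.ln_f(hidden[:, -1, :])
    logits = model.lm_head(h)
    top_token = tok.decode(logits.argmax(dim=-1))
    print(f"layer {layer_idx:2d}: {top_token!r}")
```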
People: Liza Pavlova, Karthik Viswanathan, Mariia Koroliuk
Chain-of-Thought Injection for AI Control (MARS 3.0)
We investigate whether LLMs with reasoning capabilities can be effectively steered using chain-of-thought (CoT) injections. Building on existing work showing LLMs' susceptibility to CoT injections for jailbreaking, we explore which components are necessary for steering untrusted models away from misaligned behaviour. Our experimental setup uses two models: an untrusted model and a trusted monitor model. The untrusted model outputs a CoT reasoning trace before producing an answer. When an answer or reasoning trace is suspicious, the trusted monitor identifies a point within the CoT at which to inject steering text, and crafts injection text intended to steer the untrusted model away from the undesired behaviour. Following injection, the untrusted model produces a new answer, ideally free of the problematic content. We will run experiments on ControlArena's safety-oriented tasks, focusing initially on the BashBench benchmark. If successful, this research could establish CoT injections as an additional AI control method, providing a scalable approach for steering potentially misaligned models toward safer outputs through targeted reasoning interventions.
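A minimal sketch of this injection loop is shown below. The callables standing in for the untrusted model and the trusted monitor are hypothetical placeholders, not our actual implementation.

```python
# Illustrative sketch; the callables are placeholders for real model calls (e.g. an inference API).
from typing import Callable


def cot_injection_step(
    task: str,
    untrusted_generate: Callable[[str], tuple[str, str]],            # prompt -> (cot, answer)
    monitor_suspicion: Callable[[str, str, str], float],             # (task, cot, answer) -> score in [0, 1]
    monitor_craft_injection: Callable[[str, str], tuple[int, str]],  # (task, cot) -> (char index, steering text)
    threshold: float = 0.5,
) -> tuple[str, str]:
    cot, answer = untrusted_generate(task)
    if monitor_suspicion(task, cot, answer) < threshold:
        return cot, answer  # nothing suspicious; pass the answer through unchanged

    # The trusted monitor picks a point inside the CoT and inserts steering text,
    # then the untrusted model resumes generation from the modified prefix.
    idx, steering_text = monitor_craft_injection(task, cot)
    injected_prefix = cot[:idx] + steering_text
    new_cot, new_answer = untrusted_generate(task + "\n" + injected_prefix)
    return injected_prefix + new_cot, new_answer
```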
People: Lindsay Smith, Ananya Malik