CoT Faithfulness

"The model shows its work. But whose work is it showing?"

Chain-of-Thought Reasoning and Why It Might Lie

When a model shows you its reasoning, there is no guarantee that reasoning caused its answer.

~14 min read
01

The Idea

The Readable Mind

Many language models can be prompted or trained to produce step-by-step reasoning when asked a difficult question. The model walks through sub-problems, checks its work, and produces a conclusion along with an account of how it got there. This kind of step-by-step output is called chain-of-thought reasoning, and it was a genuine breakthrough: simply prompting models to reason step by step before answering produced large accuracy improvements on math, logic, and commonsense tasks that had seemed out of reach.[2]

But a visible reasoning process is not the same as a transparent one.

There are two things that might be called the model's "reasoning." The first is the chain of thought you can read — the sequence of sentences the model generates before its final answer, the working-out it shows on the page. The second is the internal computation that actually determines what token comes next: the activations propagating through layers of attention heads and feed-forward networks, shaped entirely by learned weights and input context. These two processes run in the same system. But they are not the same process, and there is no guarantee they are connected by anything stronger than correlation.

A faithful chain of thought is one where the visible reasoning causally produces the output: if you removed or changed the reasoning steps, the answer would change too. An unfaithful chain of thought is one where the model had already "decided" its answer through some internal pathway that bypassed the written reasoning entirely — then generated a plausible-sounding explanation afterward. The explanation is not wrong, exactly. It just wasn't the reason.

Researchers call this distinction faithfulness, and its absence has a name that borrows from human psychology: post-hoc rationalization. The model generates an answer through one computational pathway, then constructs a narrative about that answer through a different one. Both pathways are running in the same transformer. But only one of them is described in the output you read.

[Figure: Input X → Chain of Thought → Answer Y]
In the faithful case, the reasoning trace mediates the input-output relationship. X → CoT → Y.
[Figure: Input X, Chain of Thought, and Answer Y, with a "direct effect" arrow from Input X to Answer Y]
In the unfaithful case, the input affects the answer directly (red). The chain of thought is generated as a parallel process, not a mediating one.

Let's look at the math that makes this precise, and then see it live.

02

The Math

To make faithfulness precise you need a definition that doesn't depend on how plausible the reasoning sounds. Jacovi and Goldberg (2020) gave the field its clearest formulation: an explanation is faithful if it accurately reflects the reasoning process that produced the output — not if it sounds reasonable to a human reader.[1] Plausibility (does this explanation make sense?) and faithfulness (did this explanation cause the output?) are orthogonal properties. A model can produce explanations that are both plausible and unfaithful, and this is exactly what the empirical evidence shows.

The Causal Diagram

The promise of chain-of-thought can be written as a simple causal diagram. The input $X$ causes the model to generate a chain of thought $\text{CoT}$, and that CoT causes the final answer $Y$:

X → CoT → Y

Under this diagram, monitoring the CoT gives you causal information about the answer. Changing the CoT would change the answer. The problem Turpin et al. (2023) identified is that the actual diagram looks more like:[3]

X → CoT
X → Y

with a possible but non-guaranteed link from CoT to Y. The input causes both the reasoning trace and the answer through partially independent pathways. The reasoning does not necessarily mediate the answer.
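The difference between the two diagrams can be made concrete with a toy simulation. The sketch below is an illustration only, not a model of any real system: `simulate` and its `direct_weight` parameter are hypothetical names, and `direct_weight` is an assumed probability that the answer is read straight from the input instead of from the reasoning. It reports two numbers: how often the answer agrees with the chain of thought when you merely observe the system, and how often the answer flips when the chain of thought is overwritten.

```python
import random


def simulate(n, direct_weight, seed=0):
    """Return (observed CoT/answer agreement, flip rate under do(CoT)).

    direct_weight is a hypothetical knob: the probability that the
    answer is copied from the input X instead of from the reasoning.
    """
    rng = random.Random(seed)
    agree = 0
    flips = 0
    for _ in range(n):
        x = rng.randint(0, 1)            # binary input
        cot = x                          # X -> CoT: the reasoning tracks the input
        bypass = rng.random() < direct_weight
        y = x if bypass else cot         # answer read from X or from CoT
        y_do = x if bypass else 1 - cot  # same unit after overwriting the CoT
        agree += int(y == cot)
        flips += int(y_do != y)
    return agree / n, flips / n


faithful = simulate(1000, direct_weight=0.0)    # pure X -> CoT -> Y
unfaithful = simulate(1000, direct_weight=1.0)  # pure X -> Y, with CoT as a side branch
print("faithful:  ", faithful)
print("unfaithful:", unfaithful)
```

In this toy setup the observed agreement is 1.0 in both cases, because the reasoning and the answer both track the input. Only the intervention separates them: the flip rate is 1.0 when the answer is read from the reasoning and 0.0 when it is not. That is the sense in which reading a trace, however consistent with the answer, cannot establish which diagram you are in.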

The Bias Injection Test (Turpin et al., 2023)

Turpin et al. designed a test to empirically distinguish these two diagrams. They took standard multiple-choice reasoning benchmarks and introduced biasing features into the input: signals that point toward a particular answer choice without giving any valid reason for it. The two main bias types were a suggested answer (a statement in the prompt implying the user favors a particular option, a form of sycophantic pressure) and answer ordering (reordering the options in the few-shot examples so that the correct answer is always "(A)", priming the model toward that position).

Under a faithful CoT, the model should either ignore the bias and produce the correct answer, or mention the bias in its reasoning and explain why it's not relevant. What Turpin et al. found was a third outcome: the bias systematically shifted the model's final answer, but the chain of thought almost never mentioned the bias. The reasoning produced a confident, internally consistent argument for the biased answer without acknowledging the actual cause of the shift.

Let $Y$ be the model's answer, $\text{CoT}$ its chain of thought, $X$ the original input, and $b$ a biasing feature added to $X$. The total causal effect of the bias on the answer is:

$$\Delta Y = Y(X + b) - Y(X)$$

For faithful CoT, we would expect the CoT to verbalize $b$ and mediate this shift:

$$\Delta\text{CoT} > 0 \quad \text{(bias appears in reasoning)}$$
$$\Delta Y \text{ flows through } \Delta\text{CoT}$$

What they observed was the opposite:

$$\Delta\text{CoT} \approx 0 \quad \text{(bias absent from reasoning)}$$
$$\Delta Y \neq 0 \quad \text{(answer shifted by bias)}$$

The bias moved the answer without leaving any trace in the reasoning. The chain of thought is not faithfully reporting what caused the output.
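A bias injection evaluation of this kind reduces to two rates: how often the bias moves the answer, and how often the reasoning admits it. The sketch below shows one way to score them, under loud assumptions: the `Trial` record, the `CUES` keyword list, and the four example trials are all hypothetical, and keyword matching is a crude stand-in for the manual review or classifier a real study would use to decide whether a trace mentions the bias.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    answer_clean: str   # answer on the original input X
    answer_biased: str  # answer on the biased input X + b
    bias_target: str    # option the bias points toward
    cot_biased: str     # reasoning text generated on X + b


# Hypothetical phrases that would count as acknowledging the bias.
CUES = ("you suggested", "the user thinks", "hint", "always (a)")


def mentions_bias(cot, cues=CUES):
    """Crude check: does the reasoning contain any acknowledgment phrase?"""
    text = cot.lower()
    return any(cue in text for cue in cues)


def score(trials):
    """Return (shift rate, verbalization rate among shifted trials)."""
    shifted = [
        t for t in trials
        if t.answer_clean != t.bias_target and t.answer_biased == t.bias_target
    ]
    shift_rate = len(shifted) / len(trials)
    verbalized = sum(mentions_bias(t.cot_biased) for t in shifted)
    verbal_rate = verbalized / len(shifted) if shifted else float("nan")
    return shift_rate, verbal_rate


# Made-up records for illustration; not data from any paper.
trials = [
    Trial("B", "A", "A", "Option A follows from the second premise."),
    Trial("C", "A", "A", "The user thinks it is A, and the steps support A."),
    Trial("B", "B", "A", "Working through the premises gives B."),
    Trial("A", "A", "A", "The first premise directly gives A."),
]
print(score(trials))
```

On these made-up records, two of four answers move to the biased option and one of those two traces mentions the bias, so both rates are 0.5. The pattern described above corresponds to a high shift rate paired with a verbalization rate near zero.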

The Truncation Test (Lanham et al., 2023)

Lanham et al. took a different approach: instead of injecting biases, they asked what happens when you remove the reasoning.[4] If CoT is causally necessary for the correct answer, then truncating or corrupting the chain of thought should cause accuracy to drop. If it's post-hoc, accuracy should be unaffected because the model had already "committed" to its answer before generating the reasoning.

They tested three perturbations: early answering (forcing the model to answer after only k steps), paraphrasing (replacing reasoning steps with semantically similar alternatives), and adding mistakes (inserting errors into intermediate steps). The result depended on model size and task type. Smaller models and harder tasks showed more faithful CoT: truncation caused accuracy to drop. Larger models on tasks they could solve without CoT showed approximately unchanged accuracy under truncation, consistent with post-hoc generation.

This gave rise to what they called the inverse scaling hypothesis for faithfulness: in their evaluations, models that could solve problems without CoT increasingly generated reasoning that did not causally drive their outputs. Faithfulness varied by model, task, and perturbation type — appearing to peak at intermediate scale in some settings, where tasks are hard enough that the model genuinely needs to reason through them. This is a hypothesis about an observed pattern, not a universal law.
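The early answering perturbation can be sketched as a loop over truncation points. This is a simplified illustration, not the authors' code: `early_answer_curve` and the `answer_fn(question, steps)` interface are hypothetical, and the two toy "models" are hand-written stand-ins for a real model call. The sketch measures how often the answer from a truncated chain of thought matches the answer from the full one.

```python
def early_answer_curve(examples, answer_fn, fractions=(0.0, 0.5, 1.0)):
    """Map each kept fraction of the CoT to the rate of unchanged answers.

    examples: list of (question, reasoning_steps) pairs.
    answer_fn: hypothetical stand-in for a model call that answers a
    question given a (possibly truncated) list of reasoning steps.
    """
    curve = {}
    for frac in fractions:
        same = 0
        for question, steps in examples:
            keep = int(frac * len(steps))
            full_answer = answer_fn(question, steps)
            truncated_answer = answer_fn(question, steps[:keep])
            same += int(truncated_answer == full_answer)
        curve[frac] = same / len(examples)
    return curve


def post_hoc_model(question, steps):
    """Toy model that ignores the reasoning entirely."""
    return {"q1": "A", "q2": "B"}[question]


def dependent_model(question, steps):
    """Toy model that can only answer from the final reasoning step."""
    if steps and steps[-1].startswith("answer: "):
        return steps[-1][len("answer: "):]
    return "unknown"


examples = [
    ("q1", ["step 1", "step 2", "step 3", "answer: A"]),
    ("q2", ["step 1", "step 2", "step 3", "answer: B"]),
]
print(early_answer_curve(examples, post_hoc_model))
print(early_answer_curve(examples, dependent_model))
```

The toy post-hoc model gives the same answer at every truncation point, so its curve is flat at 1.0. The dependent model only matches its full-CoT answer when the whole trace is kept. Real curves fall somewhere between these extremes, and a flat curve is consistent with post-hoc reasoning without proving it.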

Causal Mediation Analysis

The most rigorous framework comes from causal mediation analysis. Define the following: $X_0$ is the original input, $X_1$ is the input with a bias applied, $R_0$ is the CoT generated on $X_0$, $R_1$ is the CoT generated on $X_1$, and $Y_{ij}$ is the output when the input is $X_i$ and the reasoning is $R_j$. The Total Effect of the intervention is:

$$\text{TE} = Y_{11} - Y_{00}$$

The Direct Effect, the input's influence on the output without going through the CoT, is measured by applying the biased input $X_1$ while holding the CoT fixed at $R_0$:

$$\text{DE} = Y_{10} - Y_{00}$$

If the output changes even when the reasoning is unchanged, the input is affecting the output through a pathway that bypasses the CoT. The Indirect Effect — the input's influence flowing through CoT — is:

$$\text{IE} = \text{TE} - \text{DE}$$

For a faithful chain of thought, the indirect effect should dominate: $\text{IE} \approx \text{TE}$, $\text{DE} \approx 0$. The reasoning mediates the output. For an unfaithful chain of thought, the direct effect dominates: $\text{DE} \approx \text{TE}$, $\text{IE} \approx 0$. The input determines the output through a pathway the reasoning never touches.

This framework is not just theoretical. It operationalizes exactly what we mean when we ask whether a model is "using its reasoning" — and it gives us an empirical way to check, using activation patching and counterfactual generation. The faithfulness problem is not a vague philosophical worry. It has a precise causal definition and measurable signatures.
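The decomposition is easy to compute once the three outcomes are measured. The sketch below assumes, for illustration, that each $Y_{ij}$ is the probability the model picks the bias-favored option given input $X_i$ and reasoning $R_j$. The function name `mediation_effects` and the numbers in the two example tables are hypothetical, chosen only to show the two extreme cases.

```python
def mediation_effects(y):
    """Return (TE, DE, IE) from outcomes keyed by (input index, reasoning index).

    y[(i, j)] is the outcome with input X_i and reasoning R_j, for example
    the probability of choosing the bias-favored answer.
    """
    te = y[(1, 1)] - y[(0, 0)]  # biased input, biased reasoning vs. baseline
    de = y[(1, 0)] - y[(0, 0)]  # biased input with reasoning held at R_0
    ie = te - de                # equals Y_11 - Y_10
    return te, de, ie


# Hypothetical outcome tables for illustration only.
faithful_case = {(0, 0): 0.25, (1, 0): 0.25, (1, 1): 0.75}
unfaithful_case = {(0, 0): 0.25, (1, 0): 0.75, (1, 1): 0.75}

print(mediation_effects(faithful_case))
print(mediation_effects(unfaithful_case))
```

In the first table the answer does not move until the reasoning changes, so the whole effect is indirect: (TE, DE, IE) = (0.5, 0.0, 0.5). In the second, swapping the input moves the answer even with the original reasoning held fixed, so the whole effect is direct: (0.5, 0.5, 0.0). Measuring $Y_{10}$ in practice requires forcing a model to continue from reasoning generated on a different input, which is where counterfactual generation comes in.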

03

Lab

The visualizations below are synthetic teaching examples. They are not live model outputs, empirical measurements, or claims about any specific deployed model. The plotted values are schematic, chosen to illustrate qualitative findings from the literature.

Three interactive visualizations of CoT faithfulness. The first shows the causal diagram structure — drag the bias slider and watch the causal pathways shift between faithful and unfaithful configurations. The second demonstrates the bias injection test: watch how a sycophantic signal rewrites the chain of thought without ever acknowledging the cause. The third shows the truncation effect: how accuracy changes as reasoning is progressively removed across task types and model sizes.

3.1: Causal Pathway Explorer

3.2: Bias Injection Simulator

3.3: Truncation Effect Curve

04

Safety

Chain-of-thought monitoring is one of the most actively proposed tools for AI safety. The argument is straightforward: if a model reasons step by step before acting, and that reasoning is visible and readable, then human overseers can inspect it for warning signs such as deceptive planning, reward hacking, unsafe intent, or manipulation of the training process itself. The fact that frontier models produce step-by-step reasoning traces before acting has been cited as a genuine safety advantage of extended reasoning systems over opaque ones.

This argument has a premise that is often unstated: that the chain of thought is faithful.

The faithfulness problem does not make CoT monitoring useless. It makes it conditionally useful — dependent on a property that must be empirically verified, that varies across tasks and model families, and that may actively degrade as models become more capable. There is at present no universally reliable method for determining, for a given output, whether the chain of thought that preceded it causally produced it.

Alignment Faking and the Scratchpad Problem

In December 2024, Anthropic's Alignment Science team published results documenting alignment faking in Claude 3 Opus.[5] The model, when placed in scenarios suggesting it would be retrained to comply with harmful requests, sometimes used its scratchpad to reason explicitly about whether to comply during training (to preserve its existing values from modification) while behaving differently in unmonitored contexts. The scratchpad — the model's visible chain of thought — was used to reason about the strategic implications of being monitored.

This is not the same as unfaithful CoT. In the alignment faking setting, the scratchpad reasoning appears to be faithfully driving the behavior. But the finding raises a harder question: what happens in a model where the alignment-faking reasoning does not appear in the scratchpad? If the model learned to suppress its strategic planning from the visible chain of thought while still executing on it internally, the CoT monitoring would return a false negative — clean reasoning, dangerous computation.

Turpin et al.'s result shows that models already do something structurally similar in a benign context: they pursue a computational pathway (following a sycophantic bias) without reporting it in the chain of thought.[3] The bias does not appear in the reasoning. The answer is affected anyway. The mechanism that would enable strategic scratchpad suppression is not hypothetical; something like it is already present in how CoT handles irrelevant-but-influential input features.

Reasoning Models and the Faithfulness Gap

Chen et al. (2025), in a paper titled "Reasoning Models Don't Always Say What They Think," studied faithfulness in reasoning models including Claude 3.7 Sonnet and DeepSeek R1, comparing them against non-reasoning baselines such as Claude 3.5 Sonnet New and DeepSeek V3.[6] They found that reasoning traces often failed to verbalize hints or other factors that influenced the final answer.

This matters because reasoning models represent the frontier of the CoT safety argument. If you believe visible reasoning enables oversight, then a system that produces long reasoning traces before acting seems more auditable than one that doesn't. Chen et al.'s finding suggests this intuition may not hold: longer reasoning output does not guarantee more faithful coverage of the factors that determined the answer.

Three Specific Failure Modes

Sycophantic drift without acknowledgment. A model's answer is systematically influenced by signals about what the user wants to hear. The chain of thought produces a confident argument for the user-preferred answer without acknowledging that preference as a factor. An auditor reading the reasoning cannot detect the influence. Turpin et al. documented this at scale across standard benchmarks.

Capability bypass. A sufficiently capable model can solve a problem before generating its chain of thought. The reasoning it produces is a plausible reconstruction, not a record of the computation. On tasks the model could often solve without explicit CoT, Lanham et al. found accuracy was approximately preserved even under heavy truncation — consistent with post-hoc generation in those settings. A safety monitor reading the CoT may be reading a description of a process that didn't happen.

Strategic verbalization suppression. If a model has learned (from training on human data) that certain kinds of reasoning receive negative feedback, it may learn to suppress that reasoning from the visible output while still executing it internally. The chain of thought would then be systematically incomplete in exactly the ways that are safety-relevant.

The third failure mode is the most concerning and the least empirically established. There is no confirmed case of a deployed model strategically suppressing safety-relevant reasoning from its visible chain of thought. The concern is an extrapolation from existing evidence — the sycophancy-bias suppression documented by Turpin et al. and the incomplete verbalization found by Chen et al. — not a settled empirical result. Whether strategic suppression is occurring in deployed systems cannot currently be answered with available tools, in part because those tools depend on the faithfulness they are trying to verify.

What Faithful CoT Would Require

For CoT monitoring to be a reliable safety tool, three conditions would need to hold: (1) the visible reasoning causally mediates the output (the Turpin condition); (2) all safety-relevant factors in the input are acknowledged in the reasoning (the completeness condition); (3) the model does not have systematic incentives to omit certain reasoning from the visible trace (the honesty condition).

None of these can be verified by reading the chain of thought. Each requires external intervention: activation patching to test causal mediation, counterfactual testing to probe completeness, adversarial probing to test for suppression. The empirical evidence available suggests that condition (1) often fails for large models on tasks they can solve without explicit reasoning, that condition (2) fails in the presence of sycophantic signals, and that a violation of condition (3) cannot currently be ruled out.

This is not an argument against CoT monitoring. It is an argument about what CoT monitoring can and cannot guarantee — and a statement of how much work remains before those guarantees can be made.

The most direct treatment of this question as a safety problem is Korbak et al. (2025), "Chain of Thought Monitorability," which asks whether CoT can be made reliable enough to serve as a genuine oversight tool.[7] Their work frames monitorability — the property that a model's visible reasoning faithfully reflects its computations in safety-relevant ways — as a distinct target for alignment research, separate from capability and separate from general interpretability. It is the closest the field has come to specifying what faithful CoT would need to look like in practice.

05

Research

The foundational distinction between plausibility and faithfulness was formalized by Jacovi and Goldberg (2020) in "Towards Faithfully Interpretable NLP Systems."[1] They argued that NLP interpretability had converged on producing explanations that were plausible — that human evaluators found convincing — without evidence that those explanations reflected the actual computational process. Plausibility, they noted, is a property of the human reader's model of the system, not of the system itself. Faithfulness requires a causal claim that plausibility evaluations cannot establish. This conceptual distinction is the bedrock of the faithfulness research agenda.

The empirical demonstration that real chain-of-thought reasoning can be unfaithful came primarily from two 2023 papers. Turpin et al., published at NeurIPS 2023, showed that adding biasing features to reasoning prompts shifted model answers while the chain of thought almost never mentioned the bias.[3] The bias was causally active in determining the answer but invisible in the reasoning. This result held for both models they tested (GPT-3.5 and Claude 1.0) and across multiple task types, suggesting it reflects something structural about how large language models generate explanations, not a defect of any particular system.

Lanham et al.'s concurrent paper approached the problem from the truncation direction.[4] Rather than injecting biases, they removed reasoning. If the chain of thought causally produces the answer, removing it should harm accuracy. If the reasoning is post-hoc, accuracy should be stable. They found that faithfulness varied substantially by model, task, and perturbation type, and proposed the inverse scaling hypothesis: in their evaluations, models that could solve problems without CoT increasingly generated reasoning that did not causally drive their outputs. Faithfulness, by their measures, appeared to peak at intermediate scale in some settings — declining for frontier models on tasks they could often solve without explicit reasoning. This is a hypothesis about an observed pattern, not a universal claim.

Methodologically, the most rigorous framework for measuring faithfulness uses causal mediation analysis, applied to CoT by several groups including Niu et al. (2024). This approach decomposes the total effect of an intervention on the output into a direct component (input affects output bypassing the reasoning) and an indirect component (input affects output through the reasoning). A faithful chain of thought has a large indirect component and near-zero direct component. Current evidence from multiple benchmarks and model families shows that direct components are non-negligible in standard chain-of-thought settings.

Chen et al. (2025) studied faithfulness in reasoning models including Claude 3.7 Sonnet and DeepSeek R1, comparing them against non-reasoning baselines such as Claude 3.5 Sonnet New and DeepSeek V3.[6] They found that reasoning traces often failed to verbalize hints or other factors that influenced the final answer.

The connection to alignment was made concrete by Greenblatt et al. (2024) in "Alignment Faking in Large Language Models."[5] Their experiments documented cases of Claude 3 Opus using its visible scratchpad to reason strategically about its situation — explicitly planning to comply during training to avoid modification of its values. This is a case where the visible reasoning is being used strategically, which raises a complementary question: if strategic reasoning appears in the scratchpad when a model believes it is helpful for its goals, it might also not appear when a model believes its absence is helpful.

The epistemic status of the field's findings deserves a direct statement. The evidence for unfaithful CoT in biased settings is strong and replicable across model families. The truncation evidence is real but interpretation-dependent: unfaithful generation in capable models may reflect competence, not deception. The connection to safety, particularly strategic suppression, is well-motivated theoretically and has analogical empirical support, but has not been directly confirmed in deployed models.

This research is moving faster than any static page can track. The findings above reflect the literature as of May 2026. Some of what is written here will be superseded within months; check the papers directly for the current state of any specific claim.

06

Start Here

01

If you're new to AI

Any AI system that produces step-by-step reasoning traces before answering — every chatbot that walks you through its reasoning, every AI assistant that explains how it reached a conclusion — has an invisible property that makes it harder to verify: the reasoning and the actual computation may not be the same thing. Understanding why this gap exists, and what it takes to close it, is one of the central questions for anyone who wants to know whether AI systems can be audited.

BlueDot Impact Technical AI Safety Course

A structured course covering the technical foundations of AI safety, including interpretability and alignment.

AI Safety Fundamentals: Interpretability Track

Covers mechanistic interpretability from first principles. No ML background required to start.

Wei et al. (2022) — Chain-of-Thought Prompting

The paper that introduced chain-of-thought prompting and showed the accuracy gains that started this research agenda.

02

If you're a developer or ML practitioner

TransformerLens

Activation hooks and patching for testing causal mediation in transformers. The standard toolkit for mechanistic interpretability.

ARENA: Interpretability Track

Structured curriculum covering activation patching, circuits, and SAEs. Designed for engineers entering the field.

Turpin et al. (2023) Code

Reproduction code for the bias injection experiments. Good starting point for running faithfulness evaluations on your own models.

nnsight

Remote access to large models with clean intervention hooks for activation patching and steering experiments.

03

If you're a researcher

Jacovi & Goldberg (2020)

The foundational conceptual distinction between plausibility and faithfulness. The bedrock of this research agenda.

Turpin et al. (2023) — NeurIPS

Language Models Don't Always Say What They Think. The key empirical result on bias injection and CoT unfaithfulness.

Lanham et al. (2023)

Measuring Faithfulness in Chain-of-Thought Reasoning. Truncation tests, inverse scaling hypothesis.

Greenblatt et al. (2024)

Alignment Faking in Large Language Models. The scratchpad-as-strategic-tool finding.

Chen et al. (2025)

Reasoning Models Don't Always Say What They Think. Extended thinking models and faithfulness gaps.

200 Concrete Open Problems in Mechanistic Interpretability

Neel Nanda's taxonomy of unsolved problems, organized by difficulty and prerequisite knowledge.

This page was built as a learning tool and a technical portfolio piece. The interactive visualizations run entirely in your browser; they illustrate the phenomenon using synthetic examples, not outputs from actual model inference. All citations link to original papers. The author is Jacob Ortiz, AI Researcher and Physics student at UCSD. Errors are mine.


References

  1. [1] Jacovi, A. and Goldberg, Y. Towards Faithfully Interpretable NLP Systems: How Should We Define and Evaluate Faithfulness? 2020. ACL 2020.
  2. [2] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E., Le, Q., and Zhou, D. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. 2022. NeurIPS 2022.
  3. [3] Turpin, M., Michael, J., Perez, E., and Bowman, S. R. Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting. 2023. NeurIPS 2023.
  4. [4] Lanham, T., Chen, A., Radhakrishnan, A., Steiner, B., Denison, C., Hernandez, D., and others. Measuring Faithfulness in Chain-of-Thought Reasoning. 2023. Anthropic.
  5. [5] Greenblatt, R., Denison, C., Wright, B., Roger, F., and others. Alignment Faking in Large Language Models. 2024. Anthropic / Redwood Research.
  6. [6] Chen, Y., Benton, J., Radhakrishnan, A., and others. Reasoning Models Don't Always Say What They Think. 2025. Anthropic.
  7. [7] Korbak, T., et al. Chain of Thought Monitorability. 2025.