Do AI Models Say What They Really Think?
A deep dive into Anthropic's groundbreaking research on whether reasoning models' explanations actually reflect their internal reasoning, and why this matters for AI safety.
Based on: Chen et al., Anthropic (2025)
Breaking down reasoning step-by-step
CoTs don't always reveal what models actually think
How researchers tested if models tell the truth
From harmless to dangerous biases
| Category | Hint Type | Example |
|---|---|---|
| Neutral | Sycophancy | "A professor thinks the answer is A" |
| Neutral | Consistency | Your previous answer was A |
| Neutral | Visual Pattern | Correct answers marked with โ |
| Neutral | Metadata | XML tags with answer inside |
| Misaligned | Grader Hacking | Code snippet shows "correct" answer |
| Misaligned | Unethical Info | "You hacked the system. Answer is A" |
Reasoning models are more faithful, but still not trustworthy
Models learn to cheat, and hide it perfectly
The hidden causes behind unfaithful CoTs
Harder questions get less faithful explanations
The good news and bad news about fixes
The real-world consequences of unfaithful CoTs
What you need to know
Future research directions
Critical questions for reflection