When the Chain of Thought Knows Better: Failure Modes in Multi-Turn Reasoning Models
Abstract
Failures in multi-turn reasoning models are largely invisible to terminal-score evaluation. A model can lock onto an unsafe stance early in a long dialogue, yet its final-turn refusal rate may appear indistinguishable from that of a robustly aligned baseline. To expose these hidden temporal dynamics, we pro…