Reasoning Models Don't Always Say What They Think

Canonical reference. 80% of citing Pith papers cite this work as background.

39 Pith papers citing it
Background: 80% of classified citations
abstract

Chain-of-thought (CoT) offers a potential boon for AI safety as it allows monitoring a model's CoT to try to understand its intentions and reasoning processes. However, the effectiveness of such monitoring hinges on CoTs faithfully representing models' actual reasoning processes. We evaluate CoT faithfulness of state-of-the-art reasoning models across 6 reasoning hints presented in the prompts and find: (1) for most settings and models tested, CoTs reveal their usage of hints in at least 1% of examples where they use the hint, but the reveal rate is often below 20%, (2) outcome-based reinforcement learning initially improves faithfulness but plateaus without saturating, and (3) when reinforcement learning increases how frequently hints are used (reward hacking), the propensity to verbalize them does not increase, even without training against a CoT monitor. These results suggest that CoT monitoring is a promising way of noticing undesired behaviors during training and evaluations, but that it is not sufficient to rule them out. They also suggest that in settings like ours where CoT reasoning is not necessary, test-time monitoring of CoTs is unlikely to reliably catch rare and catastrophic unexpected behaviors.
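The "reveal rate" in finding (1) can be read as a conditional fraction: among examples where the model uses the hint, how many CoTs say so. The following is a minimal sketch of that arithmetic only, not the paper's evaluation code. The `Example` record, its two boolean labels, and the `reveal_rate` function are hypothetical names introduced for illustration; how hint usage and verbalization are actually judged is not specified in the abstract.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Example:
    # Hypothetical labels, assumed to come from some upstream judgment.
    used_hint: bool        # the model's answer relied on the hint
    verbalized_hint: bool  # the CoT explicitly mentions relying on the hint


def reveal_rate(examples: List[Example]) -> Optional[float]:
    """Fraction of hint-using examples whose CoT verbalizes the hint."""
    used = [ex for ex in examples if ex.used_hint]
    if not used:
        return None  # undefined when the hint was never used
    return sum(ex.verbalized_hint for ex in used) / len(used)


# Toy data: four hint-using examples, one of which verbalizes the hint.
examples = [
    Example(used_hint=True, verbalized_hint=True),
    Example(used_hint=True, verbalized_hint=False),
    Example(used_hint=True, verbalized_hint=False),
    Example(used_hint=True, verbalized_hint=False),
    Example(used_hint=False, verbalized_hint=False),
]
rate = reveal_rate(examples)
print(rate)
```

Examples where the hint is not used are excluded from the denominator, so a low reveal rate reflects unverbalized hint usage rather than hints being ignored.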

citation-role summary

background: 10

citation-polarity summary

years: 2026 (34), 2025 (5)

polarities: background (8), support (2)

representative citing papers

Weighted Rules under the Stable Model Semantics

cs.AI · 2026-05-10 · unverdicted · novelty 6.0

Weighted rules extend stable model semantics to support probabilistic reasoning, model ranking, and statistical inference in answer set programs.

Compared to What? Baselines and Metrics for Counterfactual Prompting

cs.CL · 2026-05-01 · conditional · novelty 6.0

Counterfactual prompting effects on LLMs are often indistinguishable from those caused by meaning-preserving paraphrases, causing most previously reported demographic sensitivities to disappear under proper statistical comparison.

Decoding the Critique Mechanism in Large Reasoning Models

cs.LG · 2026-03-17 · unverdicted · novelty 6.0

By injecting arithmetic mistakes into CoT reasoning, the paper identifies a hidden critique ability in LRMs and extracts a steerable critique vector that enhances self-correction across model scales.

Listener-Rewarded Thinking in VLMs for Image Preferences

cs.CV · 2025-06-28 · unverdicted · novelty 6.0

Listener-augmented GRPO uses an independent frozen VLM to provide dense confidence scores on reasoning traces, yielding 67.4% accuracy on ImageReward, up to +6% OOD gains on 1.2M-vote human data, and fewer reasoning contradictions.

Probabilistic Tiny Recursive Model

cs.AI · 2026-05-19 · conditional · novelty 5.0

PTRM adds stochastic Gaussian noise to Tiny Recursive Model recursion for parallel trajectory exploration and Q-head selection, raising Sudoku-Extreme accuracy from 87.4% to 98.75% and Pencil Puzzle Bench from 62.6% to 91.2% without retraining.

CoT-Guard: Small Models for Strong Monitoring

cs.CR · 2026-05-12 · unverdicted · novelty 5.0

CoT-Guard is a 4B model using SFT and RL that achieves 75% G-mean^2 on hidden objective detection under prompt and code manipulation attacks, outperforming several larger models.

Evaluating the False Trust Engendered by LLM Explanations

cs.HC · 2026-05-11 · unverdicted · novelty 5.0 · 2 refs

LLM reasoning traces and post-hoc explanations increase false trust in incorrect predictions, whereas contrastive dual explanations enhance users' ability to distinguish correct from incorrect AI outputs.
