arxiv: 2505.13775 · v4 · pith:4STNHU57new · submitted 2025-05-19 · 💻 cs.LG · cs.AI

Beyond Semantics: The Unreasonable Effectiveness of Reasonless Intermediate Tokens

classification 💻 cs.LG cs.AI
keywords reasoning, traces, models, correct, intermediate, semantics, them, they

Recent impressive results from large reasoning models have been interpreted as a triumph of Chain of Thought (CoT), and especially of the process of training on CoTs sampled from base LLMs in order to help find new reasoning patterns. While these traces certainly seem to help model performance, it is not clear how they influence it, with some works ascribing semantics to them and others cautioning against relying on them as transparent and faithful proxies of the model's internal computational process. To systematically investigate the role of end-user semantics of derivational traces, we set up a controlled study where we train transformer models from scratch on formally verifiable reasoning traces and the solutions they lead to. We notice that, despite gains over the solution-only baseline, models trained on entirely correct traces can still produce invalid reasoning traces even when arriving at correct solutions. More interestingly, our experiments also show that models trained on corrupted traces, whose intermediate reasoning steps bear no relation to the problem they accompany, perform similarly to those trained on correct ones, and even generalize better on out-of-distribution tasks. We also study the effect of GRPO-based RL post-training on trace validity, noting that while solution accuracy increases, this is not accompanied by improvements in trace validity. Finally, we examine whether reasoning-trace length reflects inference-time scaling and find that trace length is largely agnostic to the underlying computational complexity of the problem being solved. These results challenge the assumption that intermediate tokens or "Chains of Thought" reflect or induce predictable reasoning behaviors and caution against anthropomorphizing such outputs or over-interpreting them (despite their mostly correct-seeming forms) as evidence of human-like or algorithmic behaviors in language models.
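
As a rough illustration of what "corrupted traces" could mean in such a study, the sketch below pairs each problem with a reasoning trace taken from a different problem while leaving the solution untouched. The abstract does not say how the corruption was performed, so the swapping scheme, the `Example` record, and the `corrupt_traces` helper are all assumptions made for illustration, not the authors' procedure.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    # Hypothetical record layout: one training example per problem.
    problem: str
    trace: str
    solution: str


def corrupt_traces(examples, seed=0):
    """Give every problem a trace from a different problem; keep solutions.

    Assumption: corruption is modelled as swapping traces between examples.
    """
    n = len(examples)
    if n < 2:
        raise ValueError("need at least two examples to swap traces")
    rng = random.Random(seed)
    order = list(range(n))
    rng.shuffle(order)
    # Rotate along a shuffled cycle: index order[i] takes the trace of
    # order[i + 1]. A single n-cycle has no fixed points, so no example
    # keeps its own trace.
    donor = {order[i]: order[(i + 1) % n] for i in range(n)}
    return [
        Example(ex.problem, examples[donor[i]].trace, ex.solution)
        for i, ex in enumerate(examples)
    ]
```

Under this assumption, a model trained on the output of `corrupt_traces` still sees well-formed traces and correct solutions, but the trace carries no information about the problem it accompanies, which is the condition the abstract compares against training on correct traces.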

Forward citations

Cited by 10 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Tracing Uncertainty in Language Model "Reasoning"

    cs.LG 2026-05 unverdicted novelty 7.0

    Uncertainty trace profiles from LM reasoning traces predict correct final answers with AUROC up to 0.807 and enable early error detection using only initial tokens.

  2. Generalization or Memorization? Brittleness Testing for Chess-Trained Language Models

    cs.AI 2026-05 conditional novelty 6.0

    A compact 25M chess move predictor exceeds larger fine-tuned models on puzzles, indicating memorization in earlier claims, while LLM-Modulo raises general LLM move accuracy from 1.2% to 21.2% and validity to 95.3%.

  3. Instructions Shape Production of Language, not Processing

    cs.CL 2026-05 unverdicted novelty 6.0

    Instructions trigger a production-centered mechanism in language models, with task-specific information stable in input tokens but varying strongly in output tokens and correlating with behavior.

  4. Evaluating the False Trust Engendered by LLM Explanations

    cs.HC 2026-05 unverdicted novelty 6.0

    A user study finds that LLM reasoning traces and post-hoc explanations create false trust by increasing acceptance of incorrect answers, whereas contrastive dual explanations improve users' ability to detect errors.

  5. Weighted Rules under the Stable Model Semantics

    cs.AI 2026-05 unverdicted novelty 6.0

    Weighted rules extend stable model semantics to support probabilistic reasoning, model ranking, and statistical inference in answer set programs.

  6. Process Reward Models Meet Planning: Generating Precise and Scalable Datasets for Step-Level Rewards

    cs.CL 2026-04 unverdicted novelty 6.0

    PDDL planning problems are used to generate about one million precise reasoning steps for training Process Reward Models, and adding this data to existing datasets improves LLM performance on both mathematical and non...

  7. Is Chain-of-Thought Reasoning of LLMs a Mirage? A Data Distribution Lens

    cs.AI 2025-08 unverdicted novelty 6.0

    CoT reasoning is a brittle mirage governed by distribution discrepancy between training and test data, demonstrated via controlled experiments in the new DataAlchemy environment.

  8. Instructions Shape Production of Language, not Processing

    cs.CL 2026-05 unverdicted novelty 5.0

    Instructions primarily shape the production stage of language models rather than the processing stage, with task-specific information and causal effects stronger in output tokens than input tokens.

  9. Evaluating the False Trust Engendered by LLM Explanations

    cs.HC 2026-05 unverdicted novelty 5.0

    LLM reasoning traces and post-hoc explanations increase false trust in incorrect predictions, whereas contrastive dual explanations enhance users' ability to distinguish correct from incorrect AI outputs.

  10. Implicature in Interaction: Understanding Implicature Improves Alignment in Human-LLM Interaction

    cs.CL 2025-10 unverdicted novelty 5.0

    Using prompts that incorporate implicature leads to responses that humans prefer 67.6% of the time over literal prompts, with larger models better at inferring intent.