Position: Stop Anthropomorphizing Intermediate Tokens as Reasoning/Thinking Traces!

Durgesh Kalwar; Karthik Valmeekam; Kaya Stechly; Lucas Saldyt; Siddhant Bhambri; Soumya Rani Samineni; Subbarao Kambhampati; Upasana Biswas; Vardhan Palod

arxiv: 2504.09762 · v4 · pith:PMC57UQEnew · submitted 2025-04-14 · 💻 cs.AI

classification 💻 cs.AI

keywords traces, intermediate, reasoning, thinking, tokens, anthropomorphization, anthropomorphizing, model

Intermediate token generation (ITG), where a model produces output before the solution, has become a standard method to improve the performance of language models on reasoning tasks. These intermediate tokens have been called "reasoning traces" or even "thinking traces", implicitly anthropomorphizing them and implying that they resemble the steps a human might take when solving a challenging problem, and as such can give the end user an interpretable window into the model's thinking process. In this position paper, we present evidence that this anthropomorphization is not a harmless metaphor and is instead quite dangerous: it confuses the nature of these models and how to use them effectively, and leads to questionable research. We call on the community to avoid such anthropomorphization of intermediate tokens.

Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

The Stepwise Informativeness Assumption: Why are Entropy Dynamics and Reasoning Correlated in LLMs?
cs.CL 2026-03 unverdicted novelty 7.0

The Stepwise Informativeness Assumption explains the correlation between LLM entropy dynamics and reasoning correctness by positing that correct traces accumulate answer-relevant information stepwise during generation.

What properties of reasoning supervision are associated with improved downstream model quality?
cs.AI 2026-05 unverdicted novelty 5.0

Intrinsic data metrics predict reasoning dataset utility for model fine-tuning, with different predictors working best for smaller versus larger models.

LLM Reasoning Is Latent, Not the Chain of Thought
cs.AI 2026-04 unverdicted novelty 5.0

LLM reasoning is primarily mediated by latent-state trajectories rather than by explicit surface chain-of-thought outputs.

Implicature in Interaction: Understanding Implicature Improves Alignment in Human-LLM Interaction
cs.CL 2025-10 unverdicted novelty 5.0

Using prompts that incorporate implicature leads to responses that humans prefer 67.6% of the time over literal prompts, with larger models better at inferring intent.

Temperature and Persona Shape LLM Agent Consensus With Minimal Accuracy Gains in Qualitative Coding
cs.CL 2025-07 unverdicted novelty 5.0

Temperature and persona variations shape consensus speed in LLM multi-agent coding but produce no robust accuracy gains over single agents on human-annotated tutoring transcripts.