Self-Reflective Generation at Test Time

Chengwei Qin; Jian Mu; Menglin Yang; Qixin Zhang; Shuang Qiu; Yao Shu; Zhiyong Wang; Zhongxiang Dai

arxiv: 2510.02919 · v2 · pith:RDOFTR3Anew · submitted 2025-10-03 · 💻 cs.CL

Jian Mu, Qixin Zhang, Zhiyong Wang, Menglin Yang, Shuang Qiu, Chengwei Qin, Zhongxiang Dai, Yao Shu

classification 💻 cs.CL

keywords generation, reasoning, srgen, token, self-reflection, self-reflective, errors, llms

Large language models (LLMs) increasingly solve complex reasoning tasks via long chain-of-thought, but their forward-only autoregressive generation process is fragile: early token errors can cascade, which creates a clear need for self-reflection mechanisms. However, existing self-reflection methods either revise full drafts or learn self-correction through expensive training; both approaches are fundamentally reactive and inefficient. To address this, we propose Self-Reflective Generation at Test Time (SRGen), a lightweight test-time framework that reflects before generating at uncertain points. During token generation, SRGen uses dynamic entropy thresholding to identify high-uncertainty tokens. For each identified token, it trains a token-specific corrective vector that fully exploits the already generated context to correct the token probability distribution through self-reflective generation. By retrospectively analyzing the partial output, this self-reflection enables more trustworthy decisions, thereby significantly reducing the probability of errors at highly uncertain points. Evaluated on challenging mathematical reasoning benchmarks and a diverse set of LLMs, SRGen can significantly strengthen model reasoning. Moreover, our findings position SRGen as a plug-and-play method that integrates reflection into the generation process for reliable LLM reasoning: it achieves consistent gains with bounded overhead and can be combined with other training-time (e.g., RLHF) and test-time (e.g., SLOT) techniques.
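
The abstract names dynamic entropy thresholding as the trigger for reflection but does not give the rule itself. The following is a minimal sketch of one plausible reading, not the authors' implementation: it flags a token when the entropy of its predicted distribution exceeds the mean of recent entropies by `k` standard deviations. The class name `DynamicEntropyMonitor` and the parameters `window`, `k`, and `min_history` are hypothetical, and the corrective-vector training step is not covered.

```python
import math
from collections import deque
from statistics import mean, pstdev


def token_entropy(logits):
    """Shannon entropy (in nats) of the softmax distribution over logits."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


class DynamicEntropyMonitor:
    """Hypothetical trigger: flag tokens whose entropy is unusually high
    relative to a sliding window of recent token entropies.

    ASSUMPTION: the mean + k * std rule is illustrative only; the source
    abstract does not specify how the dynamic threshold is computed.
    """

    def __init__(self, window=16, k=2.0, min_history=4):
        self.history = deque(maxlen=window)
        self.k = k
        self.min_history = min_history

    def is_uncertain(self, logits):
        h = token_entropy(logits)
        flagged = False
        # Only start flagging once enough history exists for a baseline.
        if len(self.history) >= self.min_history:
            mu = mean(self.history)
            sigma = pstdev(self.history)
            flagged = h > mu + self.k * sigma
        self.history.append(h)
        return flagged


if __name__ == "__main__":
    monitor = DynamicEntropyMonitor()
    confident = [10.0, 0.0, 0.0, 0.0]  # sharply peaked distribution
    uniform = [0.0, 0.0, 0.0, 0.0]     # maximally uncertain distribution
    for step, logits in enumerate([confident] * 5 + [uniform]):
        print(step, monitor.is_uncertain(logits))
```

In this toy run only the final, uniform step is flagged. In a real decoding loop, a flagged step is where a method like SRGen would pause and apply its reflection procedure before committing to a token.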

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

From Exposure to Internalization: Dual-Stream Calibration for In-context Clinical Reasoning
q-bio.QM · 2026-04 · unverdicted · novelty 5.0

Dual-Stream Calibration uses entropy minimization and iterative meta-learning at test time to internalize clinical evidence and outperform standard in-context learning baselines on medical tasks.