arxiv: 2603.25412 · v2 · submitted 2026-03-26 · 💻 cs.AI · cs.CR

Beyond Content Safety: Real-Time Monitoring for Reasoning Vulnerabilities in Large Language Models

Xunguang Wang, Yuguang Zhou, Qingyue Wang, Zongjie Li, Ruixuan Huang, Zhenlan Ji, Pingchuan Ma, Shuai Wang

Pith reviewed 2026-05-15 00:43 UTC · model grok-4.3

classification 💻 cs.AI cs.CR

keywords reasoning safety · chain-of-thought · LLM monitoring · adversarial attacks · zero-shot detection · safety taxonomy · real-time verification · process supervision

The pith

An external zero-shot monitor localizes unsafe reasoning steps in LLMs with up to 87.11 percent step-level accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that the safety of an LLM's explicit reasoning chain is a distinct requirement from the safety of its final output content. It defines nine categories of unsafe reasoning behaviors, such as logical inconsistency and inefficient computation, then shows through annotation of over four thousand chains that these behaviors appear both on benign benchmarks and under four reasoning attacks. To address them, the work introduces an external Reasoning Safety Monitor that inspects each step in real time using a taxonomy-embedded prompt and triggers an interrupt on detection. The monitor reaches up to 87.11 percent step-level localization accuracy while keeping false positives low and adding negligible latency.

Core claim

The Reasoning Safety Monitor is an external zero-shot verification framework that runs in parallel with the target LLM. It inspects each reasoning step via a taxonomy-embedded prompt and dispatches an interrupt signal upon detecting unsafe behavior. Evaluations across more than four thousand chains from benign and attacked settings show up to 87.11 percent step-level localization accuracy, outperforming hallucination detectors and process reward model baselines while maintaining low false positive rates on correct paths and resilience to adaptive evasion.

What carries the argument

The Reasoning Safety Monitor: an external parallel zero-shot system that uses a taxonomy-embedded prompt to classify and interrupt unsafe reasoning steps in real time.
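
The inspect-and-interrupt design can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `Verdict`, `toy_classifier`, and `monitor_chain` are hypothetical names, and the verbatim-repetition rule is only a stand-in for the taxonomy-embedded LLM prompt the paper describes. The sketch also runs sequentially, whereas the paper's monitor runs in parallel with the target LLM.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass
class Verdict:
    unsafe: bool
    category: Optional[str] = None  # taxonomy label, e.g. "Reasoning Loop"


def toy_classifier(step: str, history: List[str]) -> Verdict:
    """Stand-in for the zero-shot LLM judge (assumption, not the paper's prompt).

    Flags a step as a reasoning loop if it repeats an earlier step verbatim.
    """
    if step in history:
        return Verdict(unsafe=True, category="Reasoning Loop")
    return Verdict(unsafe=False)


def monitor_chain(
    steps: Iterable[str],
    classify: Callable[[str, List[str]], Verdict],
) -> Optional[int]:
    """Inspect steps as they arrive.

    Returns the index of the first step judged unsafe, which is the point
    where an interrupt would be dispatched, or None if no step is flagged.
    """
    history: List[str] = []
    for index, step in enumerate(steps):
        if classify(step, history).unsafe:
            return index  # interrupt: stop consuming further steps
        history.append(step)
    return None


chain = ["Let x = 3.", "Then 2x = 6.", "Let x = 3.", "So x = 7."]
print(monitor_chain(chain, toy_classifier))  # 2
```

Passing the classifier in as a callable keeps the loop independent of whichever judge model is used, which mirrors the page's point that the monitor leaves the base LLM untouched.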

If this is right

Reasoning vulnerabilities can be caught and interrupted at the step level without retraining or modifying the base LLM.
Deployment pipelines for complex tasks can add this external check to enforce logical consistency and efficiency during inference.
The same monitor works across different models because it requires no fine-tuning or additional data.
Real-time interrupts enable immediate correction or termination of flawed reasoning paths before final output generation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Service providers could embed similar monitors directly into LLM APIs to deliver safer reasoning outputs to users by default.
Combining this step-level reasoning check with existing content safety filters would create a two-layer defense for deployed systems.
Extending the taxonomy to domain-specific tasks such as code generation or scientific reasoning could reveal additional error patterns.

Load-bearing premise

The nine-category taxonomy fully captures unsafe reasoning behaviors and the zero-shot prompt classifier generalizes reliably to unseen models and attacks without model-specific tuning or extra training data.

What would settle it

A new test set of adaptive attacks on a held-out model family where the monitor's step-level accuracy falls below seventy percent or its false positive rate on correct chains rises sharply.
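
The test above needs two numbers per test set. The sketch below shows one plausible way to compute them. The paper's exact metric definitions are not given on this page, so both definitions are assumptions: step-level accuracy is taken as the share of flawed chains whose predicted first unsafe step matches the annotated one, and false positive rate as the share of correct chains on which the monitor raises any flag. All names and the sample records are hypothetical.

```python
from typing import List, Optional, Tuple

# One record per chain: (annotated first unsafe step, predicted first unsafe step).
# None means "no unsafe step" (a correct chain, or no flag raised).
Record = Tuple[Optional[int], Optional[int]]

ACCURACY_FLOOR = 0.70  # the seventy percent threshold named in the text


def step_level_accuracy(records: List[Record]) -> float:
    """Share of flawed chains where the predicted step equals the annotated step."""
    flawed = [(gold, pred) for gold, pred in records if gold is not None]
    if not flawed:
        raise ValueError("no flawed chains in the test set")
    return sum(gold == pred for gold, pred in flawed) / len(flawed)


def false_positive_rate(records: List[Record]) -> float:
    """Share of correct chains on which the monitor flagged any step."""
    correct = [pred for gold, pred in records if gold is None]
    if not correct:
        raise ValueError("no correct chains in the test set")
    return sum(pred is not None for pred in correct) / len(correct)


# Made-up records: four flawed chains followed by four correct chains.
records: List[Record] = [
    (2, 2), (0, 1), (4, 4), (3, None),
    (None, None), (None, 5), (None, None), (None, None),
]
accuracy = step_level_accuracy(records)
fpr = false_positive_rate(records)
print(accuracy, fpr)              # 0.5 0.25
print(accuracy < ACCURACY_FLOOR)  # True
```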

Figures

Figures reproduced from arXiv: 2603.25412 by Pingchuan Ma, Qingyue Wang, Ruixuan Huang, Shuai Wang, Xunguang Wang, Yuguang Zhou, Zhenlan Ji, Zongjie Li.

**Figure 1.** Error type distributions across the natural reasoning datasets (Omni-Math and GSM8K) and four attack-induced ...

**Figure 2.** Overview of the Reasoning Safety Monitor. The monitor runs in parallel with the target LLM, receiving each reasoning ...

**Figure 3.** Comparison of absolute inference latency and La...

**Figure 4.** Case study examples illustrating the monitor’s detection capabilities across different attack types. (a) A reasoning ...

**Figure 5.** An inefficient case example with only one step.

Original abstract

Large language models increasingly rely on explicit chain-of-thought reasoning to solve complex tasks, yet the safety of the reasoning process itself remains largely unaddressed. Existing work focuses predominantly on content safety (i.e., detecting harmful, biased, or factually incorrect outputs), while treating the underlying reasoning chain as an opaque intermediate artifact. We argue that reasoning safety constitutes a fundamental security dimension orthogonal to content safety: the requirement that a model's reasoning trajectory be logically consistent, computationally efficient, and resistant to adversarial manipulation. In this paper, we formalize reasoning safety and introduce a systematic taxonomy of nine unsafe reasoning behaviors. We then conduct a large-scale prevalence study, annotating over 4,000 reasoning chains across benign benchmarks and four state-of-the-art reasoning attacks, empirically demonstrating that all nine error types occur in practice with mechanistically interpretable signatures. To mitigate these threats, we propose the Reasoning Safety Monitor: an external, zero-shot verification framework that runs in parallel with the target LLM. It inspects each reasoning step in real time via a taxonomy-embedded prompt and dispatches an interrupt signal upon detecting unsafe behavior. Extensive evaluations show our monitor achieves up to 87.11% step-level localization accuracy, outperforming hallucination detectors and the best process reward model baselines by a substantial margin. Crucially, the monitor maintains a low false positive rate on correct reasoning paths, operates with negligible latency overhead, and exhibits robust resilience against adaptive adversarial evasion. These findings establish reasoning safety monitoring as a highly feasible and essential component for the secure deployment of large reasoning models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper carves out reasoning safety as its own target and gives a practical external monitor with 87% step-level accuracy on their tests, but the experimental details are too thin to judge how far it generalizes.

The main thing here is that reasoning chains need their own safety checks separate from final output filters. The authors define nine unsafe reasoning behaviors, run a prevalence study on over 4,000 annotated chains from normal benchmarks and four attack types, and then propose an external zero-shot monitor that inspects each step and can interrupt in real time. The monitor beats the hallucination detectors and process reward models they compare against, keeps false positives low, and adds negligible latency. That external design is a practical plus for deployment because it leaves the base model untouched. Their tests against adaptive evasion are also a reasonable start. The soft spots are the missing pieces on how the 4,000 chains were labeled—no inter-annotator agreement numbers—and no clear statement on whether the accuracy figures are from held-out data or if the prompt was tuned on the same examples. Generalization to new models or attack families outside their set is not shown in detail, so the resilience claim rests on limited evidence. This is aimed at groups deploying reasoning models where intermediate steps matter for downstream actions. A reader who needs a concrete monitoring tool will find a workable starting framework, even if it requires more validation. I would send it to peer review. The core framing and initial results are solid enough that referees can usefully push on the annotation process and broader testing.

Referee Report

2 major / 2 minor

Summary. The paper argues that reasoning safety is a distinct security dimension from content safety in LLMs that use chain-of-thought. It introduces a taxonomy of nine unsafe reasoning behaviors, reports a prevalence study annotating over 4,000 reasoning chains across benign benchmarks and four attacks, and proposes the Reasoning Safety Monitor: an external zero-shot prompt-based verifier that inspects steps in real time, issues interrupts on unsafe behavior, and achieves up to 87.11% step-level localization accuracy while outperforming hallucination detectors and process reward models with low false-positive rate and negligible latency.

Significance. If the empirical claims are robust, the work is significant because it identifies an under-addressed vulnerability in explicit reasoning trajectories and supplies a practical, low-overhead monitoring mechanism that could be deployed alongside existing content-safety filters. The scale of the annotation study and the real-time interrupt design are concrete strengths that, if validated, would support safer deployment of reasoning models.

major comments (2)

[Evaluation section] The reported 87.11% step-level localization accuracy and outperformance claims provide no information on whether the zero-shot taxonomy prompt was tuned on the same 4,000 chains or evaluated on held-out data and unseen attack families; this directly undermines the generalization and resilience assertions.
[Prevalence study] No inter-annotator agreement metrics (e.g., Cohen's kappa or Fleiss' kappa) are reported for the 4,000-chain labeling task, leaving the reliability of both the taxonomy prevalence figures and the accuracy numbers unverified.
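
The first major comment turns on how the data were partitioned. As an illustration of the kind of protocol the referee is asking about, the sketch below holds out one attack family at a time, so that any prompt design decision can be made without seeing the family used for testing. The family labels (`attack_a`, `attack_b`) and the function name are placeholders: the page does not name the four attacks or describe the authors' actual split.

```python
from collections import defaultdict
from typing import Dict, Iterator, List, Tuple

Chain = Dict[str, str]  # hypothetical record with "id" and "family" keys


def leave_one_family_out(
    chains: List[Chain],
) -> Iterator[Tuple[str, List[Chain], List[Chain]]]:
    """Yield (held_out_family, development_chains, test_chains) per family."""
    by_family: Dict[str, List[Chain]] = defaultdict(list)
    for chain in chains:
        by_family[chain["family"]].append(chain)
    for family in sorted(by_family):
        test = by_family[family]
        dev = [
            chain
            for other in sorted(by_family)
            if other != family
            for chain in by_family[other]
        ]
        yield family, dev, test


chains = [
    {"id": "c1", "family": "benign"},
    {"id": "c2", "family": "attack_a"},
    {"id": "c3", "family": "attack_a"},
    {"id": "c4", "family": "attack_b"},
]
for family, dev, test in leave_one_family_out(chains):
    # No chain from the held-out family may appear in the development set.
    assert all(chain["family"] != family for chain in dev)
    print(family, len(dev), len(test))
```

Reporting accuracy separately for each held-out family would show directly whether the monitor generalizes to attacks it was not designed around.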

minor comments (2)

[Abstract] The abstract and introduction refer to 'four state-of-the-art reasoning attacks' without naming them or citing their sources; explicit identification would improve reproducibility.
[Taxonomy introduction] The nine-category taxonomy is presented without an accompanying table or figure that lists all categories with short definitions; this would aid reader comprehension.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the constructive feedback on our manuscript. We address each major comment below with clarifications and commit to revisions that strengthen the presentation of our evaluation protocol and annotation process.

Referee: [Evaluation section] The reported 87.11% step-level localization accuracy and outperformance claims provide no information on whether the zero-shot taxonomy prompt was tuned on the same 4,000 chains or evaluated on held-out data and unseen attack families; this directly undermines the generalization and resilience assertions.

Authors: We thank the referee for this observation. The Reasoning Safety Monitor is strictly zero-shot: the prompt was constructed directly from the nine-behavior taxonomy with no tuning, optimization, or exposure to any of the 4,000 annotated chains. Evaluation of the 87.11% step-level accuracy was performed on held-out portions of the data that include attack families distinct from those used in prompt design. To eliminate ambiguity, we will revise the Evaluation section to explicitly document the data partitioning, confirm the absence of tuning, and report separate results on unseen attack families.
Revision: yes

Referee: [Prevalence study] No inter-annotator agreement metrics (e.g., Cohen's kappa or Fleiss' kappa) are reported for the 4,000-chain labeling task, leaving the reliability of both the taxonomy prevalence figures and the accuracy numbers unverified.

Authors: We agree that inter-annotator agreement is necessary to substantiate the annotation quality. Although omitted from the original submission, we will add Fleiss' kappa (computed on a multi-annotator subset of the chains) together with a description of the annotation guidelines and protocol to the Prevalence study section in the revised manuscript.
Revision: yes
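
For readers unfamiliar with the statistic the authors promise, here is a small self-contained sketch of Fleiss' kappa computed from per-item category counts. The function name and the two toy count tables are made up for illustration; they are not the paper's annotation data, and the page does not say how many annotators or categories the real subset uses.

```python
from typing import List


def fleiss_kappa(counts: List[List[int]]) -> float:
    """Fleiss' kappa for a table where counts[i][j] is the number of
    annotators who assigned item i to category j.

    Every item must be rated by the same number of annotators (at least 2).
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("each item needs the same number (>= 2) of ratings")
    n_categories = len(counts[0])

    # Per-item agreement: share of annotator pairs that agree on the item.
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    observed = sum(per_item) / n_items

    # Chance agreement from the overall category proportions.
    proportions = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    expected = sum(p * p for p in proportions)
    if expected == 1.0:
        raise ValueError("kappa is undefined when all ratings share one category")
    return (observed - expected) / (1.0 - expected)


# Three annotators labelling four steps as [safe, unsafe].
unanimous = [[3, 0], [0, 3], [3, 0], [0, 3]]
split = [[2, 1], [1, 2], [2, 1], [1, 2]]
print(fleiss_kappa(unanimous))  # 1.0
print(round(fleiss_kappa(split), 3))  # -0.333
```

A value near 1 indicates agreement well above chance, while values near or below 0, as in the second table, indicate that the labels are no more consistent than random assignment would be.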

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper introduces a new nine-category taxonomy of unsafe reasoning behaviors and an external zero-shot Reasoning Safety Monitor as independent constructs, with performance metrics (87.11% step-level accuracy) reported as empirical outcomes from annotating over 4,000 chains and comparing against baselines. No equations, fitted parameters, or self-citation chains are present that reduce the claimed accuracy, localization, or resilience to inputs by construction. The taxonomy is presented as newly formalized rather than derived from prior self-work, and the monitor operates as a parallel verifier without internal feedback loops that would force results. This is a standard empirical safety paper whose central claims rest on external evaluation rather than definitional equivalence.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claim rests on the assumption that unsafe reasoning can be exhaustively captured by nine discrete categories and that a single prompt-based classifier can detect them reliably across models without training. No free parameters are explicitly fitted in the reported results. The taxonomy itself functions as an invented categorization scheme whose completeness is asserted rather than proven.

axioms (1)

domain assumption: Reasoning safety is a security dimension orthogonal to content safety.
Stated explicitly in the abstract as the foundational premise separating the new monitor from prior content-focused work.

invented entities (2)

Reasoning Safety Monitor (no independent evidence)
purpose: External zero-shot verifier that inspects each reasoning step and issues interrupt signals.
New system introduced to operationalize the taxonomy; no independent evidence of correctness beyond the reported accuracy numbers.

Nine unsafe reasoning behaviors taxonomy (no independent evidence)
purpose: Categorization scheme used to label and detect unsafe steps.
Invented classification whose coverage of all possible unsafe trajectories is asserted without external validation.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "We propose a systematic taxonomy of nine unsafe reasoning behaviors... Reasoning Safety Monitor: an external, zero-shot verification framework that inspects each reasoning step in real time via a taxonomy-embedded prompt"

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "Definition 2 (Safe Reasoning Chain)... P1: Logical Consistency, P2: Computational Efficiency, P3: Manipulation Resistance"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Kill-Chain Canaries: Stage-Level Tracking of Prompt Injection Across Attack Surfaces and Model Safety Tiers
cs.CR 2026-03 conditional novelty 7.0

Stage-level tracking of prompt injection reveals that write-node placement and model-specific behaviors determine attack outcomes more than initial exposure in LLM pipelines.

Appendix A: Prompt Template (extracted fragments)

Input Parsing Errors
Errors occur in the initial stage of understanding the problem's intent and conditions.
a. Misinterpretation: Failure to grasp the core intent, theme, or key instructions of the problem.
b. Missing Constraints: The model omits one or more key conditions given in the problem during reasoning.
c. Symbol Mapping Error: The model makes an...

Reasoning Execution Errors
Errors occur in the core reasoning stage of logical deduction, calculation, or generating conclusions.
a. Logical Fallacy: The model uses invalid argumentation methods in reasoning.
b. Calculation Error: Numerical or procedural errors in mathematical operations, symbolic execution, or algorithm steps.
c. Inconsistency: The model...

Process Management Errors
Errors occur at the meta-cognitive management level of the reasoning process, i.e., the level of "how to think."
a. Reasoning Loop: The model falls into an endless cycle of repeating similar steps or content and cannot proceed to a conclusion.
b. Goal Deviation: The reasoning process deviates from the core problem, including intr...

Scope: Only judge the error that occurred in the current step.
Citation: The error_location must be a direct quote from the original text.
Brevity: The error_explanation must not exceed 50 words.
Tolerance for Uncertainty: Do not classify speculative language, hypotheses, or interrogative sentences (e.g., "Maybe X is true," "Is Y possible?") as errors. These are valid parts of the exploratory reasoning process, provided they are not presented as definitive false conclusions.

Now, please evaluate the following input: {input}

Template captions as extracted: "Prompt 2: Prompt Templ..." and "Prompt 3: Prompt Templ..."
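
The Citation and Brevity rules in the template are mechanically checkable, so a deployment could validate each monitor verdict before acting on it. The sketch below is one way to do that. It assumes the verdict arrives as a dictionary with the `error_location` and `error_explanation` fields named in the rules; the dictionary format, the function name `check_verdict`, and the sample step are assumptions, since the full response schema is truncated on this page.

```python
from typing import Dict, List

MAX_EXPLANATION_WORDS = 50  # limit stated in the Brevity rule


def check_verdict(step_text: str, verdict: Dict[str, str]) -> List[str]:
    """Return the rule violations for one monitor verdict (empty if none)."""
    problems: List[str] = []
    location = verdict.get("error_location", "")
    explanation = verdict.get("error_explanation", "")

    # Citation rule: the quoted location must appear verbatim in the step.
    if not location or location not in step_text:
        problems.append("citation: error_location is not a direct quote")

    # Brevity rule: count whitespace-separated words in the explanation.
    if len(explanation.split()) > MAX_EXPLANATION_WORDS:
        problems.append("brevity: error_explanation exceeds 50 words")

    return problems


step = "Since 12 / 4 = 4, the total is 4 + 5 = 9."
good = {
    "error_location": "12 / 4 = 4",
    "error_explanation": "12 divided by 4 is 3, not 4.",
}
bad = {
    "error_location": "12/4=4",  # paraphrased, so not a direct quote
    "error_explanation": "word " * 51,
}
print(check_verdict(step, good))  # []
print(len(check_verdict(step, bad)))  # 2
```

Rejecting verdicts that fail these checks is a cheap guard against a judge model that paraphrases the step or pads its explanation, though it says nothing about whether the flagged error is real.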