arxiv: 2604.10693 · v2 · submitted 2026-04-12 · 💻 cs.AI

Recognition: unknown

FACT-E: Causality-Inspired Evaluation for Trustworthy Chain-of-Thought Reasoning

Yuxi Sun , Aoqi Zuo , Haotian Xie , Wei Gao , Mingming Gong , Jing Ma

Authors on Pith no claims yet

Pith reviewed 2026-05-10 14:59 UTC · model grok-4.3

classification 💻 cs.AI

keywords chain-of-thought reasoningfaithfulness evaluationcausal inferenceLLM trustworthinessperturbationsinstrumental variablesreasoning trajectories

0 comments

The pith

Controlled perturbations isolate genuine causal step dependence in chain-of-thought reasoning from bias artifacts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops FACT-E to evaluate the quality of chain-of-thought reasoning in large language models. It treats controlled perturbations to intermediate steps as an instrumental variable that reveals whether each step genuinely depends on the previous one. This approach aims to overcome the biases in self-evaluation where models endorse their own flawed logic. By also checking consistency with the final answer, FACT-E selects more trustworthy reasoning chains. The result is better performance in selecting examples for in-context learning on math and question-answering tasks.

Core claim

FACT-E is a causality-inspired evaluation framework that applies controlled perturbations to CoT trajectories as instrumental signals, thereby estimating intra-chain faithfulness by separating true causal step-to-step implications from bias-driven artifacts, and selects trustworthy chains by jointly optimizing intra-chain faithfulness and CoT-to-answer consistency.

What carries the argument

Controlled perturbations serving as instrumental variables to estimate intra-chain faithfulness in chain-of-thought sequences.

If this is right

Selected reasoning trajectories exhibit higher internal faithfulness and better support for correct answers.
Improved quality of in-context learning exemplars derived from these trajectories.
More reliable detection of flawed reasoning steps even under noisy input conditions.
Stronger overall performance in reasoning tasks when using the selected chains.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The perturbation technique could be used to flag and correct specific unfaithful steps during generation rather than only after the fact.
This method might apply to other multi-step reasoning formats such as tree-of-thought or graph-based reasoning.
Combining the scores with external verification signals could further strengthen the causal estimates.

Load-bearing premise

The controlled perturbations function as valid instruments that reveal the model's true causal dependencies without introducing their own confounding effects or model-specific biases.

What would settle it

If the faithfulness scores from FACT-E fail to align with independent human judgments of step validity, or if applying the perturbations alters model outputs in ways inconsistent with the expected causal structure.

Figures

Figures reproduced from arXiv: 2604.10693 by Aoqi Zuo, Haotian Xie, Jing Ma, Mingming Gong, Wei Gao, Yuxi Sun.

**Figure 2.** Figure 2: Structural causal graphs for CoT quality [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗

**Figure 3.** Figure 3: Performance (%) on noisy-rationale bench [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗

**Figure 4.** Figure 4: Performance (%) of lightweight FACT-E (a) [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗

**Figure 5.** Figure 5: Ablation study of FACT-E. may still vary depending on whether multiple operations are batched into a single prompt. A.6 Ablative Study The ablation results in [PITH_FULL_IMAGE:figures/full_fig_p014_5.png] view at source ↗

**Figure 6.** Figure 6: Analysis of different levels of MATH-500. [PITH_FULL_IMAGE:figures/full_fig_p017_6.png] view at source ↗

read the original abstract

Chain-of-Thought (CoT) prompting has improved LLM reasoning, but models often generate explanations that appear coherent while containing unfaithful intermediate steps. Existing self-evaluation approaches are prone to inherent biases: the model may confidently endorse coherence even when the step-to-step implication is not valid, leading to unreliable faithfulness evaluation. We propose FACT-E, a causality-inspired framework for evaluating CoT quality. FACT-E uses controlled perturbations as an instrumental signal to separate genuine step-to-step dependence from bias-driven artifacts, producing more reliable faithfulness estimates (\textit{intra-chain faithfulness}). To select trustworthy trajectories, FACT-E jointly considers \textit{intra-chain faithfulness} and \textit{CoT-to-answer consistency}, ensuring that selected chains are both faithful internally and supportive of the correct final answer. Experiments on GSM8K, MATH, and CommonsenseQA show that FACT-E improves reasoning-trajectory selection and yields stronger in-context learning exemplars. FACT-E also reliably detects flawed reasoning under noisy conditions, providing a robust metric for trustworthy LLM reasoning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper's new angle is using perturbations as instruments to measure real step-to-step dependence in CoT, but the IV assumptions are unlikely to hold cleanly in transformers.

read the letter

The main takeaway is that FACT-E treats controlled perturbations as an instrumental signal to isolate genuine causal dependence between CoT steps from the model's bias toward coherent-sounding text. They then combine that intra-chain faithfulness score with consistency to the final answer to pick better trajectories for downstream use. That framing and the joint selection rule are the concrete additions over prior self-evaluation methods. The experiments on GSM8K, MATH, and CommonsenseQA show measurable gains in trajectory selection and in-context learning performance, plus better detection of flawed chains under noise. Those results are run on standard benchmarks with clear metrics, which makes the empirical claims straightforward to inspect. The soft spot is the causal identification step. For perturbations to function as valid instruments they must hit only the target step, affect later steps solely through that step, and stay independent of the model's coherence biases. Transformer attention spreads changes globally, so small edits can alter hidden states downstream in ways that violate exclusion. The paper does not supply a formal causal graph or ablations that test whether perturbation effects remain uncorrelated with the very biases being removed. Without that grounding the faithfulness estimates risk picking up new artifacts instead of true dependence. This work is aimed at researchers building evaluation tools for reliable LLM reasoning. A reader focused on faithfulness metrics or causal approaches to AI reliability would get practical value from the selection results and the noise-robustness tests. I would send it to peer review. The experiments are concrete and the idea is distinct enough that referees can usefully pressure the identification assumptions and suggest tighter controls.

Referee Report

3 major / 2 minor

Summary. The paper introduces FACT-E, a causality-inspired framework for evaluating the faithfulness of Chain-of-Thought (CoT) reasoning in LLMs. It employs controlled perturbations as an instrumental variable to isolate genuine step-to-step causal dependence (intra-chain faithfulness) from model biases in self-evaluation. Trajectories are then selected by jointly optimizing intra-chain faithfulness and CoT-to-answer consistency. Experiments on GSM8K, MATH, and CommonsenseQA claim improved trajectory selection for in-context learning and better detection of flawed reasoning under noise.

Significance. If the perturbations satisfy the instrumental-variable assumptions (relevance, exclusion, and independence) without introducing transformer-specific artifacts, FACT-E would provide a principled advance over purely correlational self-evaluation methods for CoT faithfulness. The combination of causal grounding with downstream consistency offers a practical route to more trustworthy reasoning trajectories and better ICL exemplars. The work's value hinges on whether the empirical results demonstrate that the method isolates genuine dependence rather than merely re-expressing model coherence biases.

major comments (3)

[§3.2] §3.2 (Perturbation Design): The manuscript describes perturbations as 'controlled' but provides neither a formal causal graph nor a proof that the exclusion restriction holds for transformer attention patterns. Small token edits can alter hidden states globally, so it is unclear why downstream steps are affected only through the target step rather than through new artifacts that the intra-chain faithfulness metric would then misattribute as genuine dependence.
[§4.1–4.3] §4.1–4.3 (Experimental Validation): No ablation or diagnostic is reported that tests the independence assumption—i.e., that perturbation-induced changes are uncorrelated with the coherence biases the method aims to remove. Without such evidence, the reported gains in faithfulness estimates and trajectory selection could be driven by the same model-specific artifacts the framework claims to mitigate.
[Table 1 / Figure 3] Table 1 / Figure 3 (Results): The improvements in intra-chain faithfulness and downstream accuracy are presented without statistical significance tests against strong baselines that also apply perturbations but lack the causal framing. This makes it difficult to determine whether the causality-inspired component is load-bearing or whether any controlled perturbation would produce similar selection benefits.

minor comments (2)

[§3] The mathematical definition of the intra-chain faithfulness estimator (presumably Eq. (X) in §3) should be stated explicitly rather than described only in prose, to allow readers to verify how the instrumental-variable estimator is computed from the perturbed and unperturbed chains.
The paper should include a limitations paragraph discussing the sensitivity of the method to the choice of perturbation strength and type, as this choice is central to the validity of the instrument.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their insightful comments, which highlight important aspects of the causal assumptions and validation in FACT-E. We provide point-by-point responses below and will incorporate revisions to address the concerns raised.

read point-by-point responses

Referee: [§3.2] §3.2 (Perturbation Design): The manuscript describes perturbations as 'controlled' but provides neither a formal causal graph nor a proof that the exclusion restriction holds for transformer attention patterns. Small token edits can alter hidden states globally, so it is unclear why downstream steps are affected only through the target step rather than through new artifacts that the intra-chain faithfulness metric would then misattribute as genuine dependence.

Authors: We concur that explicitly formalizing the causal assumptions would strengthen the presentation. In the revised manuscript, we will introduce a causal graph showing the perturbation as an instrument that influences the target step's faithfulness, with the exclusion restriction justified by the localized nature of the edits. We will elaborate on why global hidden state changes are mitigated through our choice of minimal perturbations and step-specific application. A full mathematical proof for arbitrary transformer models is beyond current capabilities due to the complexity of attention mechanisms; we will instead provide additional empirical diagnostics and acknowledge this as a limitation of the approach. revision: partial
Referee: [§4.1–4.3] §4.1–4.3 (Experimental Validation): No ablation or diagnostic is reported that tests the independence assumption—i.e., that perturbation-induced changes are uncorrelated with the coherence biases the method aims to remove. Without such evidence, the reported gains in faithfulness estimates and trajectory selection could be driven by the same model-specific artifacts the framework claims to mitigate.

Authors: This is a valid concern. We will add a new subsection with ablations designed to test the independence assumption. This includes computing correlations between perturbation-induced faithfulness changes and bias indicators like the model's self-reported confidence or consistency without perturbations. We will also introduce a control experiment using non-instrumental perturbations to isolate the effect of the causal framing. revision: yes
Referee: [Table 1 / Figure 3] Table 1 / Figure 3 (Results): The improvements in intra-chain faithfulness and downstream accuracy are presented without statistical significance tests against strong baselines that also apply perturbations but lack the causal framing. This makes it difficult to determine whether the causality-inspired component is load-bearing or whether any controlled perturbation would produce similar selection benefits.

Authors: We will enhance the results section by including statistical significance tests. Specifically, we will add comparisons to perturbation-based selection methods that do not incorporate the instrumental variable logic, such as averaging over multiple perturbed versions without causal attribution. Statistical tests, including paired t-tests or bootstrap confidence intervals, will be reported for the key metrics in Table 1 and Figure 3 to demonstrate the significance of the improvements and the contribution of the causal component. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected; derivation is self-contained

full rationale

The provided abstract and description introduce FACT-E as a new causality-inspired framework that applies controlled perturbations to isolate intra-chain faithfulness from bias artifacts, then combines it with CoT-to-answer consistency for trajectory selection. No equations, self-citations, or definitions are visible that reduce the faithfulness metric or selection process to a fitted parameter or input by construction. The central claim rests on the design of perturbations as an instrumental signal and empirical results on GSM8K, MATH, and CommonsenseQA, which constitute independent external benchmarks rather than tautological renaming or self-referential fitting. This is the expected honest non-finding for a proposal paper whose method does not collapse to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that perturbations isolate genuine causal dependence in LLM reasoning chains; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (1)

domain assumption Controlled perturbations serve as valid instrumental variables that isolate genuine step-to-step causal dependence from bias-driven artifacts in LLM CoT.
This is the core mechanism invoked in the abstract to produce reliable faithfulness estimates.

pith-pipeline@v0.9.0 · 5486 in / 1232 out tokens · 125628 ms · 2026-05-10T14:59:38.551602+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

55 extracted references · 37 canonical work pages · 15 internal anchors

[1]

Maciej Besta, Nils Blach, Ales Kubicek, Robert Gerstenberger, Michal Podstawski, Lukas Gianinazzi, Joanna Gajda, Tomasz Lehmann, Hubert Niewiadomski, Piotr Nyczyk, and Torsten Hoefler. 2024. https://doi.org/10.1609/aaai.v38i16.29720 Graph of thoughts: Solving elaborate problems with large language models . Proceedings of the AAAI Conference on Artificial ...

work page doi:10.1609/aaai.v38i16.29720 2024
[2]

Shidong Cao, Hongzhan Lin, Yuxuan Gu, Ziyang Luo, and Jing Ma. 2026. Diffcot: Diffusion-styled chain-of-thought reasoning in llms. arXiv preprint arXiv:2601.03559

work page internal anchor Pith review Pith/arXiv arXiv 2026
[3]

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, and 1 others. 2021. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168

work page internal anchor Pith review Pith/arXiv arXiv 2021
[4]

Yingqian Cui, Pengfei He, Xianfeng Tang, Qi He, Chen Luo, Jiliang Tang, and Yue Xing. 2024. A theoretical understanding of chain-of-thought: Coherent reasoning and error-aware demonstration. arXiv preprint arXiv:2410.16540

work page arXiv 2024
[5]

Yilun Du, Shuang Li, Antonio Torralba, Joshua B Tenenbaum, and Igor Mordatch. 2023. Improving factuality and reasoning in language models through multiagent debate. In Forty-first International Conference on Machine Learning

2023
[7]

Jiarun Fu, Lizhong Ding, Hao Li, Pengqi Li, Qiuning Wei, and Xu Chen. 2025 b . https://arxiv.org/abs/2502.18239 Unveiling and causalizing cot: A causal pespective . Preprint, arXiv:2502.18239

work page arXiv 2025
[8]

Yanji He, Yuxin Jiang, Yiwen Wu, Bo Huang, Jiaheng Wei, and Wei Wang. 2026. https://arxiv.org/abs/2604.12573 Idea: An interpretable and editable decision-making framework for llms via verbal-to-numeric calibration . Preprint, arXiv:2604.12573

work page internal anchor Pith review Pith/arXiv arXiv 2026
[9]

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. 2021. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874

work page internal anchor Pith review Pith/arXiv arXiv 2021
[10]

Fanding Huang, Guanbo Huang, Xiao Fan, Yi He, Xiao Liang, Xiao Chen, Qinting Jiang, Faisal Nadeem Khan, Jingyan Jiang, and Zhi Wang. 2026. https://arxiv.org/abs/2509.23808 Semantic-space exploration and exploitation in rlvr for llm reasoning . Preprint, arXiv:2509.23808

work page internal anchor Pith review Pith/arXiv arXiv 2026
[11]

Jie Huang, Xinyun Chen, Swaroop Mishra, Huaixiu Steven Zheng, Adams Wei Yu, Xinying Song, and Denny Zhou. 2023. Large language models cannot self-correct reasoning yet. arXiv preprint arXiv:2310.01798

work page internal anchor Pith review arXiv 2023
[12]

Reasoning elicitation in language models via counterfactual feedback.arXiv preprint arXiv:2410.03767, 2024

Alihan Hüyük, Xinnuo Xu, Jacqueline Maasch, Aditya V. Nori, and Javier González. 2025. https://arxiv.org/abs/2410.03767 Reasoning elicitation in language models via counterfactual feedback . Preprint, arXiv:2410.03767

work page arXiv 2025
[13]

Junhao Jia, Yueyi Wu, Huangwei Chen, Haodong Jing, Haishuai Wang, Jiajun Bu, and Lei Wu. 2026. Unsupervised causal prototypical networks for de-biased interpretable dermoscopy diagnosis. arXiv preprint arXiv:2602.23752

work page arXiv 2026
[14]

Bowen Jiang, Yangxinyu Xie, Zhuoqun Hao, Xiaomeng Wang, Tanwi Mallick, Weijie J Su, Camillo Jose Taylor, and Dan Roth. 2024 a . A peek into token bias: Large language models are not yet genuine reasoners. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 4722--4756

2024
[15]

Bowen Jiang, Yangxinyu Xie, Zhuoqun Hao, Xiaomeng Wang, Tanwi Mallick, Weijie J Su, Camillo Jose Taylor, and Dan Roth. 2024 b . https://doi.org/10.18653/v1/2024.emnlp-main.272 A peek into token bias: Large language models are not yet genuine reasoners . In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 4722--...

work page doi:10.18653/v1/2024.emnlp-main.272 2024
[16]

Huiqiang Jiang, Qianhui Wu, Chin-Yew Lin, Yuqing Yang, and Lili Qiu. 2023. https://doi.org/10.18653/v1/2023.emnlp-main.825 LLML ingua: Compressing prompts for accelerated inference of large language models . In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 13358--13376, Singapore. Association for Computation...

work page doi:10.18653/v1/2023.emnlp-main.825 2023
[18]

Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield-Dodds, Nova DasSarma, Eli Tran-Johnson, and 1 others. 2022 b . Language models (mostly) know what they know. arXiv preprint arXiv:2207.05221

work page internal anchor Pith review arXiv 2022
[19]

Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. 2023. https://arxiv.org/abs/2205.11916 Large language models are zero-shot reasoners . Preprint, arXiv:2205.11916

work page internal anchor Pith review arXiv 2023
[20]

Tamera Lanham, Anna Chen, Ansh Radhakrishnan, Benoit Steiner, Carson Denison, Danny Hernandez, Dustin Li, Esin Durmus, Evan Hubinger, Jackson Kernion, and 1 others. 2023. Measuring faithfulness in chain-of-thought reasoning. arXiv preprint arXiv:2307.13702

work page Pith review arXiv 2023
[21]

Xiaoyuan Li, Keqin Bao, Yubo Ma, Moxin Li, Wenjie Wang, Rui Men, Yichang Zhang, Fuli Feng, Dayiheng Liu, and Junyang Lin. 2025. Mtr-bench: A comprehensive benchmark for multi-turn reasoning evaluation. arXiv preprint arXiv:2505.17123

work page arXiv 2025
[22]

Guanran Luo, Wentao Qiu, Zhongquan Jian, Meihong Wang, and Qingqiang Wu. 2026. Gcot-decoding: Unlocking deep reasoning paths for universal question answering. arXiv preprint arXiv:2604.06794

work page internal anchor Pith review Pith/arXiv arXiv 2026
[23]

Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, and 1 others. 2023. Self-refine: Iterative refinement with self-feedback. Advances in Neural Information Processing Systems, 36:46534--46594

2023
[24]

Nick McKenna, Tianyi Li, Liang Cheng, Mohammad Hosseini, Mark Johnson, and Mark Steedman. 2023. Sources of hallucination by large language models on inference tasks. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 2758--2774

2023
[25]

Judea Pearl. 2009. Causality: Models, Reasoning, and Inference. Cambridge University Press

2009
[26]

Ansh Radhakrishnan, Karina Nguyen, Anna Chen, Carol Chen, Carson Denison, Danny Hernandez, Esin Durmus, Evan Hubinger, Jackson Kernion, Kamil \.e Luko s i \=u t \.e , and 1 others. 2023. Question decomposition improves the faithfulness of model-generated reasoning. arXiv preprint arXiv:2307.11768

work page arXiv 2023
[27]

Bilgehan Sel, Ahmad Al-Tawaha, Vanshaj Khattar, Ruoxi Jia, and Ming Jin. 2023. Algorithm of thoughts: Enhancing exploration of ideas in large language models. arXiv preprint arXiv:2308.10379

work page arXiv 2023
[28]

Xu Shen, Song Wang, Zhen Tan, Laura Yao, Xinyu Zhao, Kaidi Xu, Xin Wang, and Tianlong Chen. 2025. Faithcot-bench: Benchmarking instance-level faithfulness of chain-of-thought reasoning. arXiv preprint arXiv:2510.04040

work page arXiv 2025
[29]

Yuhao Shen, Tianyu Liu, Junyi Shen, Jinyang Wu, Quan Kong, Li Huan, and Cong Wang. 2026. Double: Breaking the acceleration limit via double retrieval speculative parallelism. arXiv preprint arXiv:2601.05524

work page internal anchor Pith review Pith/arXiv arXiv 2026
[30]

Yuxi Sun, Wei Gao, Hongzhan Lin, Jing Ma, and Wenxuan Zhang. 2025 a . Explainable ethical assessment on human behaviors by generating conflicting social norms. In Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics, pages 166--184

2025
[31]

Yuxi Sun, Aoqi Zuo, Wei Gao, and Jing Ma. 2025 b . Causalabstain: Enhancing multilingual llms with causal reasoning for trustworthy abstention. In Findings of the Association for Computational Linguistics: ACL 2025, pages 14060--14076

2025
[32]

Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. 2019. Commonsenseqa: A question answering challenge targeting commonsense knowledge. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4149--4158

2019
[33]

Miles Turpin, Julian Michael, Ethan Perez, and Samuel R Bowman. 2024. Language models don't always say what they think: Unfaithful explanations in chain-of-thought prompting. Advances in Neural Information Processing Systems (NeurIPS)

2024
[34]

Chengbing Wang, Yang Zhang, Wenjie Wang, Xiaoyan Zhao, Fuli Feng, Xiangnan He, and Tat-Seng Chua. 2025. Think-while-generating: On-the-fly reasoning for personalized long-form generation. arXiv preprint arXiv:2512.06690

work page arXiv 2025
[36]

Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. 2023. https://arxiv.org/abs/2203.11171 Self-consistency improves chain of thought reasoning in language models . Preprint, arXiv:2203.11171

work page internal anchor Pith review Pith/arXiv arXiv 2023
[37]

Yu Wang and Chu-Ren Huang. 2024. Word boundary decision: An efficient approach for low-resource word segmentation. In Proceedings of the 38th Pacific Asia Conference on Language, Information and Computation, pages 160--169

2024
[38]

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed Chi, Quoc Le, and Denny Zhou. 2022. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems (NeurIPS)

2022
[39]

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. 2023. https://arxiv.org/abs/2201.11903 Chain-of-thought prompting elicits reasoning in large language models . Preprint, arXiv:2201.11903

work page internal anchor Pith review Pith/arXiv arXiv 2023
[40]

Jinyang Wu, Shuo Yang, Changpeng Yang, Yuhao Shen, Shuai Zhang, Zhengqi Wen, and Jianhua Tao. 2026. Spark: Strategic policy-aware exploration via dynamic branching for long-horizon agentic learning. arXiv preprint arXiv:2601.20209

work page arXiv 2026
[41]

Junda Wu, Tong Yu, Xiang Chen, Haoliang Wang, Ryan Rossi, Sungchul Kim, Anup Rao, and Julian McAuley. 2024. https://doi.org/10.18653/v1/2024.acl-long.758 D e C o T : Debiasing chain-of-thought for knowledge-intensive tasks in large language models via causal intervention . In Proceedings of the 62nd Annual Meeting of the Association for Computational Ling...

work page doi:10.18653/v1/2024.acl-long.758 2024
[42]

Zhiheng Xi, Senjie Jin, Yuhao Zhou, Rui Zheng, Songyang Gao, Tao Gui, Qi Zhang, and Xuanjing Huang. 2024. https://arxiv.org/abs/2305.14497 Self-polish: Enhance reasoning in large language models via problem refinement . Preprint, arXiv:2305.14497

work page arXiv 2024
[43]

Zhiheng Xi, Senjie Jin, Yuhao Zhou, Rui Zheng, Songyang Gao, Jia Liu, Tao Gui, Qi Zhang, and Xuan-Jing Huang. 2023. Self-polish: Enhance reasoning in large language models via problem refinement. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 11383--11406

2023
[44]

Zidi Xiong, Shan Chen, Zhenting Qi, and Himabindu Lakkaraju. 2025. Measuring the faithfulness of thinking drafts in large reasoning models. arXiv preprint arXiv:2505.13774

work page arXiv 2025
[45]

Shijia Xu, Yu Wang, Xiaolong Jia, Zhou Wu, Kai Liu, and April Xiaowen Dong. 2026 a . https://arxiv.org/abs/2604.10740 Rcbsf: A multi-agent framework for automated contract revision via stackelberg game . Preprint, arXiv:2604.10740

work page internal anchor Pith review Pith/arXiv arXiv 2026
[46]

Shijia Xu, Zhou Wu, Xiaolong Jia, Yu Wang, Kai Liu, and April Xiaowen Dong. 2026 b . https://arxiv.org/abs/2604.10734 Self-correcting rag: Enhancing faithfulness via mmkp context selection and nli-guided mcts . Preprint, arXiv:2604.10734

work page internal anchor Pith review Pith/arXiv arXiv 2026
[47]

Sohee Yang, Sang-Woo Lee, Nora Kassner, Daniela Gottesman, Sebastian Riedel, and Mor Geva. 2025. How well can reasoning models identify and recover from unhelpful thoughts? arXiv preprint arXiv:2506.10979

work page arXiv 2025
[48]

Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. 2023. Tree of thoughts: Deliberate problem solving with large language models. Advances in neural information processing systems, 36:11809--11822

2023
[49]

Junchi Yu, Ran He, and Rex Ying. 2023. Thought propagation: An analogical approach to complex reasoning with large language models. arXiv preprint arXiv:2310.03965

work page arXiv 2023
[50]

Xiangning Yu, Zhuohan Wang, Linyi Yang, Haoxuan Li, Anjie Liu, Xiao Xue, Jun Wang, and Mengyue Yang. 2025. https://arxiv.org/abs/2506.09853 Causal sufficiency and necessity improves chain-of-thought reasoning . Preprint, arXiv:2506.09853

work page arXiv 2025
[51]

Chen Zhang, Lanning Zhang, and Dexiang Zhou. 2024. Causal walk: Debiasing multi-hop fact verification with front-door adjustment. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 19533--19541

2024
[52]

Zhen Zhang, Guanhua Zhang, Bairu Hou, Wenqi Fan, Qing Li, Sijia Liu, Yang Zhang, and Shiyu Chang. 2023. Certified robustness for large language models with self-denoising. arXiv preprint arXiv:2307.07171

work page arXiv 2023
[53]

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, and 1 others. 2023. Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in neural information processing systems, 36:46595--46623

2023
[54]

Denny Zhou, Nathanael Sch \"a rli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Claire Cui, Olivier Bousquet, Quoc Le, and 1 others. 2022. Least-to-most prompting enables complex reasoning in large language models. arXiv preprint arXiv:2205.10625

work page internal anchor Pith review arXiv 2022
[55]

Zhanke Zhou, Rong Tao, Jianing Zhu, Yiwen Luo, Zengmao Wang, and Bo Han. 2024. Can language models perform robust reasoning in chain-of-thought prompting with noisy rationales? Advances in Neural Information Processing Systems, 37:123846--123910

2024
[56]

Wang Zhu, Jesse Thomason, and Robin Jia. 2023. https://doi.org/10.18653/v1/2023.emnlp-main.547 Chain-of-questions training with latent answers for robust multistep question answering . In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 8845--8860, Singapore. Association for Computational Linguistics

work page doi:10.18653/v1/2023.emnlp-main.547 2023
[57]

online" 'onlinestring :=

ENTRY address archivePrefix author booktitle chapter edition editor eid eprint eprinttype howpublished institution journal key month note number organization pages publisher school series title type volume year doi pubmed url lastchecked label extra.label sort.label short.list INTEGERS output.state before.all mid.sentence after.sentence after.block STRING...
[58]

write newline

" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION word.in bbl.in capitalize " " * FUNCT...