SliceGraph: Mapping Process Isomers in Multi-Run Chain-of-Thought Reasoning

Junjie Nian; Kang Chen; Yixin Cao; Yugang Jiang

arxiv: 2605.14619 · v1 · pith:AQ4F6N4Anew · submitted 2026-05-14 · 💻 cs.AI

Pith reviewed 2026-06-30 20:47 UTC · model grok-4.3

classification 💻 cs.AI

keywords: chain-of-thought reasoning, process isomers, multi-run CoT, reasoning trajectories, graph-based analysis, process families, SliceGraph

The pith

Correct chain-of-thought trajectories that reach the same answer frequently belong to separate process families.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SliceGraph, a graph constructed from slices of chain-of-thought outputs, to examine how multiple reasoning runs relate at the level of intermediate states rather than final answers alone. It finds that in the great majority of tested problem-model combinations, correct runs sharing an answer divide into distinct process families whose trajectories do not share the same reasoning-state units. This structured divergence is termed process isomers. A sympathetic reader would care because standard evaluation that collapses runs to answer aggregates would miss this internal geometry of how models arrive at solutions.

Core claim

Across sampled CoT ensembles from three primary 4B/8B models on math and science benchmarks, correct CoTs sharing the same normalized answer split into multiple process families in 85.5% of 954 problem-model cells; among cells with at least two such runs, 76.6% of run pairs are cross-family on average. We call such same-answer, family-divergent correct trajectories process isomers. Blinded annotation supports SliceGraph biconnected components as shared reasoning-state units and process families as within-family strategy-coherent route units.
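The two headline statistics can be made concrete with a small sketch. It assumes, hypothetically, that each problem-model cell is stored as a list of process-family labels (one per correct run sharing the normalized answer) and that the 76.6% figure is a per-cell mean over cells with at least two such runs. The abstract does not spell out either convention, and the names `cells`, `cross_family_fraction`, and `summarize` are illustrative only.

```python
from itertools import combinations

def cross_family_fraction(family_labels):
    """Share of run pairs whose two runs carry different family labels."""
    pairs = list(combinations(family_labels, 2))
    if not pairs:  # fewer than two runs: no pairs to compare
        return None
    return sum(a != b for a, b in pairs) / len(pairs)

def summarize(cells):
    """Return (share of cells with more than one family, mean cross-family pair fraction)."""
    multi_family_share = sum(len(set(labels)) > 1 for labels in cells.values()) / len(cells)
    fractions = [f for f in map(cross_family_fraction, cells.values()) if f is not None]
    mean_cross = sum(fractions) / len(fractions) if fractions else None
    return multi_family_share, mean_cross

# Toy data for illustration only (not taken from the paper).
cells = {
    ("problem-1", "model-a"): ["F1", "F1", "F2"],  # two families, three runs
    ("problem-2", "model-a"): ["F1", "F1"],        # one family, two runs
    ("problem-3", "model-a"): ["F1"],              # single run, no pairs
}
multi_family_share, mean_cross = summarize(cells)
```

In the toy data only the first cell is multi-family, and the single-run cell is excluded from the pair average, which mirrors the "cells with at least two such runs" restriction in the claim.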

What carries the argument

SliceGraph, a post-hoc problem-model-cell graph built by mutual-kNN over sparse activation-key Jaccard similarity between CoT slices, treated as a measurement object that yields biconnected components as reasoning-state units and process families as route units.
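A minimal sketch of the construction named above, assuming each CoT slice is reduced to a set of integer activation keys. The key extraction, the value of k, and any similarity threshold used in the paper are not given in the abstract, so `slices`, `k=2`, and the rule that zero-similarity neighbours are dropped are assumptions made for illustration.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard similarity of two sets; taken as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def mutual_knn_edges(slices, k):
    """Undirected edges (i, j), i < j, where each node is in the other's top-k list."""
    n = len(slices)
    topk = []
    for i in range(n):
        scored = [(jaccard(slices[i], slices[j]), j) for j in range(n) if j != i]
        # Highest similarity first; ties broken by index so the result is deterministic.
        scored.sort(key=lambda t: (-t[0], t[1]))
        # Assumption: neighbours with zero overlap are never linked.
        topk.append({j for s, j in scored[:k] if s > 0.0})
    return sorted(
        (i, j) for i, j in combinations(range(n), 2)
        if j in topk[i] and i in topk[j]
    )

# Toy activation-key sets for five slices (illustrative only).
slices = [
    {1, 2, 3, 4},
    {2, 3, 4, 5},
    {1, 2, 3},
    {10, 11, 12},
    {10, 11, 13},
]
edges = mutual_knn_edges(slices, k=2)
```

The mutual requirement is what keeps the graph sparse: an edge survives only when both endpoints rank each other among their nearest neighbours, so a hub slice cannot attach itself to everything.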

If this is right

A label-seeded reward field shows success-associated regions often split into disconnected high-value cores, with route families specializing over these footprints rather than duplicating one another.
Typed-state transition analysis shows process families navigate the same atlas with distinct transition kernels under matched null controls.
Representation ablations, cross-architecture replication, and cross-scale replications support the robustness of the route-family scaffold.
Final-answer aggregation overlooks the structured multi-route process geometry revealed by the families.
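The transition-kernel claim can be illustrated with a sketch that estimates a row-stochastic kernel per family from typed-state sequences and compares two kernels row by row. The use of total variation distance averaged over shared source states is an assumption here, as are the state names; the abstract does not say how kernel divergence is quantified.

```python
from collections import Counter, defaultdict

def transition_kernel(trajectories):
    """Empirical kernel: for each source state, a distribution over next states."""
    counts = defaultdict(Counter)
    for traj in trajectories:
        for src, dst in zip(traj, traj[1:]):
            counts[src][dst] += 1
    return {
        src: {dst: c / sum(row.values()) for dst, c in row.items()}
        for src, row in counts.items()
    }

def tv_distance(p, q):
    """Total variation distance between two discrete distributions given as dicts."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

def mean_row_tv(kernel_a, kernel_b):
    """Average TV distance over source states that both kernels visit."""
    shared = set(kernel_a) & set(kernel_b)
    if not shared:
        return None
    return sum(tv_distance(kernel_a[s], kernel_b[s]) for s in shared) / len(shared)

# Toy typed-state trajectories for two families (illustrative only).
family_a = [["setup", "algebra", "check"], ["setup", "algebra", "check"]]
family_b = [["setup", "geometry", "check"], ["setup", "algebra", "check"]]
divergence = mean_row_tv(transition_kernel(family_a), transition_kernel(family_b))
```

Restricting the comparison to shared source states matches the "same atlas" framing: the families visit common states but leave them with different probabilities.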

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The existence of process isomers suggests that sampling or decoding strategies could be designed to target different families rather than repeated draws from one dominant route.
Reward models trained on final outcomes alone may need family-specific components to capture the disconnected high-value cores.
The atlas-like structure with distinct kernels implies that interventions at the transition level could steer trajectories between families.

Load-bearing premise

That mutual-kNN over sparse activation-key Jaccard similarity between CoT slices produces biconnected components and process families that correspond to meaningful shared reasoning-state units, as validated by blinded annotation.
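For readers unfamiliar with the graph-theoretic unit in this premise, the sketch below extracts biconnected components from an undirected simple graph with the standard depth-first edge-stack method. It is a generic textbook routine written for illustration, not the paper's code, and it does not attempt to derive process families, whose construction the abstract does not specify.

```python
def biconnected_components(n, edges):
    """Vertex sets of the biconnected components of a simple undirected graph."""
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    disc = [0] * n   # discovery time; 0 means unvisited
    low = [0] * n    # lowest discovery time reachable via one back edge
    timer = 0
    stack = []       # edges of the component currently being explored
    components = []

    def dfs(u, parent):
        nonlocal timer
        timer += 1
        disc[u] = low[u] = timer
        for v in adj[u]:
            if v == parent:
                continue
            if disc[v] == 0:
                stack.append((u, v))
                dfs(v, u)
                low[u] = min(low[u], low[v])
                if low[v] >= disc[u]:
                    # u separates v's subtree: pop one full component.
                    comp = set()
                    while True:
                        a, b = stack.pop()
                        comp.update((a, b))
                        if (a, b) == (u, v):
                            break
                    components.append(comp)
            elif disc[v] < disc[u]:
                # Back edge to an ancestor.
                stack.append((u, v))
                low[u] = min(low[u], disc[v])

    for start in range(n):
        if disc[start] == 0:
            dfs(start, -1)
    return components

# Two triangles joined by a bridge (illustrative only).
toy_edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
components = biconnected_components(6, toy_edges)
```

Nodes 2 and 3 are articulation points and appear in two components each, which is why biconnected components can act as overlapping units rather than a strict partition of slices. The recursive form is fine for small graphs; a large graph would need an iterative traversal.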

What would settle it

If blinded annotators systematically disagree with the biconnected-component groupings produced by the graph, or if a different similarity measure produces substantially lower rates of cross-family correct pairs, the mapping from graph structure to process families would not hold.

Original abstract

Multi-run chain-of-thought reasoning is usually collapsed to final-answer aggregates, which discard how sampled trajectories share, split, and rejoin through intermediate computation. We propose SliceGraph, a post-hoc problem-model-cell graph built by mutual-kNN over sparse activation-key Jaccard similarity between CoT slices, and treat it as a measurement object for process geometry rather than as a decoding program. Across sampled CoT ensembles from three primary 4B/8B models on math and science benchmarks, blinded annotation supports SliceGraph biconnected components as shared reasoning-state units and process families as within-family strategy-coherent route units. In 85.5% of 954 problem-model cells, correct CoTs sharing the same normalized answer split into multiple process families; among cells with at least two such runs, 76.6% of run pairs are cross-family on average. We call such same-answer, family-divergent correct trajectories process isomers. A label-seeded reward field provides a separate value-landscape layer: success-associated regions often split into disconnected high-value cores, and route families specialize over these core footprints rather than merely duplicating one another. A typed-state transition analysis further shows that process families navigate the same atlas with distinct transition kernels under matched null controls. Representation ablations, a cross-architecture replication, and two cross-scale replications support the robustness of the route-family scaffold, showing that final-answer aggregation overlooks this structured multi-route process geometry.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper defines process isomers as same-answer CoT trajectories from different graph families via SliceGraph, with replications, but the families rest on annotation whose reliability is not quantified in the available text.

The main point is that correct multi-run CoTs for the same normalized answer frequently split into multiple process families rather than converging on one route. They build SliceGraph as a post-hoc object from mutual-kNN over sparse activation-key Jaccard similarity between CoT slices, treat biconnected components as shared reasoning-state units and process families as route units, and report that 85.5% of 954 problem-model cells show this split, with 76.6% of run pairs cross-family on average among cells with at least two such runs. They layer on a reward field and typed transition analysis to show specialization across families.

What is new is the explicit framing of process isomers and the graph measurement object for process geometry, which goes beyond final-answer aggregation or simple diversity counts. The replications across three primary models, cross-architecture, and cross-scale runs give some evidence that the route-family scaffold is not brittle.

The soft spot is exactly the one the stress-test flags: the claim that components correspond to meaningful shared reasoning-state units rests on blinded annotation, yet the abstract supplies no inter-annotator agreement, presentation protocol, or objective correlates. Without those, it remains possible the families track surface overlap instead of process structure, which would make the isomer percentages non-interpretable. The lack of error analysis or exclusion rules in the reported results adds to the verification problem.
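One standard way to report the missing agreement figure is Cohen's kappa between two annotators. The sketch below is a generic implementation under the assumption that each annotator assigns one categorical label per item; the label vocabulary and the data are invented for illustration, and nothing here reflects the paper's actual annotation protocol.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:  # both annotators used a single identical label
        return 1.0
    return (observed - expected) / (1.0 - expected)

# Hypothetical judgments on whether a pair of runs follows the same strategy.
annotator_1 = ["same", "same", "diff", "diff", "same", "diff"]
annotator_2 = ["same", "diff", "diff", "diff", "same", "same"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

Kappa corrects raw agreement for what two annotators would reach by chance given their label frequencies, so it is more informative than a plain percentage when one label dominates.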

This is for readers working on LLM reasoning evaluation who want a lens on internal trajectory structure. It deserves a serious referee to examine the full methods, annotation details, and any released data or code, because the core observation could matter for how we aggregate multi-run results if the measurement holds.

Referee Report

3 major / 2 minor

Summary. The paper introduces SliceGraph, a post-hoc graph over CoT slices constructed via mutual-kNN on sparse activation-key Jaccard similarity, to identify biconnected components as shared reasoning-state units and process families as strategy-coherent routes. It reports that in 85.5% of 954 problem-model cells from three 4B/8B models on math/science benchmarks, correct same-answer CoTs split into multiple families (process isomers), with 76.6% of run pairs cross-family on average; blinded annotation is cited as validation, alongside reward-field and typed-state transition analyses showing specialization and distinct kernels, plus representation ablations, a cross-architecture replication, and two cross-scale replications.

Significance. If the process-family construction and annotation validation hold, the result demonstrates that final-answer aggregation discards substantial structured diversity in correct reasoning trajectories, with families navigating distinct transition kernels and reward cores; this could inform more granular evaluation of LLM reasoning and training objectives that target route diversity rather than answer matching alone. The cross-architecture and cross-scale replications are a strength.

major comments (3)

[Abstract (blinded annotation support) and methods describing annotation] The central quantitative claims (85.5% multi-family cells and 76.6% cross-family pairs) rest on the claim that mutual-kNN Jaccard biconnected components correspond to meaningful reasoning-state units, which is supported solely by blinded annotation. No inter-annotator agreement, annotation guidelines, slice presentation protocol, or objective correlates (e.g., differential transition statistics or reward specialization metrics) are reported, leaving open the possibility that components reflect surface lexical overlap rather than shared process geometry.
[SliceGraph construction and quantitative results sections] The definition of process families via biconnected components in the SliceGraph is post-hoc and metric-dependent; it is unclear how sensitive the 85.5% and 76.6% statistics are to the choice of Jaccard threshold, k in kNN, or activation-key sparsity, and no sensitivity analysis or null-model comparison for family emergence is described.
[Typed-state transition analysis] The typed-state transition analysis claims distinct kernels under matched null controls, but without explicit description of how null controls are constructed or how kernel divergence is quantified (e.g., via specific distance on transition matrices), it is difficult to assess whether the reported specialization exceeds what would arise from random partitioning of the same trajectories.
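The random-partitioning concern in the third comment can be tested with a label-shuffle null. The sketch below assumes exactly two families, uses total variation distance between pooled transition distributions as the statistic, and shuffles family labels across runs so that family sizes are preserved. The statistic, the number of permutations, and the toy trajectories are illustrative assumptions rather than the paper's procedure.

```python
import random
from collections import Counter

def pooled_transitions(trajectories):
    """Distribution over (source, target) state pairs pooled across runs."""
    counts = Counter()
    for traj in trajectories:
        counts.update(zip(traj, traj[1:]))
    total = sum(counts.values())
    return {pair: c / total for pair, c in counts.items()} if total else {}

def family_tv(trajectories, labels):
    """TV distance between the two families' pooled transition distributions."""
    fam_a, fam_b = sorted(set(labels))  # assumes exactly two family labels
    p = pooled_transitions([t for t, l in zip(trajectories, labels) if l == fam_a])
    q = pooled_transitions([t for t, l in zip(trajectories, labels) if l == fam_b])
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

def label_shuffle_test(trajectories, labels, n_perm=200, seed=0):
    """Return (observed statistic, permutation p-value) under a size-preserving shuffle."""
    rng = random.Random(seed)
    observed = family_tv(trajectories, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # same label multiset, random assignment to runs
        if family_tv(trajectories, shuffled) >= observed - 1e-12:
            hits += 1
    # Add-one correction keeps the p-value strictly positive.
    return observed, (hits + 1) / (n_perm + 1)

# Toy data: two families that never share a transition (illustrative only).
runs = [["s", "a", "c"]] * 4 + [["s", "g", "c"]] * 4
fams = ["A"] * 4 + ["B"] * 4
observed, p_value = label_shuffle_test(runs, fams)
```

If the observed divergence is not clearly larger than what shuffled labels produce, the specialization claim would reduce to an artifact of partitioning the same trajectories.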

minor comments (2)

[Results on process isomers] The abstract and results would benefit from explicit reporting of the total number of CoT runs per cell and the distribution of family sizes to allow readers to gauge the base rates underlying the 76.6% cross-family pair statistic.
[Methods] Notation for 'normalized answer' and 'activation-key' should be defined at first use with a short formal definition or pseudocode reference.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments, which identify areas where additional methodological detail will improve clarity and reproducibility. We respond point-by-point below.

Referee: [Abstract (blinded annotation support) and methods describing annotation] The central quantitative claims (85.5% multi-family cells and 76.6% cross-family pairs) rest on the claim that mutual-kNN Jaccard biconnected components correspond to meaningful reasoning-state units, which is supported solely by blinded annotation. No inter-annotator agreement, annotation guidelines, slice presentation protocol, or objective correlates (e.g., differential transition statistics or reward specialization metrics) are reported, leaving open the possibility that components reflect surface lexical overlap rather than shared process geometry.

Authors: We agree that expanded reporting on the annotation protocol is warranted. In revision we will add a methods subsection that includes the annotation guidelines, slice presentation protocol, and inter-annotator agreement statistics. We will also report objective correlates (differential transition statistics and reward specialization metrics) that distinguish process geometry from lexical overlap, thereby addressing the concern directly. revision: yes
Referee: [SliceGraph construction and quantitative results sections] The definition of process families via biconnected components in the SliceGraph is post-hoc and metric-dependent; it is unclear how sensitive the 85.5% and 76.6% statistics are to the choice of Jaccard threshold, k in kNN, or activation-key sparsity, and no sensitivity analysis or null-model comparison for family emergence is described.

Authors: We will add a sensitivity analysis subsection that varies the Jaccard threshold, k, and sparsity level and reports the resulting range for the 85.5% and 76.6% statistics. We will also include a null-model comparison that quantifies family emergence against randomized baselines, allowing robustness to the chosen parameters to be assessed directly. revision: yes
Referee: [Typed-state transition analysis] The typed-state transition analysis claims distinct kernels under matched null controls, but without explicit description of how null controls are constructed or how kernel divergence is quantified (e.g., via specific distance on transition matrices), it is difficult to assess whether the reported specialization exceeds what would arise from random partitioning of the same trajectories.

Authors: We will expand the methods to specify the exact construction of the matched null controls and the distance metric used to quantify divergence between transition matrices. This will allow direct comparison against random partitioning and make the specialization claim fully evaluable. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical measurement procedure is self-contained

The paper defines SliceGraph explicitly as a post-hoc construction (mutual-kNN over sparse activation-key Jaccard similarity on CoT slices) and reports direct empirical counts such as the 85.5% multi-family statistic and 76.6% cross-family pairs; these are measurements on the resulting graph rather than quantities derived from fitted parameters or reduced to inputs by construction. Blinded annotation is invoked for validation of biconnected components as reasoning-state units, but this is an external human judgment step with no self-citation load-bearing or uniqueness theorems from prior author work. No equations or steps in the abstract or described method exhibit self-definitional loops, fitted-input predictions, ansatz smuggling, or renaming of known results. The derivation chain consists of a transparent measurement pipeline whose outputs are not equivalent to its inputs by definition.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central measurement rests on the domain assumption that activation-key Jaccard similarity plus mutual-kNN captures process similarity; no free parameters or invented entities with independent evidence are stated in the abstract.

axioms (1)

domain assumption: Mutual-kNN over sparse activation-key Jaccard similarity between CoT slices produces biconnected components that correspond to shared reasoning-state units.
Invoked to treat the graph as a measurement object for process geometry and to interpret families via blinded annotation.

invented entities (1)

process isomers: no independent evidence
purpose: Label for same-answer family-divergent correct trajectories revealed by the graph.
New descriptive term introduced from the SliceGraph analysis; no independent falsifiable handle provided in abstract.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

TraceGraph: Shared Decision Landscapes for Diagnosing and Improving Agent Trajectories
cs.AI 2026-05 unverdicted novelty 5.0

TraceGraph constructs shared state graphs from multi-model trajectories to expose productive cores and trap regions, then uses them to diagnose navigation differences across benchmarks and to drive a recovery pipeline...

Null controls (fragments of the paper's appendix, extracted as reference entries [20] to [23])

1. Degree-preserving graph rewire: preserves the degree sequence but destroys local adjacency; used for the family modularity headline (median z = 35.54).
2. Block-type-preserving rewire: preserves role counts and type marginals; used as a structural-sensitivity stress test (effect-drop only).
3. Family-label shuffle: preserves family sizes and typed-state support but destroys route-specific labels; the exported population null for family-TV (80.9% above p95, median z = 3.14).
4. Temporal-order shuffle: preserves visited typed states but destroys transition order; used for kernel, escape, return, and MFPT stress tests (effect-drop only).

Label permutation: preserves graph topology, family partition, and per-cell label count but randomises correctness association; the reward-core stress test (Table 16). Items 1 and 3 generate the exported headline nulls; items 2 and 4 are validity-ladder stress tests and should not be read as independent population claims. Label permutations tend to scatter...

Annotation guidance (fragment): use "uncertain" sparingly, only when two runs share a framework but execute it in materially different ways.
