SliceGraph: Mapping Process Isomers in Multi-Run Chain-of-Thought Reasoning
Pith reviewed 2026-06-30 20:47 UTC · model grok-4.3
The pith
Correct chain-of-thought trajectories that reach the same answer frequently belong to separate process families.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Across sampled CoT ensembles from three primary 4B/8B models on math and science benchmarks, correct CoTs sharing the same normalized answer split into multiple process families in 85.5% of 954 problem-model cells; among cells with at least two such runs, 76.6% of run pairs are cross-family on average. We call such same-answer, family-divergent correct trajectories process isomers. Blinded annotation supports SliceGraph biconnected components as shared reasoning-state units and process families as within-family strategy-coherent route units.
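To make the two headline statistics concrete, the sketch below computes them for a single problem-model cell. It assumes each correct run with the shared normalized answer already carries a process-family label; the function name `cross_family_stats` and the toy labels are illustrative, and the paper's exact counting conventions are not given on this page.

```python
from collections import Counter

def cross_family_stats(families):
    """families: one family label per correct same-answer run in a cell.

    Returns (is_multi_family, cross_pair_fraction). The fraction is None
    when the cell has fewer than two runs, since no pair exists.
    """
    n = len(families)
    is_multi = len(set(families)) > 1
    if n < 2:
        return is_multi, None
    total_pairs = n * (n - 1) // 2
    # Pairs that stay inside one family, summed over families.
    same_pairs = sum(c * (c - 1) // 2 for c in Counter(families).values())
    return is_multi, (total_pairs - same_pairs) / total_pairs

# Hypothetical cell: four correct runs spread over three families.
multi, frac = cross_family_stats(["A", "A", "B", "C"])
print(multi, round(frac, 4))
```

In this toy cell, five of the six run pairs straddle two families. If the paper averages per cell, the reported 76.6% would correspond to the mean of this fraction over cells with at least two such runs.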
What carries the argument
SliceGraph, a post-hoc problem-model-cell graph built by mutual-kNN over sparse activation-key Jaccard similarity between CoT slices, treated as a measurement object that yields biconnected components as reasoning-state units and process families as route units.
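A minimal sketch of the graph-construction step, assuming each CoT slice has already been reduced to a set of active keys. The helper names, the tie-breaking rule, and the choice to drop zero-overlap candidates are assumptions made for illustration; the paper's value of k and its sparsification scheme are not stated here.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets; defined as 0.0 when both are empty."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def mutual_knn_edges(key_sets, k):
    """Return undirected edges (i, j), i < j, where each slice is among
    the other's k most similar slices under Jaccard similarity."""
    n = len(key_sets)
    neighbours = []
    for i in range(n):
        scored = [(jaccard(key_sets[i], key_sets[j]), j) for j in range(n) if j != i]
        # Assumption: ignore candidates with no overlap, break ties by index.
        scored = [(s, j) for s, j in scored if s > 0]
        scored.sort(key=lambda t: (-t[0], t[1]))
        neighbours.append({j for _, j in scored[:k]})
    return {(i, j) for i in range(n) for j in neighbours[i]
            if i < j and i in neighbours[j]}

# Hypothetical slices, each a set of active key ids.
slices = [{1, 2, 3}, {2, 3, 4}, {3, 4, 5}, {10, 11}, {10, 11, 12}]
print(sorted(mutual_knn_edges(slices, k=1)))
print(sorted(mutual_knn_edges(slices, k=2)))
```

The mutual requirement is what makes the graph conservative: with k=1, slice 2 nominates slice 1 but is not nominated back, so no edge is drawn between them.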
If this is right
- A label-seeded reward field shows success-associated regions often split into disconnected high-value cores, with route families specializing over these footprints rather than duplicating one another.
- Typed-state transition analysis shows process families navigate the same atlas with distinct transition kernels under matched null controls.
- Representation ablations, cross-architecture replication, and cross-scale replications support the robustness of the route-family scaffold.
- Final-answer aggregation overlooks the structured multi-route process geometry revealed by the families.
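The claim about distinct transition kernels can be illustrated with a small sketch: estimate a first-order transition kernel per family from typed-state sequences, then compare kernels by total-variation distance. The state names, the function names, and the choice to average over source states seen in both families are assumptions; the paper's exact divergence measure is not described on this page.

```python
from collections import Counter, defaultdict

def transition_kernel(trajectories):
    """Row-normalised first-order transition frequencies."""
    counts = defaultdict(Counter)
    for traj in trajectories:
        for src, dst in zip(traj, traj[1:]):
            counts[src][dst] += 1
    return {s: {d: c / sum(row.values()) for d, c in row.items()}
            for s, row in counts.items()}

def kernel_tv(p, q):
    """Mean total-variation distance over source states present in both
    kernels; None if the kernels share no source state."""
    shared = set(p) & set(q)
    if not shared:
        return None
    dists = []
    for s in shared:
        targets = set(p[s]) | set(q[s])
        dists.append(0.5 * sum(abs(p[s].get(t, 0.0) - q[s].get(t, 0.0))
                               for t in targets))
    return sum(dists) / len(dists)

# Hypothetical typed-state trajectories for two families.
family_a = [["setup", "algebra", "check"], ["setup", "algebra", "check"]]
family_b = [["setup", "casework", "check"], ["setup", "algebra", "check"]]
tv = kernel_tv(transition_kernel(family_a), transition_kernel(family_b))
print(tv)
```

Here the families differ only in how they leave `setup`, so the distance is driven by that single row.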
Where Pith is reading between the lines
- The existence of process isomers suggests that sampling or decoding strategies could be designed to target different families rather than repeated draws from one dominant route.
- Reward models trained on final outcomes alone may need family-specific components to capture the disconnected high-value cores.
- The atlas-like structure with distinct kernels implies that interventions at the transition level could steer trajectories between families.
Load-bearing premise
That mutual-kNN over sparse activation-key Jaccard similarity between CoT slices produces biconnected components and process families that correspond to meaningful shared reasoning-state units, as validated by blinded annotation.
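For readers unfamiliar with the graph-theoretic unit involved, the sketch below extracts biconnected components from an undirected edge list using the standard depth-first-search edge-stack method. It is a generic textbook routine, not the authors' code; it assumes a simple graph (no self-loops or parallel edges) and uses recursion, so it suits small per-cell graphs only.

```python
def biconnected_components(n, edges):
    """Return a list of vertex sets, one per biconnected component.
    Isolated vertices belong to no component."""
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    disc = [0] * n   # discovery time; 0 means unvisited
    low = [0] * n    # lowest discovery time reachable from the subtree
    stack, comps, timer = [], [], [0]

    def dfs(u, parent):
        timer[0] += 1
        disc[u] = low[u] = timer[0]
        for v in adj[u]:
            if v == parent:
                continue
            if disc[v] == 0:
                stack.append((u, v))
                dfs(v, u)
                low[u] = min(low[u], low[v])
                if low[v] >= disc[u]:
                    # u separates v's subtree: pop one component.
                    comp = set()
                    while True:
                        a, b = stack.pop()
                        comp.update((a, b))
                        if (a, b) == (u, v):
                            break
                    comps.append(comp)
            elif disc[v] < disc[u]:
                stack.append((u, v))      # back edge to an ancestor
                low[u] = min(low[u], disc[v])

    for s in range(n):
        if disc[s] == 0:
            dfs(s, -1)
    return comps

# Two triangles joined by a bridge between vertices 2 and 3.
edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (4, 5), (3, 5)]
comps = biconnected_components(6, edges)
print(sorted(sorted(c) for c in comps))
```

A bridge forms its own two-vertex component, and a cut vertex such as 2 or 3 belongs to more than one component, which is why these units can act as shared states that several routes pass through.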
What would settle it
If blinded annotators systematically disagree with the biconnected-component groupings produced by the graph, or if a different similarity measure produces substantially lower rates of cross-family correct pairs, the mapping from graph structure to process families would not hold.
Original abstract
Multi-run chain-of-thought reasoning is usually collapsed to final-answer aggregates, which discard how sampled trajectories share, split, and rejoin through intermediate computation. We propose SliceGraph, a post-hoc problem-model-cell graph built by mutual-kNN over sparse activation-key Jaccard similarity between CoT slices, and treat it as a measurement object for process geometry rather than as a decoding program. Across sampled CoT ensembles from three primary 4B/8B models on math and science benchmarks, blinded annotation supports SliceGraph biconnected components as shared reasoning-state units and process families as within-family strategy-coherent route units. In 85.5% of 954 problem-model cells, correct CoTs sharing the same normalized answer split into multiple process families; among cells with at least two such runs, 76.6% of run pairs are cross-family on average. We call such same-answer, family-divergent correct trajectories process isomers. A label-seeded reward field provides a separate value-landscape layer: success-associated regions often split into disconnected high-value cores, and route families specialize over these core footprints rather than merely duplicating one another. A typed-state transition analysis further shows that process families navigate the same atlas with distinct transition kernels under matched null controls. Representation ablations, a cross-architecture replication, and two cross-scale replications support the robustness of the route-family scaffold, showing that final-answer aggregation overlooks this structured multi-route process geometry.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces SliceGraph, a post-hoc graph over CoT slices constructed via mutual-kNN on sparse activation-key Jaccard similarity, to identify biconnected components as shared reasoning-state units and process families as strategy-coherent routes. It reports that in 85.5% of 954 problem-model cells from three 4B/8B models on math/science benchmarks, correct same-answer CoTs split into multiple families (process isomers), with 76.6% of run pairs cross-family on average; blinded annotation is cited as validation, alongside reward-field and typed-state transition analyses showing specialization and distinct kernels, plus representation, architecture, and scale ablations.
Significance. If the process-family construction and annotation validation hold, the result demonstrates that final-answer aggregation discards substantial structured diversity in correct reasoning trajectories, with families navigating distinct transition kernels and reward cores; this could inform more granular evaluation of LLM reasoning and training objectives that target route diversity rather than answer matching alone. The cross-architecture and cross-scale replications are a strength.
major comments (3)
- [Abstract (blinded annotation support) and methods describing annotation] The central quantitative claims (85.5% multi-family cells and 76.6% cross-family pairs) rest on the claim that mutual-kNN Jaccard biconnected components correspond to meaningful reasoning-state units, which is supported solely by blinded annotation. No inter-annotator agreement, annotation guidelines, slice presentation protocol, or objective correlates (e.g., differential transition statistics or reward specialization metrics) are reported, leaving open the possibility that components reflect surface lexical overlap rather than shared process geometry.
- [SliceGraph construction and quantitative results sections] The definition of process families via biconnected components in the SliceGraph is post-hoc and metric-dependent; it is unclear how sensitive the 85.5% and 76.6% statistics are to the choice of Jaccard threshold, k in kNN, or activation-key sparsity, and no sensitivity analysis or null-model comparison for family emergence is described.
- [Typed-state transition analysis] The typed-state transition analysis claims distinct kernels under matched null controls, but without explicit description of how null controls are constructed or how kernel divergence is quantified (e.g., via specific distance on transition matrices), it is difficult to assess whether the reported specialization exceeds what would arise from random partitioning of the same trajectories.
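The comparison against random partitioning that this comment asks for can be sketched as a label-shuffle permutation test. The example below is an assumption-laden illustration: `visit_tv` is a stand-in statistic (total-variation distance between the state-visit distributions of two families labelled "A" and "B"), and the z-score and 95th-percentile summaries are generic choices, not the paper's documented procedure.

```python
import random
from collections import Counter
from statistics import mean, pstdev

def visit_tv(runs, labels):
    """TV distance between state-visit distributions of families 'A' and 'B'."""
    dist = {}
    for fam in ("A", "B"):
        counts = Counter(s for run, lab in zip(runs, labels)
                         if lab == fam for s in run)
        total = sum(counts.values())
        dist[fam] = {s: c / total for s, c in counts.items()}
    states = set(dist["A"]) | set(dist["B"])
    return 0.5 * sum(abs(dist["A"].get(s, 0.0) - dist["B"].get(s, 0.0))
                     for s in states)

def label_shuffle_null(runs, labels, statistic, n_perm=1000, seed=0):
    """Shuffle family labels across runs (family sizes are preserved) and
    return (observed, null 95th percentile, z-score or None)."""
    rng = random.Random(seed)
    observed = statistic(runs, labels)
    shuffled = list(labels)
    null = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        null.append(statistic(runs, shuffled))
    null.sort()
    p95 = null[min(n_perm - 1, (95 * n_perm) // 100)]
    sd = pstdev(null)
    z = (observed - mean(null)) / sd if sd > 0 else None
    return observed, p95, z

# Hypothetical cell: two families that differ in one intermediate state.
runs = [["s", "a", "c"]] * 4 + [["s", "b", "c"]] * 4
labels = ["A"] * 4 + ["B"] * 4
observed, p95, z = label_shuffle_null(runs, labels, visit_tv)
print(observed, p95, z)
```

Reporting the observed statistic next to the null percentile and z-score is what would let a reader judge whether family-level specialization exceeds what random partitions of the same runs produce.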
minor comments (2)
- [Results on process isomers] The abstract and results would benefit from explicit reporting of the total number of CoT runs per cell and the distribution of family sizes to allow readers to gauge the base rates underlying the 76.6% cross-family pair statistic.
- [Methods] Notation for 'normalized answer' and 'activation-key' should be defined at first use with a short formal definition or pseudocode reference.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which identify areas where additional methodological detail will improve clarity and reproducibility. We respond point-by-point below.
-
Referee: [Abstract (blinded annotation support) and methods describing annotation] The central quantitative claims (85.5% multi-family cells and 76.6% cross-family pairs) rest on the claim that mutual-kNN Jaccard biconnected components correspond to meaningful reasoning-state units, which is supported solely by blinded annotation. No inter-annotator agreement, annotation guidelines, slice presentation protocol, or objective correlates (e.g., differential transition statistics or reward specialization metrics) are reported, leaving open the possibility that components reflect surface lexical overlap rather than shared process geometry.
Authors: We agree that expanded reporting on the annotation protocol is warranted. In revision we will add a methods subsection that includes the annotation guidelines, slice presentation protocol, and inter-annotator agreement statistics. We will also report objective correlates (differential transition statistics and reward specialization metrics) that distinguish process geometry from lexical overlap, thereby addressing the concern directly. revision: yes
-
Referee: [SliceGraph construction and quantitative results sections] The definition of process families via biconnected components in the SliceGraph is post-hoc and metric-dependent; it is unclear how sensitive the 85.5% and 76.6% statistics are to the choice of Jaccard threshold, k in kNN, or activation-key sparsity, and no sensitivity analysis or null-model comparison for family emergence is described.
Authors: We will add a sensitivity analysis subsection that varies the Jaccard threshold, k, and sparsity level and reports the resulting range for the 85.5% and 76.6% statistics. We will also include a null-model comparison that quantifies family emergence against randomized baselines, thereby demonstrating robustness to the chosen parameters. revision: yes
-
Referee: [Typed-state transition analysis] The typed-state transition analysis claims distinct kernels under matched null controls, but without explicit description of how null controls are constructed or how kernel divergence is quantified (e.g., via specific distance on transition matrices), it is difficult to assess whether the reported specialization exceeds what would arise from random partitioning of the same trajectories.
Authors: We will expand the methods to specify the exact construction of the matched null controls and the distance metric used to quantify divergence between transition matrices. This will allow direct comparison against random partitioning and make the specialization claim fully evaluable. revision: yes
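The inter-annotator agreement statistics promised in the first response could take several forms. As one common option, the sketch below computes Cohen's kappa for two annotators; the choice of kappa, the function name, and the label vocabulary ("same", "diff", "uncertain") are assumptions, since the paper's annotation scheme is not specified on this page.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators over the same items.
    Returns None when chance agreement is 1 (kappa undefined)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    p_obs = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement from the two annotators' marginal label rates.
    p_exp = sum(count_a[l] * count_b[l]
                for l in set(labels_a) | set(labels_b)) / (n * n)
    if p_exp == 1.0:
        return None
    return (p_obs - p_exp) / (1.0 - p_exp)

# Hypothetical judgments on six slice pairs.
ann_1 = ["same", "same", "diff", "diff", "same", "uncertain"]
ann_2 = ["same", "diff", "diff", "diff", "same", "uncertain"]
kappa = cohen_kappa(ann_1, ann_2)
print(round(kappa, 4))
```

Raw agreement here is 5/6, but kappa is lower (17/23) because part of that agreement is expected by chance from the label marginals.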
Circularity Check
No significant circularity; empirical measurement procedure is self-contained
The paper defines SliceGraph explicitly as a post-hoc construction (mutual-kNN over sparse activation-key Jaccard similarity on CoT slices) and reports direct empirical counts such as the 85.5% multi-family statistic and 76.6% cross-family pairs; these are measurements on the resulting graph rather than quantities derived from fitted parameters or reduced to inputs by construction. Blinded annotation is invoked for validation of biconnected components as reasoning-state units, but this is an external human judgment step with no self-citation load-bearing or uniqueness theorems from prior author work. No equations or steps in the abstract or described method exhibit self-definitional loops, fitted-input predictions, ansatz smuggling, or renaming of known results. The derivation chain consists of a transparent measurement pipeline whose outputs are not equivalent to its inputs by definition.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Mutual-kNN over sparse activation-key Jaccard similarity between CoT slices produces biconnected components that correspond to shared reasoning-state units.
invented entities (1)
- process isomers: no independent evidence
Forward citations
Cited by 1 Pith paper
-
TraceGraph: Shared Decision Landscapes for Diagnosing and Improving Agent Trajectories
TraceGraph constructs shared state graphs from multi-model trajectories to expose productive cores and trap regions, then uses them to diagnose navigation differences across benchmarks and to drive a recovery pipeline...
Reference graph
Works this paper leans on
-
[1]
MathArena: Evaluating LLMs on Uncontaminated Math Competitions
Mislav Balunović, Jasper Dekoninck, Ivo Petrov, Nikola Jovanović, and Martin Vechev. MathArena: Evaluating LLMs on uncontaminated math competitions. arXiv preprint arXiv:2505.23281, 2025. doi: 10.48550/arXiv.2505.23281. URL https://arxiv.org/abs/2505.23281
-
[2]
Maciej Besta, Nils Blach, Aleš Kubicek, Robert Gerstenberger, Michał Podstawski, Lukas Gianinazzi, Joanna Gajda, Tomasz Lehmann, Hubert Niewiadomski, Piotr Nyczyk, and Torsten Hoefler. Graph of thoughts: Solving elaborate problems with large language models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 17682–17690, 202...
-
[3]
Kang Chen, Yaoning Wang, Kai Xiong, Zhuoka Feng, Wenhe Sun, Haotian Chen, and Yixin Cao. Do LLMs signal when they’re right? evidence from neuron agreement. arXiv preprint arXiv:2510.26277, 2025. doi: 10.48550/arXiv.2510.26277. URL https://arxiv.org/abs/2510.26277
-
[4]
Kang Chen, Zhuoka Feng, Sihan Zhao, Kai Xiong, Junjie Nian, Yaoning Wang, Changyi Xiao, and Yixin Cao. NEX: Neuron explore–exploit scoring for label-free chain-of-thought selection and model ranking. arXiv preprint arXiv:2602.05805, 2026. doi: 10.48550/arXiv.2602.05805. URL https://arxiv.org/abs/2602.05805
-
[5]
Dongkyu Cho, Amy B. Z. Zhang, Bilel Fehri, Sheng Wang, Rumi Chunara, Hengrui Cai, and Rui Song. Correct reasoning paths visit shared decision pivots. arXiv preprint arXiv:2509.21549, 2025. doi: 10.48550/arXiv.2509.21549. URL https://arxiv.org/abs/2509.21549
-
[6]
Hamed Damirchi, Ignacio Meza De la Jara, Ehsan Abbasnejad, Afshar Shamsi, Zhen Zhang, and Javen Shi. Truth as a trajectory: What internal representations reveal about large language model reasoning. arXiv preprint arXiv:2603.01326, 2026. doi: 10.48550/arXiv.2603.01326. URL https://arxiv.org/abs/2603.01326
-
[7]
David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. GPQA: A graduate-level google-proof Q&A benchmark. arXiv preprint arXiv:2311.12022, 2023. doi: 10.48550/arXiv.2311.12022. URL https://arxiv.org/abs/2311.12022
-
[9]
Scaling LLM test-time compute optimally can be more effective than scaling parameters for reasoning
Charlie Victor Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling LLM test-time compute optimally can be more effective than scaling parameters for reasoning. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=4FWAwZtd2n
-
[10]
LLM Reasoning as Trajectories: Step-Specific Representation Geometry and Correctness Signals
Lihao Sun, Hang Dong, Bo Qiao, Qingwei Lin, Dongmei Zhang, and Saravan Rajmohan. LLM reasoning as trajectories: Step-specific representation geometry and correctness signals. arXiv preprint arXiv:2604.05655, 2026. doi: 10.48550/arXiv.2604.05655. URL https://arxiv.org/abs/2604.05655
-
[11]
The Shape of Reasoning: Topological Analysis of Reasoning Traces in Large Language Models
Xue Wen Tan, Nathaniel Tan, Galen Lee, and Stanley Kok. The shape of reasoning: Topological analysis of reasoning traces in large language models. arXiv preprint arXiv:2510.20665, 2025. doi: 10.48550/arXiv.2510.20665. URL https://arxiv.org/abs/2510.20665
-
[12]
Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V. Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=1PL1NIMMrw
-
[13]
Zihao Wei, Liang Pang, Jiahao Liu, Wenjie Shi, Jingcheng Deng, Shicheng Xu, Zenghao Duan, Fei Sun, Huawei Shen, and Xueqi Cheng. The evolution of thought: Tracking LLM overthinking via reasoning dynamics analysis. arXiv preprint arXiv:2508.17627, 2025. doi: 10.48550/arXiv.2508.17627. URL https://arxiv.org/abs/2508.17627
-
[14]
Mapping the minds of LLMs: A graph-based analysis of reasoning LLMs
Zhen Xiong, Yujun Cai, Zhecheng Li, and Yiwei Wang. Mapping the minds of LLMs: A graph-based analysis of reasoning LLMs. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 17751–17763, Suzhou, China, 2025. Association for Computational Linguistics. doi: 10.18653/v1/2025.emnlp-main.896. URL https://aclanthology.org/2025.emnlp...
-
[15]
Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, and Karthik R. Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. In Advances in Neural Information Processing Systems, volume 36, pages 11809–11822. Curran Associates, Inc., 2023. URL https://openreview.net/forum?id=5Xc1ecxO1h
-
[16]
American invitational mathematics examination (AIME) 2024
Yifan Zhang and Math-AI Team. American invitational mathematics examination (AIME) 2024. Hugging Face dataset, 2024. URL https://huggingface.co/datasets/math-ai/aime24
-
[17]
American invitational mathematics examination (AIME) 2025
Yifan Zhang and Math-AI Team. American invitational mathematics examination (AIME) 2025. Hugging Face dataset, 2025. URL https://huggingface.co/datasets/math-ai/aime25
-
[18]
From Chains to DAGs: Probing the Graph Structure of Reasoning in LLMs
Tianjun Zhong, Linyang He, and Nima Mesgarani. From chains to DAGs: Probing the graph structure of reasoning in LLMs. arXiv preprint arXiv:2601.17593, 2026. doi: 10.48550/arXiv.2601.17593. URL https://arxiv.org/abs/2601.17593
-
[19]
Zhanke Zhou, Zhaocheng Zhu, Xuan Li, Mikhail Galkin, Xiao Feng, Sanmi Koyejo, Jian Tang, and Bo Han. Landscape of thoughts: Visualizing the reasoning process of large language models. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=XpoQ812d0A. Poster.
Appendix roadmap
• Implementation and count conventions: ...
Annotation guideline (appendix excerpt)
Use "uncertain" sparingly, only when two runs share a framework but execute it in materially different ways.
Null controls (appendix excerpt)
1. Degree-preserving graph rewire: preserves the degree sequence but destroys local adjacency; used for the family modularity headline (median z = 35.54).
2. Block-type-preserving rewire: preserves role counts and type marginals; used as a structural-sensitivity stress test (effect-drop only).
3. Family-label shuffle: preserves family sizes and typed-state support but destroys route-specific labels; the exported population null for family-TV (80.9% above p95, median z = 3.14).
4. Temporal-order shuffle: preserves visited typed states but destroys transition order; used for kernel, escape, return, and MFPT stress tests (effect-drop only).
5. Label permutation: preserves graph topology, family partition, and per-cell label count but randomises correctness association; the reward-core stress test (Table 16).
Items 1 and 3 generate the exported headline nulls; items 2 and 4 are validity-ladder stress tests and should not be read as independent population claims. Label permutations tend to scatter...
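The degree-preserving graph rewire named among the null controls is commonly implemented with repeated double-edge swaps. The sketch below shows that generic technique on a small undirected simple graph; the function name, the swap budget, and the retry limit are assumptions, and the paper's own rewiring procedure may differ.

```python
import random
from collections import Counter

def degree_preserving_rewire(edges, n_swaps, seed=0, max_tries=None):
    """Randomise an undirected simple graph by double-edge swaps.
    Each accepted swap replaces (a, b), (c, d) with (a, d), (c, b),
    which leaves every vertex degree unchanged."""
    rng = random.Random(seed)
    edge_list = [tuple(sorted(e)) for e in edges]
    edge_set = set(edge_list)
    if len(edge_list) < 2:
        return edge_set
    max_tries = max_tries or 100 * n_swaps
    done = tries = 0
    while done < n_swaps and tries < max_tries:
        tries += 1
        i, j = rng.sample(range(len(edge_list)), 2)
        (a, b), (c, d) = edge_list[i], edge_list[j]
        if rng.random() < 0.5:
            c, d = d, c                  # consider both swap orientations
        if len({a, b, c, d}) < 4:
            continue                     # swap would create a self-loop
        e1, e2 = tuple(sorted((a, d))), tuple(sorted((c, b)))
        if e1 in edge_set or e2 in edge_set:
            continue                     # swap would create a parallel edge
        edge_set -= {edge_list[i], edge_list[j]}
        edge_set |= {e1, e2}
        edge_list[i], edge_list[j] = e1, e2
        done += 1
    return edge_set

def degrees(edge_iter):
    return Counter(v for e in edge_iter for v in e)

# Two triangles joined by a bridge (hypothetical toy graph).
original = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (4, 5), (3, 5)]
rewired = degree_preserving_rewire(original, n_swaps=20)
print(sorted(rewired))
```

Because only adjacency is scrambled, any statistic that survives this null (such as family modularity) cannot be attributed to the degree sequence alone.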