pith. sign in

arxiv: 2606.29067 · v1 · pith:SZJ3QIJ3new · submitted 2026-06-27 · 💻 cs.CL

ThinkProbe: Beyond Accuracy -- Structural Profiling of Open-Ended LLM Reasoning Traces via Non-Generative Thought Graphs

Pith reviewed 2026-06-30 09:19 UTC · model grok-4.3

classification 💻 cs.CL
keywords LLM reasoning tracesThought Graphscognitive profilesstructural analysismodel comparisonnon-generative evaluationvariance analysisreasoning evaluation
0
0 comments X

The pith

LLM reasoning structure is a stable model-level property, with between-model variance exceeding between-domain variance by up to fourfold across four of five cognitive dimensions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

ThinkProbe converts open-ended LLM reasoning traces into directed Thought Graphs using rule-based segmentation and semantic linking, then extracts nineteen metrics to form a five-dimensional cognitive profile covering Breadth, Depth, Structure, Metacognition, and Efficiency. When applied to 4200 traces from seven models across ten domains, the analysis shows that differences between models are substantially larger than differences between domains in most dimensions. This indicates each model maintains a consistent reasoning style regardless of question type. The approach also shows that accuracy scores miss these qualitative distinctions in how models reason.

Core claim

By representing each reasoning trace as a Thought Graph with eight node types and six edge types, ThinkProbe derives nineteen metrics grouped into Breadth, Depth, Structure, Metacognitive, and Efficiency dimensions. Analysis of 4200 traces shows that reasoning structure is a stable, model-level property where between-model variance exceeds between-domain variance by up to fourfold in four dimensions, while Structure varies with domain, exposing model-specific cognitive profiles invisible to accuracy evaluation.

What carries the argument

The Thought Graph, a directed graph with cycles built from eight node types and six edge types via non-generative segmentation and linking, from which the nineteen metrics of the five-dimensional cognitive profile are computed.

If this is right

  • Reasoning evaluation requires structural profiles in addition to accuracy to distinguish models.
  • Each model possesses a characteristic cognitive profile that remains consistent across domains in four of five dimensions.
  • The Structure dimension alone varies meaningfully with question domain while the others do not.
  • Accuracy-based rankings overlook qualitatively different reasoning approaches across models.
  • Non-generative graph construction yields reproducible cognitive characterizations without additional model calls.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Profiles could guide selection of models whose reasoning style matches a target task's demands.
  • Changes to model training or architecture might be tracked by shifts in the same five-dimensional metrics.
  • Similar graph-based profiling could be applied to human reasoning traces for direct comparison.
  • The method opens a path to benchmarks that measure reasoning style rather than final correctness alone.

Load-bearing premise

The rule-based segmentation and discriminative semantic linking steps produce Thought Graphs whose nineteen derived metrics validly capture distinct cognitive dimensions rather than artifacts of the chosen node and edge definitions.

What would settle it

Re-running the full variance analysis after changing the definitions of the eight node types or six edge types and finding that between-model variance no longer exceeds between-domain variance by a similar margin would falsify the claim that the profiles reflect stable model properties.

Figures

Figures reproduced from arXiv: 2606.29067 by Maha Ben-Fares, Marouane Tliba, Mohamed Amine Kerkouri, Pierre Holat, Simon D. Hernandez, Yann Dauxais.

Figure 1
Figure 1. Figure 1: ThinkProbe pipeline • Depth: how deeply does it elaborate within a line of reasoning? Does it build extended chains of justification and specification? • Structure: what connective behaviors link breadth and depth? Does the model backtrack, synthesize across branches, and converge toward conclusions? • Metacognition: how self-aware is the reasoning? Does the model critique its own ideas, hedge uncertainty,… view at source ↗
Figure 2
Figure 2. Figure 2: 5D Cognitive Radar: per-model mean profiles [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Kruskal–Wallis ε 2 effect sizes for all 19 active metrics, sorted descending and colour-coded by 5D family. Dashed lines mark small (ε 2 = 0.06) and large (ε 2 = 0.14) effect thresholds. All metrics p < 0.001. main columns are near-uniform. We conclude that rea￾soning structure in these traces is a stable, model-level property: the same model reasons similarly regard￾less of whether it is asked about geopo… view at source ↗
Figure 4
Figure 4. Figure 4: Distribution of simple directed cycle counts per [PITH_FULL_IMAGE:figures/full_fig_p013_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Left: per-model cycle count distributions (log￾scale box plots). Right: fraction of traces containing at least one directed cycle; dashed line marks the overall rate (95.6%). µ annotations show per-model means. provide direct empirical justification for the directed￾graph representation. A DAG-constrained model — such as LCoT2Tree (Jiang et al., 2025) — would need to either discard these structures entirel… view at source ↗
Figure 6
Figure 6. Figure 6: Left: Distribution of simple cycle lengths (log￾scale). The bar at position 20 aggregates all cycles of length ≥ 20 (67,906 total). Median length = 14 nodes. Right: scat￾ter of graph size (|V |) vs. cycle count per trace; dashed line shows the binned median, illustrating super-linear growth. states drawing simultaneously from multiple prior seg￾ments) not iterative self-correction, which is captured indepe… view at source ↗
Figure 9
Figure 9. Figure 9: Boundary recovery rate as a function of toler [PITH_FULL_IMAGE:figures/full_fig_p017_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Boundary class distribution for EA1, EA2, [PITH_FULL_IMAGE:figures/full_fig_p017_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Inter-annotator boundary F1 (±2 sentences) per model. Dashed line shows the overall mean (0.584). GLM￾4.7-Flash scores lowest due to its numbered-list formatting, which introduces sub-boundary ambiguity not present in prose-format traces. H.6 Validity Conclusion The annotation study establishes three validity claims for the segmentation pipeline: (1) Boundary valid￾ity: pipeline boundary F1 against EA2 (0… view at source ↗
Figure 13
Figure 13. Figure 13: Five-panel heatmap of mean 5D-CP z-scores per model–domain cell (7 models [PITH_FULL_IMAGE:figures/full_fig_p019_13.png] view at source ↗
Figure 14
Figure 14. Figure 14: 19×19 Spearman ρ heatmap with Ward hierarchical clustering. Colour scale: blue = negative, red = positive correlation. ure 17. Run-to-Run Stochasticity [PITH_FULL_IMAGE:figures/full_fig_p019_14.png] view at source ↗
Figure 15
Figure 15. Figure 15: 21-pair × 19-metric heatmap of Bonferroni-corrected Dunn test adjusted p-values. Colour intensity encodes significance; *** / ** / * / ns overlaid [PITH_FULL_IMAGE:figures/full_fig_p020_15.png] view at source ↗
Figure 16
Figure 16. Figure 16: 7-model × 19-metric heatmap of mean coefficient of variation (CV) across 3 independent runs. Log1+ colour scale; green = stable, red = high variance. verbosity proxies. Graph Density (ρ ≈ −0.60) and Cross-Branch Connectivity (ρ ≈ −0.40) are nega￾tively correlated: longer traces produce structurally sparser graphs, as the denominator grows faster than the edge count. Branching Factor, Hedging Density, Expl… view at source ↗
Figure 17
Figure 17. Figure 17: 7×7 annotated cosine similarity heatmap between model mean 5D-CP vectors [PITH_FULL_IMAGE:figures/full_fig_p021_17.png] view at source ↗
Figure 18
Figure 18. Figure 18: Spearman ρ between each of the 18 active metrics (excluding Avg. Tokens itself) and Avg. Tokens across all 4,200 traces. Dashed lines at |ρ| = 0.5 [PITH_FULL_IMAGE:figures/full_fig_p021_18.png] view at source ↗
Figure 19
Figure 19. Figure 19: Scatter plot of estimated active parameter count (billions) vs. mean Avg. Tokens per model. OLS trend line [PITH_FULL_IMAGE:figures/full_fig_p021_19.png] view at source ↗
read the original abstract

We present ThinkProbe, a framework for structural analysis of LLM reasoning traces. ThinkProbe converts each trace into a Thought Graph a directed graph with cycles, 8 node types, and 6 edge types and derives a 19-metric five-dimensional cognitive profile (5D-CP: Breadth, Depth, Structure, Metacognitive, Efficiency) through a fully non-generative pipeline combining rule-based segmentation and discriminative semantic linking. Applied to 4{,}200 traces from 7 native reasoning models across 200 open-ended questions and 10 cognitive domains, ThinkProbe reveals that reasoning structure is a stable, model-level property: between-model variance exceeds between-domain variance by up to fourfold across four of five cognitive dimensions, with Structure showing genuine sensitivity to question domain, exposing qualitatively distinct cognitive profiles invisible to accuracy-based evaluation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces ThinkProbe, a non-generative framework that converts open-ended LLM reasoning traces into directed Thought Graphs (8 node types, 6 edge types) via rule-based segmentation and discriminative semantic linking. From these graphs it derives 19 metrics forming five-dimensional cognitive profiles (Breadth, Depth, Structure, Metacognitive, Efficiency). On 4,200 traces from 7 models across 200 questions in 10 domains, the analysis reports that between-model variance exceeds between-domain variance by up to fourfold on four of the five dimensions, while Structure alone shows domain sensitivity, revealing stable model-level reasoning profiles invisible to accuracy metrics.

Significance. If the 19 metrics are shown to validly index distinct cognitive dimensions rather than artifacts of the chosen graph grammar, the result would be significant for shifting LLM evaluation from scalar accuracy to structural profiling. The fully non-generative pipeline is a methodological strength that supports reproducibility and reduces circularity risk relative to generative approaches.

major comments (3)
  1. [Abstract, §3] Abstract and §3 (Thought Graph Construction): the central claim that between-model variance exceeds between-domain variance by up to 4× on 4/5 dimensions rests on the 19 metrics validly capturing cognitive dimensions. No human validation, inter-rater reliability statistics, or ablation on alternative node/edge definitions are reported for the rule-based segmentation into 8 node types and discriminative linking into 6 edge types. This is load-bearing because systematic bias in the graph grammar toward model-specific output patterns would render the variance decomposition an artifact rather than a property of reasoning.
  2. [§4, §5] §4 (Variance Decomposition) and §5 (Results): the reported fourfold ratio is presented without the explicit statistical model (e.g., variance components from a crossed mixed-effects ANOVA or hierarchical model), the exact partitioning of the 4,200 traces, or checks that the 19 metrics were pre-specified rather than selected after data inspection. These details are required to evaluate whether the domain-sensitivity finding for Structure is robust or driven by post-hoc choices.
  3. [§3.2] §3.2 (Metric Derivation): it is unclear how the 19 metrics are aggregated into the five named dimensions (5D-CP) and whether any dimensionality-reduction or factor-analytic validation was performed. Without this, the claim of five distinct cognitive profiles cannot be assessed for redundancy or construct validity.
minor comments (2)
  1. [Table 1] Table 1 or equivalent: the exact distribution of the 4,200 traces across the 7 models and 10 domains should be reported to allow readers to verify balance in the crossed design.
  2. [§3] Notation: the distinction between 'node types' and 'edge types' is introduced in the abstract but the precise rule sets for each are only summarized; a supplementary table listing all 8+6 definitions would improve clarity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the strengths of the non-generative pipeline. We address each major comment below with the strongest honest responses possible. We agree that additional clarifications on the statistical model and metric mapping are needed and will revise accordingly. For the graph grammar validation, we will expand the discussion of theoretical grounding while noting the design priorities of reproducibility.

read point-by-point responses
  1. Referee: [Abstract, §3] Abstract and §3 (Thought Graph Construction): the central claim that between-model variance exceeds between-domain variance by up to 4× on 4/5 dimensions rests on the 19 metrics validly capturing cognitive dimensions. No human validation, inter-rater reliability statistics, or ablation on alternative node/edge definitions are reported for the rule-based segmentation into 8 node types and discriminative linking into 6 edge types. This is load-bearing because systematic bias in the graph grammar toward model-specific output patterns would render the variance decomposition an artifact rather than a property of reasoning.

    Authors: The 8 node types and 6 edge types were defined deterministically from cognitive science principles distinguishing statement types (e.g., facts vs. inferences) and relations (e.g., support, contradiction) to enable fully reproducible, non-generative extraction. Human validation and ablations were not performed to avoid introducing subjectivity and to maintain the framework's objectivity across all 4,200 traces. We will revise §3 to add explicit theoretical references justifying each node/edge type and a limitations paragraph acknowledging the absence of inter-rater checks or alternative grammars as a point for future work. This does not change the core results but strengthens transparency. revision: partial

  2. Referee: [§4, §5] §4 (Variance Decomposition) and §5 (Results): the reported fourfold ratio is presented without the explicit statistical model (e.g., variance components from a crossed mixed-effects ANOVA or hierarchical model), the exact partitioning of the 4,200 traces, or checks that the 19 metrics were pre-specified rather than selected after data inspection. These details are required to evaluate whether the domain-sensitivity finding for Structure is robust or driven by post-hoc choices.

    Authors: The variance ratios derive from a crossed mixed-effects model treating model and domain as fixed effects with traces nested accordingly; the 4,200 traces comprise one trace per model-question pair across 7 models and 200 questions. All 19 metrics were pre-specified from the 5D-CP framework before data collection. We will insert the full model equation, variance component formulas, and partitioning description into a revised §4, plus a statement on pre-specification. This addresses the robustness concern directly. revision: yes

  3. Referee: [§3.2] §3.2 (Metric Derivation): it is unclear how the 19 metrics are aggregated into the five named dimensions (5D-CP) and whether any dimensionality-reduction or factor-analytic validation was performed. Without this, the claim of five distinct cognitive profiles cannot be assessed for redundancy or construct validity.

    Authors: The 19 metrics are grouped into the five dimensions via a priori theoretical mapping (e.g., node/edge counts and unique entities to Breadth; longest paths and branching to Depth) drawn from cognitive profiling literature, without any post-hoc factor analysis or dimensionality reduction. We will add an explicit mapping table to §3.2 and a paragraph explaining that the dimensions are conceptually motivated rather than empirically reduced, thereby preserving interpretability while allowing readers to assess potential overlap. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical variance ratios derived from independently constructed metrics

full rationale

The paper defines a rule-based, non-generative pipeline that segments traces into fixed node/edge types and computes 19 metrics; these metrics are then used to calculate between-model vs. between-domain variance ratios on held-out traces. No equations, parameters, or self-citations are shown that would make the reported variance ratios equivalent to the segmentation rules by construction. The central claim is an empirical observation on the output of the pipeline rather than a tautological restatement of its definitions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is supplied, so no concrete free parameters, axioms, or invented entities can be extracted; the framework implicitly assumes that the chosen 8 node types and 6 edge types form a sufficient basis for cognitive measurement.

pith-pipeline@v0.9.1-grok · 5698 in / 1186 out tokens · 25505 ms · 2026-06-30T09:19:44.995058+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

33 extracted references · 9 canonical work pages · 2 internal anchors

  1. [1]

    Proceedings of the AAAI Conference on Artificial Intelligence , volume =

    Besta, Maciej and Blach, Nils and Kubicek, Ales and Gerstenberger, Robert and Podstawski, Michal and Gianinazzi, Lukas and Gajda, Joanna and Lehmann, Tomasz and Niewiadomski, Hubert and Nyczyk, Piotr and Hoefler, Torsten , title =. Proceedings of the AAAI Conference on Artificial Intelligence , volume =. 2024 , doi =

  2. [2]

    Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing , pages=

    Mapping the Minds of LLMs: A Graph-Based Analysis of Reasoning LLMs , author=. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing , pages=

  3. [3]

    arXiv preprint arXiv:2506.02532 , year=

    Reasoningflow: Semantic structure of complex reasoning traces , author=. arXiv preprint arXiv:2506.02532 , year=

  4. [4]

    arXiv preprint arXiv:2603.07078 , year=

    CoTJudger: A Graph-Driven Framework for Automatic Evaluation of Chain-of-Thought Efficiency and Redundancy in LRMs , author=. arXiv preprint arXiv:2603.07078 , year=

  5. [5]

    and Schulz, Eric , title =

    Coda-Forno, Julian and Binz, Marcel and Wang, Jane X. and Schulz, Eric , title =. Proceedings of the 41st International Conference on Machine Learning , articleno =. 2024 , publisher =

  6. [6]

    arXiv preprint arXiv:2511.16660 , year=

    Cognitive foundations for reasoning and their manifestation in llms , author=. arXiv preprint arXiv:2511.16660 , year=

  7. [7]

    2025 , eprint =

    Dong, Jianshuo and Fu, Yaqian and Hu, Cheng and Zhang, Cheng and Qiu, Han , title =. 2025 , eprint =

  8. [8]

    Submitted to Transactions on Machine Learning Research , year=

    MetaCog-Bench: A Process-Based Benchmark for Evaluating Metacognitive Monitoring and Control in Large Language Models , author=. Submitted to Transactions on Machine Learning Research , year=

  9. [9]

    Ryan Liu, Jiayi Geng, Addison J

    Think-bench: Evaluating thinking efficiency and chain-of-thought quality of large reasoning models , author=. arXiv preprint arXiv:2505.22113 , year=

  10. [10]

    arXiv preprint arXiv:2508.13141 , year=

    Optimalthinkingbench: Evaluating over and underthinking in llms , author=. arXiv preprint arXiv:2508.13141 , year=

  11. [11]

    Find the Time 1000 Days Later

    Do LLMs Really Need 10+ Thoughts for" Find the Time 1000 Days Later"? Towards Structural Understanding of LLM Overthinking , author=. arXiv preprint arXiv:2510.07880 , year=

  12. [12]

    2025 , eprint =

    Huang, Shulin and Yang, Liang and Song, Yiyang and Chen, Siyuan and Cui, Leyang and Wan, Zhenghua and Zeng, Qin and Wen, Yimin and Shao, Kun and Zhang, Wei and Wang, Jun and Zhang, Yue , title =. 2025 , eprint =

  13. [13]

    arXiv preprint arXiv:2602.13904 , year=

    Diagnosing Pathological Chain-of-Thought in Reasoning Models , author=. arXiv preprint arXiv:2602.13904 , year=

  14. [14]

    Advances in Neural Information Processing Systems , volume =

    Wei, Jason and Wang, Xuezhi and Schuurmans, Dale and Bosma, Maarten and Ichter, Brian and Xia, Fei and Chi, Ed and Le, Quoc and Zhou, Denny , title =. Advances in Neural Information Processing Systems , volume =

  15. [15]

    and Cao, Yuan and Narasimhan, Karthik , title =

    Yao, Shunyu and Yu, Dian and Zhao, Jeffrey and Shafran, Izhak and Griffiths, Thomas L. and Cao, Yuan and Narasimhan, Karthik , title =. Advances in Neural Information Processing Systems , volume =

  16. [16]

    International Conference on Learning Representations , year =

    Wang, Xuezhi and Wei, Jason and Schuurmans, Dale and Le, Quoc and Chi, Ed and Narang, Sharan and Chowdhery, Aakanksha and Zhou, Denny , title =. International Conference on Learning Representations , year =

  17. [17]

    International Conference on Learning Representations , year =

    Golovneva, Olga and Chen, Moya and Poff, Spencer and Corredor, Martin and Zettlemoyer, Luke and Fazel-Zarandi, Maryam and Celikyilmaz, Asli , title =. International Conference on Learning Representations , year =

  18. [18]

    International Conference on Learning Representations , year =

    Lightman, Hunter and Kosaraju, Vineet and Burda, Yura and Edwards, Harri and Baker, Bowen and Lee, Teddy and Leike, Jan and Schulman, John and Sutskever, Ilya and Cobbe, Karl , title =. International Conference on Learning Representations , year =

  19. [19]

    Measuring Mathematical Problem Solving With the

    Hendrycks, Dan and Burns, Collin and Kadavath, Saurav and Arora, Akul and Basart, Steven and Tang, Eric and Song, Dawn and Steinhardt, Jacob , booktitle=. Measuring Mathematical Problem Solving With the. 2021 , url=

  20. [20]

    International Conference on Learning Representations , year =

    Hendrycks, Dan and Burns, Collin and Basart, Steven and Zou, Andy and Mazeika, Mantas and Song, Dawn and Steinhardt, Jacob , title =. International Conference on Learning Representations , year =

  21. [21]

    2021 , eprint =

    Cobbe, Karl and Kosaraju, Vineet and Bavarian, Mohammad and Chen, Mark and Jun, Heewoo and Kaiser, Lukasz and Plappert, Matthias and Tworek, Jerry and Hilton, Jacob and Nakano, Reiichiro and Hesse, Christopher and Schulman, John , title =. 2021 , eprint =

  22. [22]

    Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing , year=

    What makes a good reasoning chain? uncovering structural patterns in long chain-of-thought reasoning , author=. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing , year=

  23. [23]

    Schoenfeld's Anatomy of Mathematical Reasoning by Language Models

    Schoenfeld's Anatomy of Mathematical Reasoning by Language Models , author=. arXiv preprint arXiv:2512.19995 , year=

  24. [24]

    Phi-4-reasoning Technical Report

    Phi-4-reasoning technical report , author=. arXiv preprint arXiv:2504.21318 , year=

  25. [25]

    ArXiv , year=

    GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models , author=. ArXiv , year=

  26. [26]

    2025 , url =

    NVIDIA Nemotron 3: Efficient and Open Intelligence , author =. 2025 , url =

  27. [27]

    2025 , eprint=

    gpt-oss-120b & gpt-oss-20b Model Card , author=. 2025 , eprint=

  28. [28]

    2026 , month =

    Mistral Medium 3.5 128B , author =. 2026 , month =

  29. [29]

    2026 , month =

    Gemma 4 31B. 2026 , month =

  30. [30]

    , title =

    March, James G. , title =. Organization Science , volume =. 1991 , pages =

  31. [31]

    Sentence-

    Reimers, Nils and Gurevych, Iryna , booktitle =. Sentence-. 2019 , publisher =

  32. [32]

    2020 , url =

    Wang, Wenhui and Bao, Hangbo and Huang, Shaohan and Dong, Li and Wei, Furu , journal =. 2020 , url =

  33. [33]

    2021 , howpublished =

    sentence-transformers/all-. 2021 , howpublished =