Pith · machine review for the scientific record

arxiv: 2603.06635 · v2 · submitted 2026-02-24 · 💻 cs.LG

Recognition: no theorem link

Graph Property Inference in Small Language Models: Effects of Representation and Reasoning Strategy

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 20:04 UTC · model grok-4.3

classification 💻 cs.LG
keywords language models · inference · graph property · reasoning · small · across

The pith

Small instruction-tuned language models cannot reliably estimate graph-theoretic properties from textual encodings, though adjacency-list formats and multi-branch reasoning reduce errors relative to edge lists and single-path inference.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The work tests small language models on tasks that require reading a graph description and then computing properties like node degrees, shortest paths, or centrality scores. Graphs are fed to the models in two text formats: simple lists of edges or adjacency lists that group connections per node. The models are also prompted to reason either in one straight pass or by generating multiple reasoning branches before combining answers. Across several models and many graph examples, the estimates show large errors that exceed the natural variation in the true property values. The ordering of graphs by the model's estimates also matches the true ordering only weakly. Switching to adjacency-list input lowers the error and improves the ordering match. Using multiple reasoning branches gives a smaller but still positive lift. The overall pattern is that these models do not master the task without extra training, yet some input and prompting choices help more than others.
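
To make the two encodings concrete, here is a minimal sketch of how one small graph could be serialized both ways. The exact prompt wording and field layout used in the paper are not quoted here, so the formatting details below are assumptions.

    import networkx as nx

    # A toy undirected graph; the paper's graph corpus and generator are not specified here.
    G = nx.Graph([(0, 1), (0, 2), (1, 2), (2, 3)])

    def edge_list_encoding(g: nx.Graph) -> str:
        """One (u, v) pair per edge; a node's structure is scattered across the list."""
        return "Edges: " + ", ".join(f"({u}, {v})" for u, v in g.edges())

    def adjacency_list_encoding(g: nx.Graph) -> str:
        """Connections grouped per node, so each neighborhood is one contiguous span of text."""
        return "\n".join(f"Node {n}: {sorted(g.neighbors(n))}" for n in sorted(g.nodes()))

    print(edge_list_encoding(G))       # Edges: (0, 1), (0, 2), (1, 2), (2, 3)
    print(adjacency_list_encoding(G))  # Node 0: [1, 2] ... Node 3: [2]

The reported advantage of adjacency lists is consistent with the second form placing everything needed for a node-local property, such as degree, in one contiguous span rather than scattering it across the edge list.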

Core claim

Small language models fail to achieve reliable graph property estimation: normalized errors consistently exceed the intrinsic dispersion of target properties, and rank correlations remain weak across all configurations. However, the failure is structured rather than uniform. Adjacency-list encodings consistently reduce error and improve ordinal consistency relative to edge-lists, and multi-branch reasoning yields measurable aggregate gains across configurations.
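
A compact restatement of the headline metric, assuming the standard definition that the figures' NRMSE_std label suggests (the paper's exact formula is paraphrased here, not quoted):

    \mathrm{NRMSE}_{\mathrm{std}}(p) \;=\; \frac{\sqrt{\frac{1}{N}\sum_{i=1}^{N}\bigl(\hat{y}_{i} - y_{i}\bigr)^{2}}}{\sigma_{p}},
    \qquad \sigma_{p} = \text{standard deviation of property } p \text{ across the graph corpus.}

Under this reading, NRMSE_std above 1.0 means the model's estimates are, on average, further from the truth than the trivial baseline of always predicting the corpus mean of that property; that is what "errors exceed the intrinsic dispersion" cashes out to.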

Load-bearing premise

The tested models, graph properties, and textual encodings are representative of the broader space of small language models and structured reasoning tasks; the observed patterns would hold under different model families or property definitions.

Figures

Figures reproduced from arXiv: 2603.06635 by Michal Podstawski.

Figure 1
Figure 1. Macro-averaged metrics across inference strategies. (a) Range-normalized error, (b) standard-deviation-normalized error with the NRMSE_std = 1.0 reference line, and (c) Spearman rank correlation. Solid lines denote adjacency-list encoding; dashed lines denote edge-list encoding. Legend: Llama-3.2-3B, Phi-4-mini, and Qwen2.5-3B, each with Adj and Edge variants.
Figure 2
Figure 2. Inference strategy improvement relative to baseline prompting. (a) Reduction in NRMSE_std (positive values indicate lower error). (b) Increase in Spearman ρ (positive values indicate improved rank consistency).
read the original abstract

Recent progress in language modeling has expanded the range of tasks that can be approached through natural language interfaces, including problems that require structured reasoning. However, it remains unclear how effectively limited-capacity language models can infer formal properties of relational structures when those structures are presented in textual form. We conduct a systematic study of graph-theoretic property inference in small instruction-tuned language models, isolating the roles of input representation and reasoning strategy. Across a diverse set of local and global graph metrics evaluated on three models, we find that small language models fail to achieve reliable graph property estimation: normalized errors consistently exceed the intrinsic dispersion of target properties, and rank correlations remain weak across all configurations. However, the failure is structured rather than uniform. Adjacency-list encodings consistently reduce error and improve ordinal consistency relative to edge-lists, and multi-branch reasoning yields measurable aggregate gains across configurations. These results show that without task-specific fine-tuning or architectural adaptation, graph property inference in pretrained small language models remains fundamentally unreliable, but that representational organization and inference design produce consistent differences. The findings characterize the conditions under which structured inference degrades and identify which design choices yield improvements even under constrained model capacity.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript conducts a systematic empirical evaluation of graph property inference in three small instruction-tuned language models. It isolates the effects of textual graph encodings (adjacency-list versus edge-list) and reasoning strategies (single-step versus multi-branch), measuring performance via normalized error and rank correlation on a set of local and global graph metrics. The central claim is that these models exhibit fundamentally unreliable inference—normalized errors exceed intrinsic property dispersion and rank correlations remain weak—yet the unreliability is structured, with adjacency-list encodings and multi-branch reasoning yielding consistent aggregate improvements.

Significance. If the reported error and correlation patterns hold under the tested conditions, the work provides concrete evidence that small pretrained language models lack reliable capacity for structured graph reasoning without task-specific adaptation. By quantifying the benefits of specific representational and procedural choices, it supplies actionable guidance for prompt and interface design in capacity-constrained settings and highlights the gap between natural-language interfaces and formal graph tasks.

major comments (2)
  1. [Abstract] Abstract and concluding section: the assertion that graph property inference 'remains fundamentally unreliable' for pretrained small language models generalizes beyond the three evaluated models and the chosen local/global metrics. The manuscript should either qualify the scope explicitly to the tested configurations or provide additional cross-family validation to support the broader claim.
  2. [Results] Results section (error and correlation tables): the claim that normalized errors consistently exceed intrinsic dispersion is load-bearing for the unreliability conclusion, yet the exact normalization procedure, property-specific dispersion calculation, and handling of graph-size variation are not fully detailed; a sensitivity check across alternative normalizations would strengthen the result.
minor comments (2)
  1. [Methods] The description of multi-branch reasoning would benefit from an explicit algorithmic outline or pseudocode to clarify branching criteria and aggregation method (one plausible shape for such an outline is sketched after this list).
  2. [Figures] Figure captions should state the number of graphs, property definitions, and exact model sizes used in each panel to improve reproducibility.
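
To make the first minor comment concrete: a minimal sketch of one plausible multi-branch scheme, modeled on self-consistency sampling (reference [4]) with median aggregation over numeric answers. The branch count, sampling temperature, answer format, and aggregation rule are all assumptions, since the manuscript leaves them unspecified; generate is a hypothetical callable wrapping the model, not an API from the paper.

    import re
    from statistics import median

    def multi_branch_estimate(prompt: str, generate, k: int = 5) -> float:
        """Sample k independent reasoning branches, parse each numeric answer,
        and aggregate. `generate(prompt, temperature)` is a hypothetical callable
        returning one completion string from the model."""
        answers = []
        for _ in range(k):
            completion = generate(prompt, temperature=0.7)  # sampling keeps branches diverse
            match = re.search(r"ANSWER:\s*(-?\d+(?:\.\d+)?)", completion)
            if match:
                answers.append(float(match.group(1)))
        if not answers:
            raise ValueError("no parsable answer in any branch")
        # Median is robust to a single derailed branch; majority voting is the
        # usual alternative for discrete rather than scalar answers.
        return median(answers)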

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and have revised the manuscript to improve clarity and precision.

read point-by-point responses
  1. Referee: [Abstract] Abstract and concluding section: the assertion that graph property inference 'remains fundamentally unreliable' for pretrained small language models generalizes beyond the three evaluated models and the chosen local/global metrics. The manuscript should either qualify the scope explicitly to the tested configurations or provide additional cross-family validation to support the broader claim.

    Authors: We agree that the phrasing risks over-generalization. In the revised manuscript we have explicitly qualified the abstract and conclusion to the three evaluated models (Llama-3.2-3B, Phi-4-mini, Qwen2.5-3B) and the specific local and global metrics tested. We now state that the observed unreliability holds under these tested conditions and note that broader validation across model families remains an important direction for future work. revision: yes

  2. Referee: [Results] Results section (error and correlation tables): the claim that normalized errors consistently exceed intrinsic dispersion is load-bearing for the unreliability conclusion, yet the exact normalization procedure, property-specific dispersion calculation, and handling of graph-size variation are not fully detailed; a sensitivity check across alternative normalizations would strengthen the result.

    Authors: We appreciate the request for greater methodological transparency. The primary normalization divides absolute error by the standard deviation of each property across the full graph corpus; dispersion is computed identically per property. Graph-size effects are handled by reporting results in three size-stratified bins (small/medium/large). In the revision we have added a dedicated subsection in Methods that fully specifies these steps. We have also included a sensitivity analysis in the appendix that repeats the key comparisons under min-max and interquartile-range normalizations; the central finding that normalized errors exceed dispersion remains consistent. revision: yes
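
A sketch of the evaluation metrics as this response describes them. Function and variable names are illustrative, not the paper's; RMSE is used in the numerator to match the figures' NRMSE_std label, although the response phrases the primary normalization in terms of absolute error.

    import numpy as np
    from scipy.stats import spearmanr

    def normalized_error(y_true, y_pred, mode: str = "std") -> float:
        """RMSE divided by a dispersion measure of the true property values."""
        y_true = np.asarray(y_true, dtype=float)
        y_pred = np.asarray(y_pred, dtype=float)
        rmse = np.sqrt(np.mean((y_pred - y_true) ** 2))
        if mode == "std":        # primary normalization: >1.0 is worse than
            scale = y_true.std()   # always predicting the corpus mean
        elif mode == "minmax":   # sensitivity variant
            scale = y_true.max() - y_true.min()
        elif mode == "iqr":      # sensitivity variant
            q75, q25 = np.percentile(y_true, [75, 25])
            scale = q75 - q25
        else:
            raise ValueError(f"unknown mode: {mode}")
        return rmse / scale

    def rank_consistency(y_true, y_pred) -> float:
        """Spearman rho between true and predicted values (ordinal consistency)."""
        rho, _ = spearmanr(y_true, y_pred)
        return rho

Size stratification as described would then amount to computing these per small/medium/large bin rather than over the pooled corpus.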

Circularity Check

0 steps flagged

No circularity: purely empirical measurements with direct performance comparisons

full rationale

The paper reports an empirical study that measures normalized errors and rank correlations for graph property inference across fixed model configurations, input encodings (adjacency-list vs edge-list), and reasoning strategies (single-step vs multi-branch). All claims rest on experimental deltas between these configurations rather than any derivation, fitted parameter, or self-citation chain. No equations, ansatzes, or uniqueness theorems appear; the central finding that errors exceed intrinsic dispersion is a direct observational result, not a reduction to inputs by construction. The analysis is therefore self-contained, resting on direct measurement against ground-truth property values rather than external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Empirical benchmarking study with no new mathematical axioms, free parameters, or invented entities; relies on standard machine-learning evaluation assumptions.

axioms (1)
  • domain assumption: Standard assumptions in LLM benchmarking that the chosen models and metrics are representative and that textual graph encodings preserve the underlying structure.
    Invoked implicitly when generalizing from the three tested models and the selected local/global metrics to broader claims about small language models.

pith-pipeline@v0.9.0 · 5494 in / 1267 out tokens · 128663 ms · 2026-05-15T20:04:37.682726+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

23 extracted references · 23 canonical work pages · 7 internal anchors

  1. [1] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E., Le, Q., Zhou, D.: Chain-of-thought prompting elicits reasoning in large language models. In: Advances in Neural Information Processing Systems (NeurIPS), pp. 24824–24837 (2022)

  2. [2] Besta, M., Blach, N., Kubicek, A., Gerstenberger, R., Podstawski, M., Gianinazzi, L., Gajda, J., Lehmann, T., Niewiadomski, H., Nyczyk, P., Hoefler, T.: Graph of thoughts: Solving elaborate problems with large language models. In: Proceedings of the AAAI Conference on Artificial Intelligence (2024)

  3. [3] Kojima, T., Gu, S.S., Reid, M., Matsuo, Y., Iwasawa, Y.: Large language models are zero-shot reasoners. In: Advances in Neural Information Processing Systems (NeurIPS), vol. 35, pp. 22199–22213 (2022)

  4. [4] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., Zhou, D.: Self-consistency improves chain of thought reasoning in language models. In: International Conference on Learning Representations (ICLR) (2023)

  5. [5] Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., Hesse, C., Schulman, J.: Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168 (2021)

  6. [6] Srivastava, A., Rastogi, A., Rao, A., et al.: Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. arXiv preprint arXiv:2206.04615 (2022)

  7. [7] Ren, X., Tang, J., Yin, D., Chawla, N., Huang, C.: A survey of large language models for graphs. In: Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (2024)

  8. [8] Da Ros, F., Soprano, M., Di Gaspero, L., Roitero, K.: Large language models for combinatorial optimization: A systematic review. arXiv preprint arXiv:2507.03637 (2025)

  9. [9] Dai, X., Qu, H., Shen, Y., Zhang, B., Wen, Q., Fan, W., Li, D., Tang, J., Shan, C.: How do large language models understand graph patterns? A benchmark for graph pattern comprehension. In: International Conference on Learning Representations (ICLR) (2025)

  10. [10] Zhou, H., Bradley, A., Littwin, E., Razin, N., Saremi, O., Susskind, J., Bengio, S., Nakkiran, P.: What algorithms can transformers learn? A study in length generalization. In: International Conference on Learning Representations (ICLR) (2024)

  11. [11] Podstawski, M.: TinyGraphEstimator: Adapting lightweight language models for graph structure inference. arXiv preprint arXiv:2510.08808 (2025)

  12. [12] Podstawski, M.: Generalization boundaries of fine-tuned small language models for graph structural inference. arXiv preprint arXiv:2604.18092 (2026)

  13. [13] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need. In: Advances in Neural Information Processing Systems (NeurIPS), vol. 30 (2017)

  14. [14] Wu, Z., Pan, S., Chen, F., Long, G., Zhang, C., Yu, P.S.: A comprehensive survey on graph neural networks. arXiv preprint arXiv:1901.00596 (2019)

  15. [15] Xu, K., Hu, W., Leskovec, J., Jegelka, S.: How powerful are graph neural networks? arXiv preprint arXiv:1810.00826 (2018)

  16. [16] Ying, C., Cai, T., Luo, S., Zheng, S., Ke, G., He, D., Shen, Y., Liu, T.-Y.: Do transformers really perform badly for graph representation? In: Advances in Neural Information Processing Systems (NeurIPS), vol. 34 (2021)

  17. [17] Fatemi, B., Halcrow, J., Perozzi, B.: Talk like a graph: Encoding graphs for large language models. In: International Conference on Learning Representations (ICLR) (2024)

  18. [18] llm-stats.com: LLM Leaderboard. Available at: https://llm-stats.com/leaderboards/llm-leaderboard (Accessed: February 2026)

  19. [19] lm-format-enforcer. Available at: https://github.com/noamgat/lm-format-enforcer (Accessed: April 2026)

  20. [20] Llama Team: The Llama 3 herd of models. arXiv preprint arXiv:2407.21783 (2024)

  21. [21] Qwen Team: Qwen2.5 technical report. arXiv preprint arXiv:2412.15115 (2025)

  22. [22] Phi Team: Phi-4 technical report. arXiv preprint arXiv:2412.08905 (2024)

  23. [23] Anthropic: Claude. Available at: https://www.anthropic.com/claude (Accessed: April 2026)