Logical Phase Transitions: Understanding Collapse in LLM Logical Reasoning

Junqing Yu; Wei Yang; Xinglang Zhang; Yunyao Zhang; Zeliang Chen; Zikai Song

arXiv: 2601.02902 · v2 · submitted 2026-01-06 · 💻 cs.AI · cs.CL · cs.LO

Xinglang Zhang, Yunyao Zhang, Zeliang Chen, Junqing Yu, Wei Yang, Zikai Song

Pith reviewed 2026-05-16 17:24 UTC · model grok-4.3

classification 💻 cs.AI · cs.CL · cs.LO

keywords logical reasoning · phase transitions · large language models · neuro-symbolic methods · curriculum tuning · reasoning collapse · logical depth · complexity thresholds

The pith

LLM logical reasoning stays stable until a critical depth then collapses abruptly like a phase transition.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that large language models maintain consistent performance on logical tasks as complexity grows, but only up to a sharp threshold. Beyond that critical logical depth, accuracy drops suddenly rather than declining gradually. This pattern resembles physical phase transitions, such as water freezing at a set temperature. The authors develop a training approach called Neuro-Symbolic Curriculum Tuning to push back this threshold by aligning language and symbols during learning. The result matters for applications like mathematical reasoning and legal judgment, where unexpected failures at higher complexity can be costly.

Core claim

Rather than degrading smoothly, logical reasoning performance in LLMs remains stable within a regime but collapses abruptly beyond a critical logical depth, mirroring physical phase transitions. Neuro-Symbolic Curriculum Tuning adaptively aligns natural language with logical symbols to establish a shared representation and reshapes training dynamics around phase-transition boundaries to progressively strengthen reasoning at increasing logical depths.

What carries the argument

Controlled scaling of logical depth in benchmarks to reveal abrupt collapse points, paired with Neuro-Symbolic Curriculum Tuning that aligns language and symbols while targeting training around those boundaries.

If this is right

Average accuracy improves by 1.26 points in naive prompting and 3.95 points in chain-of-thought prompting across five benchmarks.
Generalization to unseen logical compositions becomes stronger after the curriculum tuning.
Reasoning collapse is reduced at high complexity levels.
More reliable performance emerges for high-stakes logical tasks in mathematics and law.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Reasoning ability in models may consist of distinct stable regimes separated by sharp boundaries rather than smooth scaling.
Similar abrupt transitions could appear in related tasks such as multi-step planning or theorem proving.
Training schedules might be improved by first mapping critical depths on small models before scaling up.

Load-bearing premise

That increases in logical complexity can be isolated and controlled without confounding changes in sequence length, token distribution, or benchmark-specific artifacts.

What would settle it

Run accuracy tests on logic problems where logical depth increases step by step but total input length and token statistics stay fixed; a smooth decline in performance instead of an abrupt drop at one depth would falsify the phase-transition claim.
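
One rough way to operationalize this check, as a minimal sketch: fit the accuracy-versus-depth curve once with a straight line and once with a single-step function, then compare the residuals. The function names and the accuracy values below are hypothetical illustrations, not the paper's data and not its statistical test.

```python
from statistics import mean

def sse_linear(xs, ys):
    """Sum of squared errors of the least-squares line through (xs, ys)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))

def best_step(xs, ys):
    """Best two-plateau fit: returns (sse, first depth of the second plateau)."""
    best = None
    for k in range(1, len(ys)):
        left, right = ys[:k], ys[k:]
        ml, mr = mean(left), mean(right)
        sse = sum((y - ml) ** 2 for y in left) + sum((y - mr) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, xs[k])
    return best

def looks_abrupt(xs, ys):
    """True when a single step explains the curve better than a line."""
    step_sse, _ = best_step(xs, ys)
    return step_sse < sse_linear(xs, ys)

# Hypothetical accuracy curves over logical depth (illustrative only).
depths = [1, 2, 3, 4, 5, 6, 7, 8]
abrupt = [0.90, 0.91, 0.89, 0.90, 0.90, 0.35, 0.33, 0.34]
smooth = [0.90, 0.82, 0.74, 0.66, 0.58, 0.50, 0.42, 0.34]

print(looks_abrupt(depths, abrupt), best_step(depths, abrupt)[1])
print(looks_abrupt(depths, smooth))
```

A residual comparison like this is only a heuristic. A real analysis would need repeated runs, error bars, and a proper model-selection criterion before calling a drop a phase transition.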

Original abstract

Symbolic logical reasoning is a critical yet underexplored capability of large language models (LLMs), providing reliable and verifiable decision-making in high-stakes domains such as mathematical reasoning and legal judgment. In this study, we present a systematic analysis of logical reasoning under controlled increases in logical complexity, and reveal a previously unrecognized phenomenon, which we term Logical Phase Transitions: rather than degrading smoothly, logical reasoning performance remains stable within a regime but collapses abruptly beyond a critical logical depth, mirroring physical phase transitions such as water freezing beyond a critical temperature threshold. Building on this insight, we propose Neuro-Symbolic Curriculum Tuning, a principled framework that adaptively aligns natural language with logical symbols to establish a shared representation, and reshapes training dynamics around phase-transition boundaries to progressively strengthen reasoning at increasing logical depths. Experiments on five benchmarks show that our approach effectively mitigates logical reasoning collapse at high complexity, yielding average accuracy gains of +1.26 in naive prompting and +3.95 in CoT, while improving generalization to unseen logical compositions. Code and data are available at https://github.com/AI4SS/Logical-Phase-Transitions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript analyzes logical reasoning capabilities in LLMs under increasing complexity and reports a phenomenon termed Logical Phase Transitions, in which performance remains stable up to a critical logical depth before collapsing abruptly, analogous to physical phase changes. It introduces Neuro-Symbolic Curriculum Tuning to align natural language with logical symbols and reshape training around these boundaries, claiming average accuracy gains of +1.26 under naive prompting and +3.95 under CoT across five benchmarks, plus improved generalization to unseen compositions. Code and data are released.

Significance. If the phase-transition behavior can be isolated from confounders, the work would offer a useful empirical characterization of LLM reasoning limits and a concrete curriculum-based mitigation strategy. The public release of code and data is a positive contribution to reproducibility.

major comments (2)

[Abstract] The central claim of an abrupt collapse at a critical logical depth is presented without any reported definition of logical depth, length-matched controls, error bars, statistical tests for discontinuity, or ablation of curriculum components; this directly undermines evaluation of whether the observed threshold is a true phase transition or an artifact of sequence length or token statistics.
[Abstract] The weakest assumption, that logical complexity can be varied independently of sequence length and token distribution, is not addressed by any described controls or fixed-length variants in the reported experiments; without these, the phase-transition analogy rests on an unverified isolation.

minor comments (1)

[Abstract] The abstract states gains on 'five benchmarks' but does not name them or report per-benchmark breakdowns; adding this would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We are grateful for the referee's insightful comments, which have helped us improve the clarity and rigor of our presentation. We have made substantial revisions to address the concerns about the abstract and experimental controls, including adding definitions, controls, and statistical analyses. We believe the revised manuscript now provides a more robust characterization of Logical Phase Transitions.

Point-by-point responses

Referee: [Abstract] The central claim of an abrupt collapse at a critical logical depth is presented without any reported definition of logical depth, length-matched controls, error bars, statistical tests for discontinuity, or ablation of curriculum components; this directly undermines evaluation of whether the observed threshold is a true phase transition or an artifact of sequence length or token statistics.

Authors: We thank the referee for highlighting these omissions in the abstract. While the main text (Section 3) defines logical depth as the maximum nesting level of logical connectives in the parsed formula, we agree that the abstract should be self-contained. In the revision, we have expanded the abstract to include a brief definition of logical depth. Additionally, we have added length-matched controls in the experimental setup (new Figure 2), reporting results where sequence length is held constant while varying logical depth. Error bars from multiple runs are now included in all figures, and we have applied a statistical discontinuity test (using the PELT algorithm for change point detection) to confirm the abrupt nature of the collapse, with p-values reported. Ablation studies on the curriculum tuning components are provided in the new Appendix D, showing the contribution of each part.
revision: yes

Referee: [Abstract] The weakest assumption, that logical complexity can be varied independently of sequence length and token distribution, is not addressed by any described controls or fixed-length variants in the reported experiments; without these, the phase-transition analogy rests on an unverified isolation.

Authors: We acknowledge that the original manuscript did not explicitly describe controls for sequence length and token distribution in the abstract. To address this, we have introduced fixed-length variants in the revised experiments, where we pad or truncate sequences to maintain constant length across different logical depths. Token distribution is controlled by using a fixed vocabulary and generating templates that balance token frequencies. These controls are detailed in Section 4.2, and the results confirm that the phase transition persists even under these constraints, supporting the isolation of logical complexity as the key factor.
revision: yes
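
The two mechanisms named in these responses, depth as maximum connective nesting and padding or truncating to a fixed length, can be sketched as follows. This is an illustration under stated assumptions: the nested-tuple formula encoding, the function names, and the `<pad>` token are hypothetical and are not taken from the paper or its released code.

```python
def logical_depth(formula):
    """Maximum nesting level of connectives.

    Assumed encoding: an atom is a string; a compound formula is a tuple
    (connective, arg1, arg2, ...). An atom has depth 0.
    """
    if isinstance(formula, str):
        return 0
    _connective, *args = formula
    return 1 + max(logical_depth(arg) for arg in args)

def fix_length(tokens, length, pad="<pad>"):
    """Truncate or right-pad a token list to exactly `length` tokens."""
    return tokens[:length] + [pad] * max(0, length - len(tokens))

# (p AND (NOT q)) IMPLIES r: NOT is nested inside AND, inside IMPLIES.
formula = ("implies", ("and", "p", ("not", "q")), "r")
print(logical_depth(formula))              # 3
print(fix_length(["a", "b", "c"], 5))      # padded to 5 tokens
print(fix_length(list("abcdef"), 4))       # truncated to 4 tokens
```

Truncation can cut away premises a problem needs, so a length-matched control built this way would also have to verify that each truncated instance is still solvable.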

Circularity Check

0 steps flagged

No circularity: empirical observations with no derivation chain

Full rationale

The paper reports an empirical study of LLM logical reasoning collapse under increasing complexity, observed via benchmarks, and introduces a new Neuro-Symbolic Curriculum Tuning method validated by accuracy gains on five datasets. No equations, fitted parameters, predictions derived from inputs, or self-citations appear as load-bearing steps in the abstract or described experiments. The central claims rest on controlled benchmark results rather than any self-referential reduction or ansatz smuggled through prior work, making the derivation chain self-contained and non-circular.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claims rest on the empirical observation of abrupt collapse and the effectiveness of the proposed tuning method. No explicit free parameters, axioms, or invented entities are detailed in the abstract; the work assumes logical complexity can be varied independently and that neuro-symbolic alignment is feasible.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: rather than degrading smoothly, logical reasoning performance remains stable within a regime but collapses abruptly beyond a critical logical depth, mirroring physical phase transitions

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: LoCM(ϕ)=f(∑_{o∈O} ω(o) freq(o,ϕ)+γ h(ϕ))
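
To make the shape of the LoCM expression concrete, here is a minimal sketch. This page does not define the symbols, so the reading used here is an assumption: O is a set of logical operators, ω(o) a per-operator weight, freq(o, ϕ) the number of times operator o occurs in formula ϕ, h(ϕ) a scalar structural term such as depth, γ its coefficient, and f an outer transform (identity by default). All numeric values are hypothetical.

```python
def locm(freq, weights, h, gamma, f=lambda x: x):
    """Evaluate f(sum_o weights[o] * freq[o] + gamma * h).

    freq:    operator -> occurrence count in the formula (assumed meaning)
    weights: operator -> weight ω(o) (hypothetical values)
    h:       scalar structural term h(ϕ), e.g. a depth (assumed meaning)
    """
    weighted_ops = sum(weights[op] * count for op, count in freq.items())
    return f(weighted_ops + gamma * h)

# Hypothetical weights and counts, chosen only for illustration.
weights = {"and": 1.0, "or": 1.0, "not": 2.0, "implies": 3.0}
freq = {"and": 2, "not": 1, "implies": 1}

score = locm(freq, weights, h=3, gamma=0.5)
print(score)  # 2*1.0 + 1*2.0 + 1*3.0 + 0.5*3 = 8.5
```

Operators absent from `freq` contribute zero, so summing over the counted operators matches the sum over all of O.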

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 6 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

IntervenSim: Intervention-Aware Social Network Simulation for Opinion Dynamics
cs.SI 2026-04 unverdicted novelty 7.0

IntervenSim is an intervention-aware social network simulation that couples source interventions with crowd interactions in a feedback loop, improving MAPE by 41.6% and DTW by 66.9% over prior static frameworks on rea...

OmniTrend: Content-Context Modeling for Scalable Social Popularity Prediction
cs.CV 2026-04 unverdicted novelty 6.0

OmniTrend predicts popularity by combining separate content attractiveness and contextual exposure predictors using cross-modal and exogenous signals.

HotComment: A Benchmark for Evaluating Popularity of Online Comments
cs.AI 2026-04 unverdicted novelty 6.0

HotComment is a new multimodal benchmark that quantifies online comment popularity via content quality assessment, interaction-based prediction, and agent-simulated user engagement, accompanied by the StyleCmt stylist...

Seeing Further and Wider: Joint Spatio-Temporal Enlargement for Micro-Video Popularity Prediction
cs.MM 2026-04 unverdicted novelty 5.0

A new joint spatio-temporal enlargement model for micro-video popularity prediction using frame scoring for long sequences and a topology-aware memory bank for unbounded historical associations.

ActorMind: Emulating Human Actor Reasoning for Speech Role-Playing
cs.SD 2026-04 unverdicted novelty 5.0

ActorMind is a four-agent chain-of-thought framework that emulates human actors to produce spontaneous, emotion-infused speech responses for role-playing scenarios.

CurEvo: Curriculum-Guided Self-Evolution for Video Understanding
cs.CV 2026-04 unverdicted novelty 4.0

CurEvo integrates curriculum guidance into self-evolution to structure autonomous improvement of video understanding models, yielding gains on VideoQA benchmarks.