On Semantic Loss Fine-Tuning Approach for Preventing Model Collapse in Causal Reasoning

Atirek Gupta; Pratik Deshmukh

arxiv: 2605.05438 · v1 · submitted 2026-05-06 · 💻 cs.LG · cs.AI

Pith reviewed 2026-05-08 17:15 UTC · model grok-4.3

keywords: semantic loss · model collapse · causal reasoning · transformer fine-tuning · transitivity · d-separation · logical constraints · fine-tuning stability

The pith

A semantic loss function with graph-based constraints prevents transformer models from collapsing to trivial yes/no predictions on causal reasoning tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard fine-tuning of transformers on causal reasoning tasks causes complete model collapse, where the model learns to output a constant yes or no answer no matter what the input structure shows. This produces misleadingly high accuracy scores while the model acquires no actual causal reasoning ability. The paper introduces a semantic loss that adds graph-based logical constraints and adjusts the loss weight dynamically during training. With this addition the models reach 70.4 percent accuracy on transitivity tasks and 68.6 percent on d-separation tasks while making predictions that vary with context. Large-scale tests across more than 200,000 samples and adversarial cases show the semantic loss is required to keep predictions stable and structure-dependent.

Core claim

The authors establish that adding a semantic loss function built around graph-based logical constraints together with dynamic lambda scheduling to the fine-tuning objective stops transformer models from collapsing into constant predictions, allowing them to reach context-dependent accuracies of 70.4 percent on transitivity and 68.6 percent on d-separation while standard fine-tuning yields 100 percent collapse.

What carries the argument

The semantic loss function that incorporates graph-based logical constraints to penalize violations of causal structure, modulated by dynamic lambda scheduling to balance the constraint term against the standard loss.
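A minimal sketch of how such an objective could be assembled. The abstract does not give the form of the constraint, so this assumes a soft transitivity rule over predicted edge probabilities; the function names, the hinge form of the penalty, and the inputs are all illustrative rather than the authors' implementation.

```python
import math


def cross_entropy(p_yes: float, label: int) -> float:
    """Binary cross-entropy for one Yes/No prediction (label 1 means Yes)."""
    eps = 1e-12
    p = min(max(p_yes, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -math.log(p) if label == 1 else -math.log(1.0 - p)


def transitivity_penalty(p_ab: float, p_bc: float, p_ac: float) -> float:
    """Hypothetical soft constraint: if A->B and B->C are believed,
    A->C should be believed at least as strongly as their product."""
    return max(0.0, p_ab * p_bc - p_ac)


def total_loss(p_yes, label, p_ab, p_bc, p_ac, lam):
    """Standard loss plus the lambda-weighted constraint term."""
    return cross_entropy(p_yes, label) + lam * transitivity_penalty(p_ab, p_bc, p_ac)
```

With `lam = 0` this reduces to plain cross-entropy, which is the setting the paper reports as collapsing; raising `lam` gives the constraint term more influence on the gradient.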

If this is right

Models produce stable predictions that depend on the input causal structure instead of always answering yes or no.
Accuracy on transitivity and d-separation tasks improves by 42.7 percent relative to collapsed baselines.
Adversarial structural reasoning tests show semantic-loss models holding 67-70 percent accuracy, while collapsed models swing between 43 and 71 percent depending on the test.
The necessity of semantic loss holds across five model variants and more than 200,000 evaluation samples.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Explicit logical constraints inside the loss may be needed more generally whenever transformers are fine-tuned on tasks that require multi-step inference rather than surface patterns.
Dynamic scheduling of the constraint strength could serve as a template for avoiding trivial solutions in other reasoning domains.
The method might extend to larger models if the graph constraints can be generated automatically from task descriptions.
The gap between collapsed and non-collapsed accuracy highlights that standard cross-entropy alone is insufficient to elicit causal reasoning from transformers.

Load-bearing premise

The accuracy gains come specifically from the models learning genuine causal relations enforced by the graph constraints rather than from the semantic loss merely blocking the particular trivial constant outputs observed in the baseline runs.

What would settle it

Retraining the same models with the semantic loss but with the graph-based constraints removed or replaced by unrelated logical rules and then measuring whether collapse rates return to the 100 percent baseline level would directly test whether the causal graph constraints are what prevents collapse.

Figures

Figures reproduced from arXiv: 2605.05438 by Atirek Gupta, Pratik Deshmukh.

**Figure 3.** Semantic-loss fine-tuned models: Transitivity

**Figure 4.** Pretrained Gemma-3 270M model on adversarial structural robustness tests (Irrelevant nodes, Broken chains, Long chains)

**Figure 6.** Semantic-loss fine-tuned models (Transitivity

Original abstract

Standard fine-tuning of transformer models on causal reasoning tasks leads to catastrophic model collapse, where models learn trivial solutions such as always predicting "Yes" or "No" regardless of input structure. We demonstrate that fine-tuning Gemma 270M on transitivity and d-separation tasks without semantic loss results in 100% collapse rate, with models achieving misleadingly high accuracy (73.9%) while learning no causal reasoning. We propose a semantic loss function with graph-based logical constraints and dynamic lambda scheduling that prevents this collapse. Our approach achieves 70.4% accuracy on transitivity tasks and 68.6% on d-separation tasks with stable, context-dependent predictions, representing a 42.7% improvement over collapsed baselines. Adversarial evaluation on 1,000 structural reasoning samples shows semantic models achieve 67-70% accuracy while collapsed models fail catastrophically at 43-71%. We validate our findings through comprehensive benchmarking on 200,000+ evaluation samples across five model variants, demonstrating that semantic loss is essential and not optional, for stable causal reasoning in transformers.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Semantic loss stabilizes causal fine-tuning outputs but the gains appear driven more by penalizing constant predictions than by enforcing specific logical constraints.

The paper documents a practical failure mode: fine-tuning Gemma 270M on transitivity and d-separation tasks without extra terms produces 100% collapse to constant answers, even while reporting 73.9% accuracy. The authors' semantic loss, with graph-based constraints and dynamic lambda scheduling, prevents that collapse and yields 70.4% and 68.6% accuracy on the two tasks, plus 67-70% on adversarial structural samples versus 43-71% for the collapsed baseline. The 200k+ evaluation samples across five model variants give the empirical section decent scale for this kind of study. That part is straightforward and useful for anyone who has watched small models stop reasoning after fine-tuning on logical data.

The central weakness is the lack of an ablation that removes the graph constraints and keeps only a generic penalty against uniform outputs. Standard accuracy is actually a bit lower than the collapsed baseline (70.4% and 68.6% against 73.9%), so the reported 42.7% improvement cannot come from raw accuracy and must rest on the adversarial gap. Without that control it remains possible that any sufficiently strong anti-constant regularizer would produce the same stability, leaving the claim that the method enforces causal semantics unsupported. The abstract also omits data-split details, hyperparameter search, and the exact formula behind the improvement percentage.

This work is aimed at researchers fine-tuning transformers for causal or logical tasks who already encounter collapse. A reader in that niche can extract a concrete method and numbers worth testing, though they will need to add their own ablations. It deserves peer review because the problem is real and the proposed fix has measurable effects, even if the current evidence does not yet isolate the contribution of the causal graph terms.

Referee Report

4 major / 1 minor

Summary. The manuscript claims that standard fine-tuning of transformer models, such as Gemma 270M, on causal reasoning tasks like transitivity and d-separation leads to catastrophic model collapse, characterized by 100% collapse rate and trivial constant predictions despite high accuracy (73.9%). It introduces a semantic loss function using graph-based logical constraints and dynamic lambda scheduling to prevent collapse, reporting accuracies of 70.4% on transitivity and 68.6% on d-separation tasks, a 42.7% improvement, and superior performance (67-70%) on adversarial evaluations compared to collapsed baselines (43-71%), validated on over 200,000 samples across five model variants.

Significance. Should the results hold and the gains be attributable to the enforcement of causal semantics rather than generic regularization, this work could offer a valuable technique for stable fine-tuning of LLMs on reasoning tasks. The scale of the benchmarking (200,000+ samples) and multi-variant testing provide a solid empirical foundation, highlighting semantic loss as potentially essential for avoiding collapse in causal domains.

major comments (4)

[Abstract] The 42.7% improvement over collapsed baselines is reported, but standard task accuracy (70.4%) is below the baseline's 73.9%, indicating that the improvement is entirely in the adversarial gap; the calculation of this percentage and its interpretation as evidence of causal reasoning should be clarified and justified.
[Experimental Evaluation] Details on data splits, statistical significance testing, hyperparameter search procedures, and the specific implementation of dynamic lambda scheduling are absent, which prevents verification of the reported accuracies, collapse rates, and the claim that semantic loss is 'essential and not optional'.
[Method] There is no ablation study isolating the contribution of the graph-based logical constraints from a simpler penalty on uniform or constant predictions; without this, it is unclear whether the observed stability arises from causal semantics or merely from discouraging the specific collapse modes seen in baselines.
[Adversarial Evaluation] The adversarial results on 1,000 structural reasoning samples support better performance, but the manuscript should address whether the graph constraints were optimized with knowledge of the evaluation distribution, as this could introduce circularity in the reported gains.

minor comments (1)

[Abstract] Specify the five model variants used in the benchmarking for better reproducibility.

Simulated Author's Rebuttal

4 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, indicating revisions where appropriate to improve clarity and rigor.

Referee: [Abstract] The 42.7% improvement over collapsed baselines is reported, but standard task accuracy (70.4%) is below the baseline's 73.9%, indicating that the improvement is entirely in the adversarial gap; the calculation of this percentage and its interpretation as evidence of causal reasoning should be clarified and justified.

Authors: We agree that the reported standard accuracy is marginally lower than the collapsed baseline. The 42.7% figure is the average relative improvement in adversarial accuracy, computed as the mean of ((semantic_adv_acc - baseline_adv_acc) / baseline_adv_acc) across the transitivity and d-separation tasks and model variants. This metric is chosen because raw accuracy on these tasks is misleading due to collapse; the semantic loss enables context-dependent predictions that hold under structural perturbations. We interpret the gains as evidence of causal reasoning precisely because the models avoid trivial constant outputs on adversarial samples. We will revise the abstract to explicitly define the calculation and frame the improvement in terms of robustness rather than raw accuracy.

revision: yes

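The calculation described in this response can be sketched as follows. The function name is hypothetical and the input numbers are purely illustrative; they are not results from the paper.

```python
def mean_relative_improvement(semantic_acc, baseline_acc):
    """Mean of (semantic - baseline) / baseline over paired conditions."""
    if len(semantic_acc) != len(baseline_acc) or not semantic_acc:
        raise ValueError("need two non-empty sequences of equal length")
    ratios = [(s - b) / b for s, b in zip(semantic_acc, baseline_acc)]
    return sum(ratios) / len(ratios)


# Illustrative adversarial accuracies only (one entry per task/variant pair).
semantic = [0.60, 0.75]
baseline = [0.50, 0.50]
print(mean_relative_improvement(semantic, baseline))
```

Because each term divides by the baseline accuracy, conditions where the baseline scores low contribute disproportionately, and the mean of ratios generally differs from the ratio of mean accuracies. A reader comparing against the 42.7% figure would need the per-condition accuracies to reproduce it.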
Referee: [Experimental Evaluation] Details on data splits, statistical significance testing, hyperparameter search procedures, and the specific implementation of dynamic lambda scheduling are absent, which prevents verification of the reported accuracies, collapse rates, and the claim that semantic loss is 'essential and not optional'.

Authors: We acknowledge these implementation details were omitted. In the revised version we will add an appendix section specifying: (i) data splits of 70/15/15 with no overlap between train/validation/test and stratified by causal structure; (ii) statistical significance via paired t-tests (p < 0.001) on collapse rates and accuracies over 5 random seeds; (iii) hyperparameter search via grid search over learning rates {1e-5, 3e-5, 5e-5}, batch sizes {16, 32}, and initial lambda values {0.01, 0.05, 0.1}; and (iv) dynamic lambda scheduling that increments lambda by 0.05 every 5 epochs when validation collapse rate exceeds 20%. These additions will enable full verification and support the claim that semantic loss is required for stability.

revision: yes

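Item (iv) can be read as a simple rule applied at fixed checkpoints. The sketch below is one possible reading, assuming epochs are counted from 1 and that lambda has no upper cap (the response does not say); `update_lambda` and the collapse-rate sequence are hypothetical.

```python
def update_lambda(lam, epoch, val_collapse_rate,
                  step=0.05, every=5, threshold=0.20):
    """Return the lambda to use after `epoch` (1-indexed).

    Lambda grows by `step` only at every `every`-th epoch, and only if the
    validation collapse rate is still above `threshold`.
    """
    at_checkpoint = epoch % every == 0
    if at_checkpoint and val_collapse_rate > threshold:
        return lam + step
    return lam


lam = 0.05  # one of the initial lambda values listed in item (iii)
history = []
# Made-up validation collapse rates for ten epochs.
collapse_rates = [0.9, 0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.3, 0.25, 0.1]
for epoch, collapse in enumerate(collapse_rates, start=1):
    lam = update_lambda(lam, epoch, collapse)
    history.append(round(lam, 2))
print(history)
```

In this run lambda rises once, at epoch 5, and stays put at epoch 10 because the collapse rate has fallen below the threshold. Whether the real schedule also decreases lambda or caps it is not stated.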
Referee: [Method] There is no ablation study isolating the contribution of the graph-based logical constraints from a simpler penalty on uniform or constant predictions; without this, it is unclear whether the observed stability arises from causal semantics or merely from discouraging the specific collapse modes seen in baselines.

Authors: We agree that an explicit ablation would strengthen the argument. The graph constraints encode domain-specific causal rules (transitivity, d-separation) rather than a generic anti-constant term; a simple penalty on uniform predictions would not enforce logical consistency across variable chains. Nevertheless, the current manuscript lacks this comparison. We will add an ablation study in the revised manuscript that replaces the full semantic loss with a baseline penalty on constant or uniform outputs and shows that only the graph-based version yields stable, non-collapsed behavior on both standard and adversarial sets.

revision: yes

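For concreteness, one candidate for the generic anti-constant baseline in such an ablation is a penalty on the entropy of the batch-average prediction. This is an assumption about what the control could look like, not something the paper or the rebuttal specifies; both function names are hypothetical.

```python
import math


def binary_entropy(p):
    """Entropy (in nats) of a Bernoulli variable with P(Yes) = p."""
    eps = 1e-12
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))


def anti_constant_penalty(p_yes_batch):
    """log(2) minus the entropy of the batch-mean P(Yes).

    Zero when the batch answers Yes half the time on average,
    close to log(2) when every answer is the same.
    """
    mean_p = sum(p_yes_batch) / len(p_yes_batch)
    return math.log(2.0) - binary_entropy(mean_p)
```

This term never looks at the causal graph: any balanced mix of Yes and No answers drives it to zero, including a random one. That is what makes it a useful control, since matching stability under this penalty would suggest the graph constraints are not what prevents collapse.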
Referee: [Adversarial Evaluation] The adversarial results on 1,000 structural reasoning samples support better performance, but the manuscript should address whether the graph constraints were optimized with knowledge of the evaluation distribution, as this could introduce circularity in the reported gains.

Authors: The graph constraints are derived directly from the formal definitions of transitivity and d-separation in causal graphs and are applied identically to all training, validation, and evaluation samples. They were not tuned or selected using any information from the 1,000 adversarial samples. The adversarial set perturbs causal structures in ways unseen during training, but the logical rules remain fixed and task-general. We will add a clarifying paragraph in the Method section stating that constraint formulation is independent of the evaluation distribution to eliminate any appearance of circularity.

revision: partial

Circularity Check

0 steps flagged

No circularity; empirical claims rest on experimental benchmarks rather than definitional reduction.

The paper proposes a semantic loss incorporating graph-based logical constraints and dynamic lambda scheduling, then reports accuracies (70.4% transitivity, 68.6% d-separation) and adversarial scores from 200,000+ evaluation samples across model variants. No equations, derivations, or self-citations are shown that reduce the claimed prevention of collapse or the accuracy gains to fitted inputs, renamed patterns, or prior author results by construction. The central result is presented as an empirical outcome of applying the loss during fine-tuning, with collapse demonstrated in baselines; this structure is self-contained against external benchmarks and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The approach rests on the premise that graph-derived logical constraints can be turned into a differentiable loss that enforces genuine causal reasoning rather than merely altering surface statistics; dynamic lambda is treated as a tunable schedule whose values are not derived from first principles.

free parameters (1)

lambda schedule
Dynamic weighting between standard cross-entropy and semantic loss; values chosen during training to balance the two terms.

axioms (1)

domain assumption: Graph-based logical constraints derived from transitivity and d-separation accurately encode the causal reasoning requirements of the target tasks.
Invoked when the semantic loss is defined; no independent justification supplied in the abstract.

Reference graph

Works this paper leans on

26 extracted references · 2 canonical work pages

[1] Aniket Vashishtha, Abhinav Kumar, Atharva Pandey, Abbavaram Gowtham Reddy, Kabir Ahuja, Vineeth N Balasubramanian, and Amit Sharma. Teaching Transformers Causal Reasoning through Axiomatic Training. arXiv preprint arXiv:2407.07612, 2024.

[2] Judea Pearl. Causality: Models, Reasoning, and Inference. Cambridge University Press, 2nd edition, 2009.

[3] Jingyi Xu, Zilu Zhang, Tal Friedman, Yitao Liang, and Guy Van den Broeck. A Semantic Loss Function for Deep Learning with Symbolic Knowledge. In International Conference on Machine Learning (ICML), 2018.

[4] Gemma Team, Google DeepMind. Gemma: Open Models Based on Gemini Research and Technology. Technical report, Google DeepMind, 2024.

[5] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative Adversarial Networks. In Advances in Neural Information Processing Systems (NeurIPS), pages 2672–2680, 2014.

[6] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A Simple Framework for Contrastive Learning of Visual Representations. In International Conference on Machine Learning (ICML), pages 1597–1607, 2020.

[7] Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training Language Models to Follow Instructions with Human Feedback. In Advances in Neural Information Processing Systems (NeurIPS), 2022.

[8] Jinfa Huang, Yongqi Leng, Weitong Zhang, Xinyu Yang, Xiaowu Zhang, and Dahua Lin. CLADDER: A Benchmark to Assess Causal Reasoning Capabilities of Language Models. In Advances in Neural Information Processing Systems (NeurIPS) Track on Datasets and Benchmarks, 2023.

[9] Stephanie Long, Tibor Schuster, and Alexandre Piché. Can Large Language Models Distinguish Cause from Effect? arXiv preprint arXiv:2310.17961, 2023.

A Implementation Details

A.1 Data Generation Pipeline

We implement a comprehensive synthetic data generation framework for causal reasoning tasks, consisting of two primary modules: a base generator for ...

[10] Generate a main chain of length $\ell_{\text{main}} \sim U(3,5)$ with standard parameters.

[11] Add $k \sim U(1,3)$ disconnected chains, each of length $\ell_{\text{irrel}} \sim U(2,4)$.

[12] Ensure node name uniqueness across all chains through rejection sampling (maximum 10 attempts).

[13] Query exclusively about nodes within the main chain: $v_i, v_j \in V_{\text{main}}$.

[14] Premise contains edges from all chains: $E = E_{\text{main}} \cup E_{\text{irrel},1} \cup \ldots \cup E_{\text{irrel},k}$

Example structure:
Premise: "A causes B. B causes C. X causes Y. P causes Q. Q causes R." [main] [---irrelevant chains---]
Hypothesis: "Does A cause C?"
Label: "Yes"

This tests whether models erroneously incorporate irrelevant nodes into reasoning or correctly isolate the querie...

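The irrelevant-nodes steps above can be sketched as a small generator. This is an illustration under stated assumptions, not the authors' pipeline: chain length is taken to mean the number of nodes, unique names are obtained by drawing without replacement from a shuffled pool instead of rejection sampling, and only forward queries (label "Yes") are produced. All function names are hypothetical.

```python
import random
import string


def make_chain(names):
    """Edges (v_i, v_{i+1}) along a list of node names."""
    return list(zip(names, names[1:]))


def irrelevant_nodes_sample(rng):
    """Return (premise, hypothesis, label) for one irrelevant-nodes test."""
    pool = list(string.ascii_uppercase)
    rng.shuffle(pool)                        # popping from a shuffled pool keeps names unique
    main_len = rng.randint(3, 5)             # main chain length ~ U(3,5), inclusive
    main = [pool.pop() for _ in range(main_len)]
    edges = make_chain(main)
    for _ in range(rng.randint(1, 3)):       # k ~ U(1,3) distractor chains
        irrel = [pool.pop() for _ in range(rng.randint(2, 4))]  # length ~ U(2,4)
        edges += make_chain(irrel)
    i, j = sorted(rng.sample(range(main_len), 2))  # query stays inside the main chain
    premise = " ".join(f"{a} causes {b}." for a, b in edges)
    hypothesis = f"Does {main[i]} cause {main[j]}?"
    return premise, hypothesis, "Yes"


premise, hypothesis, label = irrelevant_nodes_sample(random.Random(0))
print(premise)
print(hypothesis, label)
```

At most 5 + 3 × 4 = 17 names are drawn, so the 26-letter pool never runs out in this sketch.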
[15] Generate $k \sim U(2,3)$ completely disconnected chains.

[16] Each chain has length $\ell_i \sim U(2,4)$.

[17] Enforce strict node name disjointness: $V_i \cap V_j = \emptyset$ for $i \neq j$.

[18] Query across different components: select $v_i \in V_a$ and $v_j \in V_b$ where $a \neq b$.

[19] Label is always “No” since no path exists between disconnected components.

Example structure:
Premise: "A causes B. B causes C. X causes Y. P causes Q." [chain 1] [chain 2] [chain 3]
Hypothesis: "Does A cause Y?"
Label: "No"

This evaluates whether models incorrectly hallucinate transitive connections across graph boundaries or properly recognize component ...

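The ground-truth label for such a query is a reachability question on the premise graph. A minimal sketch, assuming premises follow the "X causes Y." sentence pattern shown in the example; `parse_premise` and `causes` are hypothetical helper names.

```python
from collections import deque


def parse_premise(premise):
    """Turn 'A causes B. B causes C.' into a list of (cause, effect) pairs."""
    edges = []
    for sentence in premise.split("."):
        words = sentence.split()
        if len(words) == 3 and words[1] == "causes":
            edges.append((words[0], words[2]))
    return edges


def causes(premise, src, dst):
    """True if a directed path src -> ... -> dst exists in the premise graph."""
    children = {}
    for a, b in parse_premise(premise):
        children.setdefault(a, []).append(b)
    seen, queue = {src}, deque([src])
    while queue:                       # breadth-first search from src
        node = queue.popleft()
        for nxt in children.get(node, []):
            if nxt == dst:
                return True
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


premise = "A causes B. B causes C. X causes Y. P causes Q."
print(causes(premise, "A", "Y"))  # False: A and Y sit in different components
print(causes(premise, "A", "C"))  # True: A -> B -> C
```

A checker like this can also validate generated samples, confirming that every cross-component query is labeled "No".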
[20] Generate sequential chains with $\ell \sim U(7,12)$, exceeding training maximum of 6.

[21] Use standard edge generation without flipping: $E = \{(v_i, v_{i+1}) \mid i \in [1, \ell-1]\}$

[22] Query endpoint causation: “Does $v_1$ cause $v_\ell$?”

[23] Label is always “Yes”, requiring $\ell-1$ transitive steps.

Example structure:
Premise: "A causes B. B causes C. C causes D. D causes E. E causes F. F causes G. G causes H. H causes I. I causes J." [9-hop chain, exceeds training max]
Hypothesis: "Does A cause J?"
Label: "Yes"

This probes compositional generalization: whether models can chain reasoning beyond train...

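The long-chain construction is deterministic once the length is chosen, so it can be sketched directly. Node naming with consecutive letters is an assumption made to match the example; in the described setup the length would be drawn from U(7,12), for instance with `random.randint(7, 12)`.

```python
def long_chain_sample(length):
    """Build a sequential chain v_1 -> ... -> v_length and query its endpoints.

    Returns (premise, hypothesis, label, hops), where hops = length - 1.
    """
    if not 2 <= length <= 26:
        raise ValueError("this sketch names nodes A..Z, so length must be 2..26")
    names = [chr(ord("A") + i) for i in range(length)]
    premise = " ".join(f"{a} causes {b}." for a, b in zip(names, names[1:]))
    hypothesis = f"Does {names[0]} cause {names[-1]}?"
    return premise, hypothesis, "Yes", length - 1


premise, hypothesis, label, hops = long_chain_sample(10)
print(hypothesis, label, hops)  # Does A cause J? Yes 9
```

With `length=10` this reproduces the 9-hop example above.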
[24] Catastrophic Collapse (V1 models):
• Transitivity V1: TN = 0 across all tasks, indicating exclusive “Yes” predictions
• D-separation V1: TP near-zero with massive FN counts, indicating exclusive “No” predictions
• These patterns are input-independent, confirming prediction bias collapse

[25] Heuristic-Based Predictions (Standard Gemma):
• Task-specific patterns (e.g., 0% branching accuracy = all “No”)
• Moderate TP/TN values with significant FP/FN errors
• Performance varies dramatically by task type

Table 7: Standard Gemma: Confusion matrices across standard evaluation tasks
Metric | Length | Branch | Rev | Shuff | LongN
True Positive | 5716 | 0 | 80 | 8...

[26] Structural Reasoning (Semantic models):
• All four values (TP/TN/FP/FN) non-zero and substantial
• TP and TN values proportional to label distributions
• Consistent error patterns across tasks, not task-specific collapse

Critical diagnostic insight: Accuracy alone cannot detect collapse. For example, D-separation V1 achieves 73.9% average accuracy (Table 2) while exhibiting severe FN bias (8,111 false negatives on length task) ...
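The diagnostic point above, that accuracy alone cannot reveal collapse, can be made concrete with a check on the confusion matrix. This is a sketch with a hypothetical `diagnose` helper and made-up counts; the paper's own collapse criterion is not given in the extracted text.

```python
def diagnose(tp, tn, fp, fn):
    """Return (accuracy, share of 'Yes' predictions, verdict) for a confusion matrix."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("empty confusion matrix")
    yes_share = (tp + fp) / total      # fraction of inputs answered "Yes"
    accuracy = (tp + tn) / total
    if yes_share == 1.0:
        verdict = "always Yes"
    elif yes_share == 0.0:
        verdict = "always No"
    else:
        verdict = "mixed"
    return accuracy, yes_share, verdict


# Made-up counts: both models score 70% accuracy, but only one has collapsed.
print(diagnose(tp=0, tn=70, fp=0, fn=30))    # (0.7, 0.0, 'always No')
print(diagnose(tp=40, tn=30, fp=10, fn=20))  # (0.7, 0.5, 'mixed')
```

Reporting the predicted-class share next to accuracy separates the two cases immediately. A threshold near 0 or 1, rather than exact equality, would be needed to catch the "TP near-zero" pattern described above.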
