arxiv: 2605.24873 · v1 · pith:SWHMLI55new · submitted 2026-05-24 · 💻 cs.CL · cs.AI · cs.LG

Towards a Universal Causal Reasoner

Pith reviewed 2026-06-30 12:21 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG
keywords causal reasoning · large language models · data generation · Pearl's Causal Ladder · supervised finetuning · faithfulness metrics · medical reasoning · legal reasoning

The pith

Finetuning LLMs on 66.6K UniCo instances yields 22.9% gains on 18 causal query types and 20.2% better faithfulness in real-world tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces UniCo, a data generation framework that creates examples for all 18 causal query types across Pearl's Causal Ladder and renders them in symbolic, code, and natural-language forms to match real-world usage. It grounds every answer in exact causal inference and removes cases with reasoning shortcuts. After supervised finetuning on the resulting 66.6K instances, three open models show large lifts on both the original query types and seven external causal benchmarks, plus markedly more faithful reasoning traces when applied to medical understanding, legal decisions, and tabular reasoning. The work therefore claims that causality-centered training produces both stronger causal performance and a broader causal mindset in general tasks.

Core claim

UniCo supplies scalable, high-quality training data that covers every rung of Pearl's Causal Ladder in multiple surface forms; supervised finetuning on this data produces models whose causal reasoning improves 22.9% on the 18 in-distribution types, 8.1% on out-of-distribution benchmarks, and 20.2% in faithfulness metrics on medical, legal, and tabular problems.

What carries the argument

UniCo, a data-generation pipeline that enumerates 18 causal query types, translates symbolic instances into code and natural language, and filters outputs using exact causal inference to eliminate shortcuts.
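
To make "exact causal inference" concrete, here is a minimal sketch that answers an associational query and an interventional query on a toy graph by brute-force enumeration. The graph (Z -> X, Z -> Y, X -> Y), the probability tables, and every function name are illustrative assumptions; none of them come from the paper, and UniCo's actual inference procedure may differ.

```python
from itertools import product

# Hypothetical toy model: Z -> X, Z -> Y, X -> Y (all variables binary).
P_Z = {0: 0.5, 1: 0.5}                      # P(Z=z)
P_X_GIVEN_Z = {0: 0.2, 1: 0.8}              # P(X=1 | Z=z)
P_Y_GIVEN_XZ = {(0, 0): 0.1, (0, 1): 0.5,   # P(Y=1 | X=x, Z=z)
                (1, 0): 0.3, (1, 1): 0.7}


def joint(z, x, y):
    """Exact joint probability P(Z=z, X=x, Y=y) from the graph factorization."""
    px = P_X_GIVEN_Z[z] if x == 1 else 1 - P_X_GIVEN_Z[z]
    py = P_Y_GIVEN_XZ[(x, z)] if y == 1 else 1 - P_Y_GIVEN_XZ[(x, z)]
    return P_Z[z] * px * py


def p_y_given_x(x):
    """Rung 1 (association): P(Y=1 | X=x), conditioning on observed X."""
    num = sum(joint(z, x, 1) for z in (0, 1))
    den = sum(joint(z, x, y) for z, y in product((0, 1), repeat=2))
    return num / den


def p_y_do_x(x):
    """Rung 2 (intervention): P(Y=1 | do(X=x)) via backdoor adjustment over Z."""
    return sum(P_Z[z] * P_Y_GIVEN_XZ[(x, z)] for z in (0, 1))


if __name__ == "__main__":
    print("P(Y=1 | X=1)     =", round(p_y_given_x(1), 4))
    print("P(Y=1 | do(X=1)) =", round(p_y_do_x(1), 4))
```

In this toy model the two quantities differ because Z confounds X and Y, which is the kind of gap a ground-truth answer computed by exact inference would capture.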

If this is right

  • Average 22.9% improvement across all 18 in-distribution causal query types after finetuning.
  • 8.1% higher performance than prior causal data frameworks on seven established out-of-distribution benchmarks.
  • 20.2% average increase in faithfulness of reasoning traces on medical, legal, and tabular tasks.
  • Causality-centered training equips models with a causal mindset that appears in general reasoning tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same data-generation approach could be applied to produce causal training sets for additional domains such as scientific hypothesis testing.
  • Models trained this way may show reduced reliance on spurious correlations even on tasks that do not explicitly mention causality.
  • Extending UniCo to generate multi-step or counterfactual chains could further test whether the causal mindset scales to longer reasoning horizons.
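
For readers unfamiliar with the counterfactual rung referenced above, the sketch below shows the standard abduction, action, prediction recipe on a tiny structural causal model. The mechanism `Y := (X and U1) or U2`, the noise probabilities, and the function names are hypothetical choices made for illustration; they are not taken from UniCo.

```python
from itertools import product

# Hypothetical SCM: Y := (X and U1) or U2, with independent exogenous noise.
P_U1 = 0.6  # P(U1 = 1), assumed for illustration
P_U2 = 0.2  # P(U2 = 1), assumed for illustration


def f_y(x, u1, u2):
    """Structural equation for Y."""
    return int((x and u1) or u2)


def counterfactual_prob(x_obs, y_obs, x_cf):
    """P(Y_{X=x_cf} = 1 | X = x_obs, Y = y_obs) by exact enumeration."""
    num = 0.0
    den = 0.0
    for u1, u2 in product((0, 1), repeat=2):
        prior = (P_U1 if u1 else 1 - P_U1) * (P_U2 if u2 else 1 - P_U2)
        # Abduction: keep only noise settings consistent with the evidence.
        if f_y(x_obs, u1, u2) != y_obs:
            continue
        den += prior
        # Action and prediction: rerun the mechanism under the hypothetical X.
        num += prior * f_y(x_cf, u1, u2)
    return num / den


if __name__ == "__main__":
    # "Given that X=1 and Y=1 were observed, would Y still be 1 had X been 0?"
    print(round(counterfactual_prob(1, 1, 0), 4))
```

Chaining several such queries would be one way to probe longer reasoning horizons, though how UniCo would generate them is not specified in the source.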

Load-bearing premise

That gains measured on the 18 query types and existing benchmarks will transfer to faithful causal reasoning in arbitrary open-ended real-world tasks without extra safeguards or domain adaptation.

What would settle it

A new collection of open-ended medical, legal, or tabular scenarios in which UniCo-trained models produce incorrect causal inferences at rates no lower than the base models.

Figures

Figures reproduced from arXiv: 2605.24873 by Chenhao Tan, Dylan Zhang, Hao Peng, Jiawei Zhang, Qirun Dai, Xiao Liu.

Figure 1. [caption not extracted]
Figure 2. Examples illustrating the three representation forms.
Figure 3. UniCo transforms small Qwen3 models into better causal reasoners than GPT-5.4-mini across the causal ladder.
Figure 4. Base model accuracy (%) on UniCo's test set by representation forms, causal levels, and difficulty. (Accompanying text from Section 4.2, "Why Diversity Matters": models are highly sensitive to both how a causal question is presented and which level of the causal ladder it resides on.)
Figure 5. Evaluation results of Qwen3-4B finetuned on different components of the training set. Each ...
Figure 6. Reasoning faithfulness scores across three real-world domains for all model–domain ...
Original abstract

Despite the importance of causal reasoning, training LLMs to reason causally remains underexplored. Existing data efforts mostly focus on benchmarking LLMs on specific aspects of causality, making them less suitable for training generalizable causal reasoners. To address this, we propose UniCo, a data generation framework that both (1) addresses 18 causal query types across Pearl's Causal Ladder and (2) translates natively symbolic examples into code and natural language forms to simulate real-world use cases where causal terms are not explicitly specified. To ensure data quality, UniCo grounds answers with exact causal inference and filters cases with reasoning shortcuts. Upon supervised finetuning with 66.6K UniCo-generated instances, Qwen3-4B, Qwen3-8B and Olmo-3-7B-Instruct achieve an average of 22.9% improvements across all 18 in-distribution query types, and 8.1% over state-of-the-art causal data generation frameworks on 7 established causal benchmarks outside the training distribution. More importantly, in real-world medical understanding, legal decision, and tabular reasoning, UniCo-trained models consistently display more faithful reasoning traces, outperforming the base models by an average of 20.2% in faithfulness metrics. These suggest that causality-centered training not only strengthens causal reasoning, but also equips LLMs with a causal mindset in general reasoning tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript proposes UniCo, a data generation framework that produces 66.6K training instances covering 18 causal query types across Pearl's Causal Ladder. Symbolic examples are translated into natural language and code to simulate real-world scenarios; answers are grounded via exact causal inference and cases with reasoning shortcuts are filtered. Supervised fine-tuning of Qwen3-4B, Qwen3-8B and Olmo-3-7B-Instruct on this data yields reported average gains of 22.9% across the 18 in-distribution query types, 8.1% over prior causal data frameworks on 7 external benchmarks, and 20.2% in faithfulness metrics on real-world medical, legal and tabular tasks.

Significance. If the data-generation claims hold, the work would supply a scalable resource for training LLMs with a causal mindset that generalizes beyond benchmarks, with direct relevance to high-stakes domains. The evaluation design (held-out benchmarks plus separate real-world tasks) avoids circularity and provides evidence of out-of-distribution transfer; the broad coverage of Pearl's ladder is a further strength.

major comments (3)
  1. [Abstract] The central claim that performance gains arise from 'exact causal inference' grounding and 'reasoning shortcut' filtering is load-bearing, yet the manuscript supplies no operational definition, algorithm, pseudocode or post-translation validation procedure for either step. Without these details the 22.9%, 8.1% and 20.2% deltas cannot be attributed to causal understanding rather than data volume or generic instruction effects.
  2. [Results] Results section (reporting on 66.6K instances and percentage improvements): All quantitative claims are presented without error bars, confidence intervals, dataset statistics (e.g., per-query-type counts), or ablation studies that isolate the filtering/grounding components from simply scaling instruction data.
  3. [Evaluation on real-world tasks] Real-world task evaluation: The assertion of 'more faithful reasoning traces' on medical, legal and tabular problems rests on faithfulness metrics whose computation, inter-annotator agreement and grounding against verifiable causal structures are not specified, undermining the claim that gains reflect a causal mindset rather than surface-level improvements.
minor comments (1)
  1. [Abstract] The abstract refers to 'state-of-the-art causal data generation frameworks' without explicit citations in the provided text; these should be listed with precise references.
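
Major comment 2 asks for error bars and confidence intervals. One common way to obtain them, sketched here under stated assumptions, is a paired percentile bootstrap over per-example correctness. The function name, the resample count, and the synthetic inputs are illustrative; the paper does not say which procedure, if any, the authors would use.

```python
import random


def paired_bootstrap_ci(base_correct, tuned_correct, n_resamples=2000,
                        alpha=0.05, seed=0):
    """Percentile bootstrap CI for accuracy(tuned) - accuracy(base).

    Both inputs are 0/1 lists scored on the same evaluation items, so each
    resample draws item indices once and applies them to both models.
    """
    assert len(base_correct) == len(tuned_correct)
    rng = random.Random(seed)
    n = len(base_correct)
    deltas = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        delta = sum(tuned_correct[i] - base_correct[i] for i in idx) / n
        deltas.append(delta)
    deltas.sort()
    lo = deltas[int((alpha / 2) * n_resamples)]
    hi = deltas[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


if __name__ == "__main__":
    # Synthetic scores for illustration only; not results from the paper.
    base = [0] * 50 + [1] * 50
    tuned = [1] * 100
    print(paired_bootstrap_ci(base, tuned))
```

Pairing by item matters because both models are scored on the same questions; resampling them independently would overstate the variance of the difference.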

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their insightful comments, which will help improve the clarity and rigor of our manuscript. We provide point-by-point responses to the major comments below and commit to revisions where appropriate.

  1. Referee: [Abstract] The central claim that performance gains arise from 'exact causal inference' grounding and 'reasoning shortcut' filtering is load-bearing, yet the manuscript supplies no operational definition, algorithm, pseudocode or post-translation validation procedure for either step. Without these details the 22.9%, 8.1% and 20.2% deltas cannot be attributed to causal understanding rather than data volume or generic instruction effects.

    Authors: We acknowledge the referee's concern regarding the lack of detailed operational definitions in the abstract. While the full manuscript describes the data generation process in Section 3, we agree that explicit algorithms and pseudocode are necessary to substantiate the claims. In the revised manuscript, we will add pseudocode for the exact causal inference grounding step, which uses the do-operator and intervention on the causal graph for each query type, and for the reasoning shortcut filtering, which identifies and removes instances where the answer is determinable via non-causal heuristics such as keyword matching. We will also include examples of post-translation validation. These changes will strengthen the link between the reported performance gains and the causal components of UniCo. revision: yes

  2. Referee: [Results] Results section (reporting on 66.6K instances and percentage improvements): All quantitative claims are presented without error bars, confidence intervals, dataset statistics (e.g., per-query-type counts), or ablation studies that isolate the filtering/grounding components from simply scaling instruction data.

    Authors: We agree that the results would benefit from additional statistical rigor and ablations. We will update the Results section to include error bars based on multiple training runs, a breakdown of the 66.6K instances by query type, and ablation experiments that compare models trained on the full UniCo dataset versus subsets without the grounding or filtering steps. This will help demonstrate that the gains are not solely due to increased data volume. revision: yes

  3. Referee: [Evaluation on real-world tasks] Real-world task evaluation: The assertion of 'more faithful reasoning traces' on medical, legal and tabular problems rests on faithfulness metrics whose computation, inter-annotator agreement and grounding against verifiable causal structures are not specified, undermining the claim that gains reflect a causal mindset rather than surface-level improvements.

    Authors: The referee raises a valid point about the specification of the faithfulness evaluation. We will revise the corresponding section to provide a detailed description of the faithfulness metric computation, including the criteria used by annotators, the inter-annotator agreement scores (which we will compute if not already reported), and how the metrics are anchored to verifiable causal structures in each domain. This will clarify that the improvements reflect enhanced causal reasoning rather than superficial changes. revision: partial
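
The third response promises inter-annotator agreement scores. A minimal sketch of one widely used statistic, Cohen's kappa for two annotators, follows. Choosing kappa is an assumption made here for illustration; neither the paper nor the simulated rebuttal names the agreement measure, and the labels below are synthetic.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label rates.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    # Synthetic faithful (1) / unfaithful (0) judgments, for illustration.
    annotator_1 = [1, 1, 1, 0, 0, 1, 0, 1]
    annotator_2 = [1, 1, 0, 0, 0, 1, 1, 1]
    print(round(cohens_kappa(annotator_1, annotator_2), 4))
```

Kappa corrects raw agreement for the agreement expected by chance, which matters when most traces receive the same label.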

Circularity Check

0 steps flagged

No significant circularity; empirical results on external benchmarks

full rationale

The paper proposes the UniCo data-generation framework, generates 66.6K instances covering 18 causal query types, performs supervised finetuning, and reports accuracy/faithfulness gains on held-out in-distribution query types plus 7 external causal benchmarks and separate real-world medical/legal/tabular tasks. All reported deltas are measured on quantities defined outside the training loop (standard benchmarks, human faithfulness annotations). No equations, fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations appear in the derivation chain. The filtering and grounding steps are described at a high level but do not reduce any performance claim to a tautology by construction. The evaluation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claim rests on the assumption that synthetic data generated from symbolic causal graphs can be filtered to remove shortcuts and will transfer to real-world faithfulness; no free parameters are explicitly fitted beyond the choice of 66.6K instances and model sizes.

axioms (2)
  • domain assumption: Supervised finetuning on synthetic causal examples improves both in-distribution causal query performance and out-of-distribution faithfulness in downstream tasks
    Invoked when claiming 22.9% and 20.2% gains after finetuning.
  • domain assumption: Exact causal inference can be used to ground answers and filter reasoning shortcuts in generated data
    Stated as the quality-control mechanism in the abstract.
invented entities (1)
  • UniCo data generation framework (no independent evidence)
    purpose: To produce training instances covering 18 causal query types in multiple surface forms
    Newly introduced method whose independent evidence is the reported performance gains.
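
The second axiom above depends on being able to detect reasoning shortcuts. One plausible reading, sketched below, is to discard any instance whose gold answer is reproduced by a cheap non-causal heuristic. The instance fields (`answer`, `p_y_given_x`), the heuristic, and the tolerance are hypothetical; the source does not define UniCo's actual filter.

```python
def filter_shortcuts(instances, heuristics, tol=1e-6):
    """Keep only instances that no cheap heuristic answers correctly."""
    kept = []
    for inst in instances:
        solved_by_shortcut = any(
            abs(h(inst) - inst["answer"]) <= tol for h in heuristics
        )
        if not solved_by_shortcut:
            kept.append(inst)
    return kept


def observational_guess(inst):
    """Hypothetical shortcut: answer an interventional query with the
    observational conditional that appears in the prompt."""
    return inst["p_y_given_x"]


if __name__ == "__main__":
    # Synthetic instances for illustration; the field names are assumptions.
    data = [
        {"id": "confounded", "answer": 0.50, "p_y_given_x": 0.62},
        {"id": "unconfounded", "answer": 0.40, "p_y_given_x": 0.40},
    ]
    print([d["id"] for d in filter_shortcuts(data, [observational_guess])])
```

Under this reading, an instance survives only if solving it requires more than reading a number off the prompt.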

pith-pipeline@v0.9.1-grok · 5791 in / 1577 out tokens · 39861 ms · 2026-06-30T12:21:08.922366+00:00 · methodology


Reference graph

Works this paper leans on

16 extracted references · 2 canonical work pages · 2 internal anchors

  1. [1]

    URL https://openreview.net/forum?id=DZjbL9BuHs. Poster. K. Lu and T. M. Lab. On-policy distillation. Thinking Machines Lab: Connectionism, 2025. doi: 10.64434/tml.20251026. https://thinkingmachines.ai/blog/on-policy-distillation. T. Olmo, A. Ettinger, A. Bertsch, B. Kuehl, D. Graham, D. Heineman, D. Groeneveld, F. Brahman, F. Timbers, H. Ivison, et al. Olmo...

  2. [2]

    Qwen3 Technical Report

    Preprint available on arXiv. K. Xiong, X. Ding, Y. Cao, Y. Yan, L. Du, Y. Zhang, J. Gao, J. Liu, B. Qin, and T. Liu. Com2: A causal-guided benchmark for exploring complex commonsense reasoning in large language models. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 16119–16140, 2...

  3. [3]

    counterfactual

    Com2 [Xiong et al., 2025] is a causal-guided benchmark for complex commonsense reasoning, where examples are constructed around causal event graphs and scenario modifications. After quality inspection, we decide to only use its “counterfactual” and “decision” subsets for their advanced verifiability, resulting in 991 examples altogether. The reported metric is F1

  4. [4]

    causal judgment

    BBEH [Kazemi et al., 2025] is a broad reasoning benchmark designed to extend BIG-Bench Hard with more difficult tasks that probe similar reasoning capabilities. We only use its causal understanding subset, which includes 200 examples: 142 under the “causal judgment” task and 58 under the “necessary and sufficient conditions” task. The reported metric is accuracy

  5. [5]

    We include both two subsets: a comprehensive V1 subset and a backdoor-only V2 subset

    CounterBench [Chen et al., 2025b] evaluates counterfactual reasoning under formal causal rules, with questions designed around diverse causal structures and counterfactual query forms. We include both two subsets: a comprehensive V1 subset and a backdoor-only V2 subset. This gives 1,200 examples altogether, and the reported metric is accuracy

  6. [6]

    We only take its test split, with 1,162 examples

    Corr2Cause [Jin et al., 2024] tests whether models can infer causal relations from correlational statements, aiming to isolate causal inference from commonsense retrieval. We only take its test split, with 1,162 examples. Due to strong label imbalance, we report the F1 metric

  7. [7]

    We take all of its examples, resulting in 10,112 examples altogether

    CLadder [Jin et al., 2023] assesses formal causal reasoning in natural language across graph-based association, intervention, and counterfactual queries. We take all of its examples, resulting in 10,112 examples altogether. The reported metric is accuracy

  8. [8]

    We only take its if-else test split in the code domain, with 500 examples altogether

    Executable Counterfactuals [Vashishtha et al., 2026] operationalizes counterfactual reasoning through executable code and math problems that require explicit counterfactual reasoning steps. We only take its if-else test split in the code domain, with 500 examples altogether. Since each question may have multiple answers, the reported metric is F1

  9. [9]

    We only take CaLM-Lite, a publicly available lightweight version

    CaLM [Chen et al., 2024] is a comprehensive causal evaluation benchmark that organizes causal targets, adaptations, metrics, and error analyses across a broad design space. We only take CaLM-Lite, a publicly available lightweight version. Moreover, we exclude subsets that use data from the other six benchmarks, leaving 3,900 examples altogether. The reporte...

  10. [10]

    For each question, the sampling budget for each model is 2

    SFT response curation. We curate SFT responses with rejection sampling based on an ensemble of three strong open-source LLMs [Zhang et al., 2025]: Qwen3-32B, Olmo-3.1-32B-Instruct, and Qwen3.5-27B. For each question, the sampling budget for each model is 2. If multiple sampled responses lead to the correct final answer, we randomly select one of them. If n...

  11. [11]

    SFT training. We use LlamaFactory [Zheng et al., 2024] for Qwen3-4B and Qwen3-8B, and use the Axolotl Framework (https://docs.axolotl.ai/) for Olmo-3-7B-Instruct. Notably, for all Qwen3 experiments throughout this work, we follow prior paradigms [Hübotter et al., 2026] by adopting the instruct mode (i.e., setting enable_thinking=False when applying the ...

  12. [12]

    medical understanding

    Evaluation. We use the vLLM framework [Kwon et al., 2023] for evaluation. Throughout all experiments in this work, we adopt temperature=0.7, top_p=0.8 for Qwen3 models and temperature=0.6, top_p=0.95 for Olmo-3 models, following their respective recommended practices. For proprietary models such as GPT-5.4-mini (Table 2), they are evaluated with no extra t...

  13. [13]

    Map each symbolic node in the causal graph to a real-world entity, as listed below: ```json <entity_interpretation_json> ```

  14. [14]

    In light of this, you should articulate the question under the provided context in a highly natural manner like a real piece of narrative

    The ultimate goal of such conversion is to make it necessary for test takers to carefully read through the natural language question in order to understand all the causal and probabilistic relationships among entities, instead of easily spotting them at first glance. In light of this, you should articulate the question under the provided context in a high...

  15. [15]

    force",

    Note that the conversion only alters how the causal question is expressed, but the underlying causal semantics must be preserved exactly. More specifically, ALL the provided causal relationships between entities and ALL the listed probability conditions MUST still occur in the converted question, so that it still has the same final numerical answer as the...

  16. [16]

    Output ONLY the converted natural-language question text itself

    You should be moderately concise and NOT verbose. Output ONLY the converted natural-language question text itself. Do NOT include any preamble, explanation, commentary, quotation marks, or markdown formatting around the question. For longer examples, we also use a three-step variant that decomposes Prompt 1 into two calls: first assign only real-world...
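
The SFT response curation step quoted in entry 10 above (rejection sampling over an ensemble, with a per-model budget of 2 and a random pick among correct responses) can be sketched as follows. The sampler callables, the answer extractor, and all names are hypothetical stand-ins for real model calls. The source text is truncated before it explains what happens when no sample is correct, so this sketch simply returns `None` in that case rather than guessing.

```python
import random


def curate_response(question, gold_answer, samplers, extract_answer,
                    budget=2, seed=0):
    """Rejection sampling: return one response whose final answer is correct.

    `samplers` is a list of callables (one per ensemble model) mapping a
    question to a response string; `extract_answer` pulls the final answer
    out of a response. Both are assumptions made for this sketch.
    """
    rng = random.Random(seed)
    correct = []
    for sample in samplers:
        for _ in range(budget):
            response = sample(question)
            if extract_answer(response) == gold_answer:
                correct.append(response)
    # Randomly select among correct responses; None if there are none
    # (the source's handling of that case is not visible here).
    return rng.choice(correct) if correct else None


if __name__ == "__main__":
    # Stub "models" that return fixed strings, for illustration only.
    stubs = [
        lambda q: "Adjust for Z first. Answer: 0.5",
        lambda q: "Read it off the table. Answer: 0.62",
    ]
    extract = lambda r: r.split("Answer: ")[-1]
    print(curate_response("P(Y=1 | do(X=1))?", "0.5", stubs, extract))
```

Filtering on the final answer alone keeps the procedure cheap, at the cost of admitting responses that reach the right number through flawed reasoning.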