
arxiv: 2505.18931 · v4 · submitted 2025-05-25 · 💻 cs.AI · cs.CL · cs.LG

Can Large Language Models Infer Causal Relationships from Real-World Text?

Pith reviewed 2026-05-19 14:07 UTC · model grok-4.3

classification 💻 cs.AI · cs.CL · cs.LG
keywords causal inference · large language models · benchmark · real-world text · causal relationships · LLM evaluation

The pith

Large language models struggle to infer causal relationships from real-world academic texts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops the first benchmark for causal relationship inference using actual academic papers rather than synthetic or simplified texts. Experiments reveal that current LLMs perform poorly on this task, with the best model achieving an average F1 score of just 0.535 across diverse text complexities and domains. A reader would care because causal inference is essential for understanding documents in science, news, and daily reasoning, and poor performance here suggests LLMs are not yet ready for reliable real-world causal analysis. The benchmark breaks down results by factors like text length and explicitness to pinpoint where improvements are needed.

Core claim

We show that LLMs face significant challenges in inferring causal relationships from real-world text. We develop a benchmark drawn from real-world academic literature, which includes diverse texts with respect to length, complexity (different levels of explicitness, number of causal events and relationships), and domain. To the best of our knowledge, our benchmark is the first-ever real-world dataset for this task. Our experiments on this dataset show that LLMs face significant challenges in inferring causal relationships from real-world text, with the best-performing model achieving an average F1 score of only 0.535.

What carries the argument

ReCITE benchmark of annotated real-world academic texts for evaluating causal inference in LLMs

If this is right

  • Current LLMs are not yet reliable for extracting causal structures from complex documents.
  • Performance is affected by text length, number of relations, and level of explicitness.
  • The new dataset enables more accurate assessment of progress in causal reasoning capabilities.
  • Insights from the analysis can guide development of better causal reasoning methods in LLMs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Improved performance on real texts could enable LLMs to assist in scientific literature review and hypothesis generation.
  • The reliance on synthetic data in prior work may have overstated LLM causal abilities.
  • Future work might explore hybrid systems combining LLMs with structured causal models to boost accuracy.

Load-bearing premise

The selected academic texts and human annotations represent a faithful sample of real-world causal inference tasks.

What would settle it

A model scoring well above 0.535 average F1 on the benchmark, or annotations that prove inconsistent on re-evaluation, would undercut the finding that LLMs struggle with this task.

Figures

Figures reproduced from arXiv: 2505.18931 by Aman Chadha, Oleg Pavlov, Raha Moraffah, Ryan Saklad.

Figure 1: An example causal graph illustrating the dif…
Figure 2: Heatmap of average model F1 scores across explicitness bins, where 100% means every node is either explicitly or implicitly mentioned in the text. This shows explicitness has a large impact on performance, and LLMs struggle to infer causality when explicit references are sparse.
Explicitness. A key challenge in real-world causal reasoning is that causal events are not always explicitly stated in the text…
Figure 3: Radar chart of domain-specific F1 scores.
Effect of Domain and Data Diversity. To assess whether semantic domain affects performance, we cluster paper embeddings using k-means (k=4, selected by silhouette score) and analyze F1 by cluster. HDBSCAN found no natural clusters, classifying all papers as noise. Cluster differences are not statistically significant (ANOVA F = 1.03, p = 0.38, η² = 0.012; Kruska…
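The ANOVA F statistic and η² quoted above can be illustrated with a small pure-Python sketch. It assumes a plain one-way ANOVA over per-paper F1 scores grouped by cluster; the function name `one_way_anova` and the demo scores are hypothetical and are not taken from the paper or its repository.

```python
from statistics import mean

def one_way_anova(groups):
    """Return (F, eta_squared) for a list of groups of scores.

    Hypothetical helper for illustration; not the authors' analysis code.
    """
    all_scores = [x for g in groups for x in g]
    grand_mean = mean(all_scores)
    k = len(groups)        # number of clusters
    n = len(all_scores)    # total number of papers
    # Variation of cluster means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of papers around their own cluster mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    f_stat = (ss_between / (k - 1)) / (ss_within / (n - k))
    # Eta squared: share of total variance explained by cluster membership.
    eta_sq = ss_between / (ss_between + ss_within)
    return f_stat, eta_sq

# Made-up per-paper F1 scores for four clusters (illustrative only).
clusters = [
    [0.50, 0.55, 0.48],
    [0.52, 0.47, 0.56],
    [0.49, 0.53, 0.51],
    [0.54, 0.46, 0.50],
]
f_stat, eta_sq = one_way_anova(clusters)
print(f"F = {f_stat:.3f}, eta^2 = {eta_sq:.3f}")
```

A small η², such as the 0.012 reported, means cluster membership accounts for very little of the variance in F1.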
Figure 5: Causal subgraph annotated with R1’s perfor…
Figure 4: Verbatim excerpts from R1’s reasoning trace.
Figure 6: Distribution of graph topology metrics. Top row: node count, edge count, density. Bottom row: cycle…
Figure 7: UMAP projection of ReCITE papers (black)…
Figure 8: F1 score by reasoning trace length quartile, computed per-model. Most models show modest performance gains with longer reasoning traces, with Claude Opus 4.5 improving from 0.50 (Q1) to 0.58 (Q4). However, Llama 3.1 8B shows no improvement across quartiles, and large error bars indicate substantial within-quartile variance for all models. This suggests that while extended reasoning provides some benefit…
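A minimal sketch of the per-model quartile binning described in this caption, assuming each sample is a (trace length, F1) pair for a single model. The function name `f1_by_length_quartile` and the demo data are hypothetical; the paper's exact binning rule is not given in this excerpt.

```python
from statistics import mean, quantiles

def f1_by_length_quartile(samples):
    """Group (trace_length, f1) pairs for ONE model into length quartiles.

    Returns a dict mapping "Q1".."Q4" to the mean F1 of that quartile.
    Hypothetical helper for illustration only.
    """
    lengths = [length for length, _ in samples]
    # Three cut points that split this model's trace lengths into four bins.
    q1, q2, q3 = quantiles(lengths, n=4)
    bins = {"Q1": [], "Q2": [], "Q3": [], "Q4": []}
    for length, f1 in samples:
        if length <= q1:
            bins["Q1"].append(f1)
        elif length <= q2:
            bins["Q2"].append(f1)
        elif length <= q3:
            bins["Q3"].append(f1)
        else:
            bins["Q4"].append(f1)
    # Skip empty bins so mean() never sees an empty list.
    return {name: mean(scores) for name, scores in bins.items() if scores}

# Made-up (trace length in tokens, F1) pairs, not taken from the paper.
demo = [(100 * i, i / 10) for i in range(1, 9)]
print(f1_by_length_quartile(demo))
```

Computing the cut points separately for each model, as the caption states, keeps a model with habitually long traces from filling the top quartile on its own.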
Figure 9: QwQ token length distribution. Performance…
Figure 10: Distribution of reasoning trace lengths (in…
Figure 13: Relationship between nodes and explicitness. This chart plots the count of ground-truth nodes for each sample against its explicitness. With a very small positive correlation (R² = 0.002), node count has minimal impact on explicitness.
I Inter-Annotator Agreement Details. To ensure the accuracy of the benchmark ground-truth graphs, we measured inter-annotator agreement by having a second annotator indepe…
Figure 11: Relationship between text length and explicitness. This shows sample character count against explicitness. There is a modest positive correlation (R² = 0.171), indicating that longer texts tend to be more explicit. This helps to explain why models perform slightly better on longer samples, as increased explicitness makes causal edges easier to identify.
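For readers who want to check an R² figure of this kind, here is a small sketch. It assumes the reported R² is the squared Pearson correlation between two per-sample quantities (for example character count and explicitness), which is what a simple linear fit would give; the excerpt does not confirm how the authors computed it. The function name `r_squared` and the data are hypothetical.

```python
from math import sqrt

def r_squared(xs, ys):
    """Squared Pearson correlation between two equal-length sequences.

    Assumption: this matches the R^2 of a simple linear regression of ys on xs.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    r = cov / sqrt(var_x * var_y)
    return r * r

# Made-up character counts and explicitness scores (illustrative only).
char_counts = [1000, 2000, 3000, 4000]
explicitness_scores = [0.4, 0.7, 0.5, 0.8]
print(round(r_squared(char_counts, explicitness_scores), 3))
```

An R² of 0.171 would mean text length explains roughly 17% of the variance in explicitness, which fits the caption's description of a modest relationship.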
Figure 14: Screenshot of the annotator form.
…incorrectly added (false positives). There were no instances of edges that had flipped directions, nor nodes that were entirely missed. This yields: $\text{Precision} = \frac{TP}{TP + FP} = \frac{857}{857 + 5} = 0.994$, $\text{Recall} = \frac{TP}{TP + FN} = \frac{857}{857 + 22} = 0.975$, F1 = 0.984, SHD = 27, κ = 0.99. Cohen’s κ statistic reflects near-perfect agreement at the edge level and is defined as: κ = p_o − …
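The agreement figures quoted above can be reproduced from the three edge counts. The sketch below assumes that, with no flipped edges (as the text states), the structural Hamming distance is the number of extra edges plus the number of missing edges, which is consistent with 27 = 5 + 22. The function name `agreement_metrics` is hypothetical.

```python
def agreement_metrics(tp, fp, fn):
    """Edge-level precision, recall, F1 and SHD from raw counts.

    Assumption: no flipped edges, so SHD = extra edges + missing edges.
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    shd = fp + fn
    return {"precision": precision, "recall": recall, "f1": f1, "shd": shd}

# Counts taken from the text above: 857 matching, 5 extra, 22 missing edges.
metrics = agreement_metrics(tp=857, fp=5, fn=22)
for name, value in metrics.items():
    print(name, round(value, 3))
```

Cohen's κ is left out because it also needs the chance-agreement term, and the excerpt is cut off before defining it.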
Figure 16: Average model F1 scores across strict explicitness bins, where 100% means every node is explicitly mentioned in the text and 0% means no nodes are explicitly mentioned.
…$1000. This affordability is largely attributed to the use of prompt caching for the LLM judge. While the initial processing of the lengthy source texts incurs a significant input token cost for the judge, this cost is a one-time expense…
Figure 15: Level of Explicitness_strict under the strict criterion. For each node v in a sample V, we assign a score of 1 if it is explicitly mentioned (E) and 0 if it is either implicit (I) or absent (A), then average over all nodes in the sample.
Under this alternative measure of explicitness, the “strict” definition is expected to be more difficult, as it counts only directly mentioned nodes toward a sample’s ex…
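The strict score described in this caption is a per-sample average of node indicators. The sketch below also includes a lenient variant that counts implicit nodes, inferred from the Figure 2 caption ("either explicitly or implicitly mentioned"); treat that variant as an assumption. The function name `explicitness` and the node labels are hypothetical.

```python
def explicitness(labels, strict=False):
    """Fraction of nodes in one sample that count as mentioned.

    labels maps node name -> "E" (explicit), "I" (implicit) or "A" (absent).
    strict=True counts only "E"; strict=False counts "E" and "I"
    (the lenient rule is an assumption based on the Figure 2 caption).
    """
    counted = {"E"} if strict else {"E", "I"}
    return sum(1 for label in labels.values() if label in counted) / len(labels)

# Made-up node labels for a single sample (illustrative only).
sample = {"rainfall": "E", "soil moisture": "I", "crop yield": "E", "policy": "A"}
print(explicitness(sample, strict=True))
print(explicitness(sample, strict=False))
```

Because the strict rule drops implicit nodes, a sample's strict score can never exceed its lenient score.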
Figure 17: Example output from Qwen-2.5-7B, a base model. It outputs irrelevant foreign-language text instead of producing a valid causal graph.
We list some other interesting or informative outputs from the base model as a reference. Another Endlessly Repeating Base Model Output: <think> editText editText <think> editText editText <think> editText editText <think> editText. In rare cases, the model generated vali…
Figure 18: Top: Subgraph generated by a domain expert,…
Original abstract

Understanding and inferring causal relationships from texts is a core aspect of human cognition and is essential for advancing large language models (LLMs) towards artificial general intelligence. Existing work evaluating LLM causal reasoning primarily relies on synthetic or simplified texts with explicitly stated causal relationships. These texts typically feature short passages and few causal relations, failing to reflect the complexities of real-world reasoning. In this paper, we investigate whether LLMs are capable of inferring causal relationships from real-world texts. We develop a benchmark drawn from real-world academic literature, which includes diverse texts with respect to length, complexity (different levels of explicitness, number of causal events and relationships), and domain. To the best of our knowledge, our benchmark is the first-ever real-world dataset for this task. Our experiments on this dataset show that LLMs face significant challenges in inferring causal relationships from real-world text, with the best-performing model achieving an average F$_1$ score of only 0.535. Through systematic analysis across aspects of real-world text (explicitness, number of causal events and relationships, length of text, domain), our benchmark offers targeted insights for further research into advancing LLM causal reasoning. Our code and dataset can be found at https://github.com/Ryan-Saklad/ReCITE.
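To make the headline number concrete, here is a minimal sketch of how an average F1 over causal graphs could be computed. It assumes each graph is a set of directed (cause, effect) pairs and that edges match only on exact equality. That is a deliberate simplification: the figure captions mention an LLM judge, so the paper's matching is likely more lenient. The function names and demo graphs are hypothetical.

```python
def edge_f1(predicted, gold):
    """F1 between two sets of directed (cause, effect) edges, exact match only."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)

def average_f1(samples):
    """Mean of per-sample F1 over (predicted_edges, gold_edges) pairs."""
    return sum(edge_f1(pred, gold) for pred, gold in samples) / len(samples)

# Made-up graphs (illustrative only). Direction matters: the reversed edge
# ("yield", "rain") does not match the gold edge ("rain", "yield").
gold_graph = {("drought", "rain"), ("rain", "yield")}
predicted_graph = {("drought", "rain"), ("yield", "rain")}
print(edge_f1(predicted_graph, gold_graph))
print(average_f1([(predicted_graph, gold_graph), (gold_graph, gold_graph)]))
```

Averaging per-sample scores, rather than pooling all edges, gives every text equal weight regardless of how many causal relations it contains.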

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper constructs a new benchmark dataset from real-world academic literature to evaluate LLMs on inferring causal relationships in texts that vary in length, explicitness, number of causal events/relations, and domain. Experiments across multiple LLMs report that the best-performing model reaches only an average F1 score of 0.535, with further breakdowns showing performance variation by text properties. The authors release the code and dataset and position the work as the first real-world benchmark for this task.

Significance. If the benchmark construction and evaluation protocol are sound, the result would demonstrate that current LLMs still struggle with causal inference on authentic, complex texts rather than only on synthetic examples. The open release of code and data is a clear strength that supports reproducibility and allows the community to extend or re-analyze the benchmark.

major comments (2)
  1. [Section 3 (Benchmark Construction)] The paper does not report inter-annotator agreement for the human annotations of causal events and relations. Without IAA metrics, it is impossible to determine whether the ground-truth labels contain substantial ambiguity or noise, which would undermine the interpretation of the 0.535 F1 as evidence of LLM limitations rather than annotation difficulty.
  2. [Section 4 (Experiments and Analysis)] No human expert performance baseline is provided on the same extraction task. The central claim that LLMs 'face significant challenges' therefore lacks calibration; an F1 of 0.535 could reflect inherent task difficulty in long academic texts with implicit relations rather than model-specific shortcomings. Adding a human baseline would directly test this.
minor comments (1)
  1. [Abstract] The statement that the benchmark is 'the first-ever real-world dataset' should be accompanied by a brief comparison to any prior causal extraction datasets from scientific text to strengthen the novelty claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and will revise the paper to incorporate the suggested improvements where appropriate.

Point-by-point responses
  1. Referee: [Section 3 (Benchmark Construction)] The paper does not report inter-annotator agreement for the human annotations of causal events and relations. Without IAA metrics, it is impossible to determine whether the ground-truth labels contain substantial ambiguity or noise, which would undermine the interpretation of the 0.535 F1 as evidence of LLM limitations rather than annotation difficulty.

    Authors: We agree that reporting inter-annotator agreement is important for validating annotation quality. The manuscript describes the annotation protocol and guidelines in Section 3 but does not include quantitative IAA statistics. In the revision we will add Cohen's kappa and percentage agreement figures computed on the annotations, which demonstrate substantial agreement. This will strengthen the claim that the 0.535 F1 reflects LLM limitations rather than label noise.
    Revision: yes

  2. Referee: [Section 4 (Experiments and Analysis)] No human expert performance baseline is provided on the same extraction task. The central claim that LLMs 'face significant challenges' therefore lacks calibration; an F1 of 0.535 could reflect inherent task difficulty in long academic texts with implicit relations rather than model-specific shortcomings. Adding a human baseline would directly test this.

    Authors: We acknowledge that a human baseline would help calibrate task difficulty. Our experiments focus on LLM performance, but we agree a direct comparison is valuable. In the revised version we will report human expert performance on the causal extraction task for a representative subset of the benchmark, enabling readers to interpret the 0.535 F1 score relative to human-level results on the same real-world texts.
    Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmark evaluation is self-contained

full rationale

The paper constructs a new dataset from real-world academic texts and reports empirical LLM performance metrics (F1 scores) on causal inference extraction. No derivation chain, equations, or predictions are present that reduce to fitted inputs, self-definitions, or author self-citations by construction. The central claim rests on observed results from the benchmark rather than any load-bearing self-referential step or renamed known result. The benchmark is presented as novel without invoking prior author theorems to force uniqueness or outcomes.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The evaluation rests on standard assumptions of LLM prompting and human annotation reliability; no free parameters, invented entities, or non-standard axioms are described in the abstract.

axioms (1)
  • domain assumption: Human annotations of causal relations in academic text constitute a reliable ground truth.
    Implicit in the construction of the benchmark and F1 scoring.

pith-pipeline@v0.9.0 · 5764 in / 1113 out tokens · 48103 ms · 2026-05-19T14:07:50.389980+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. METER: Evaluating Multi-Level Contextual Causal Reasoning in Large Language Models

    cs.CL · 2026-04 · unverdicted · novelty 8.0

    METER benchmark reveals LLMs decline sharply in causal reasoning proficiency from association to intervention to counterfactual levels due to distraction by irrelevant facts and loss of faithfulness to provided context.

  2. Large Language Models for Causal Relations Extraction in Social Media: A Validation Framework for Disaster Intelligence

    cs.CL · 2026-05 · unverdicted · novelty 4.0

    The authors introduce a validation framework showing LLMs can pull causal links from disaster social media but require checks against post-event evidence to avoid relying on model priors.
