ClimateCause: Complex and Implicit Causal Structures in Climate Reports

Andrea Rocci; Liesbeth Allein; Marie-Francine Moens; Nataly Pineda-Castañeda

arXiv: 2604.14856 · v1 · submitted 2026-04-16 · cs.CL · cs.AI


Pith reviewed 2026-05-10 11:15 UTC · model grok-4.3


keywords: climate reports · causal structures · implicit causality · nested causality · causal discovery · large language models · causal reasoning · readability


The pith

The ClimateCause dataset annotates complex and implicit causal structures in climate reports and shows that LLMs struggle more with causal chain reasoning than with correlation inference.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces ClimateCause as a manually annotated dataset drawn from science-for-policy climate reports that captures higher-order causal structures including implicit and nested relations. Cause-effect expressions are normalized into individual relations and labeled for correlation, relation type, and spatiotemporal context to support graph construction. The dataset is applied to quantify statement readability according to the semantic complexity of the underlying causal graphs. Benchmarking of large language models on correlation inference versus causal chain reasoning tasks identifies the latter as a particular difficulty. A reader would care because grasping climate change depends on navigating these intricate causal networks rather than isolated direct links.

Core claim

ClimateCause is created through expert annotation of higher-order causal structures from climate reports, with cause-effect expressions normalized and disentangled into individual relations annotated for correlation, type, and context. This enables construction of causal graphs that include implicit and nested elements. The resource supports readability measurement based on causal graph complexity and reveals through LLM benchmarking that causal chain reasoning poses a greater challenge than correlation inference.
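The normalization step described above can be pictured as turning each cause-effect expression into one record and grouping records by cause. The following is a minimal sketch under stated assumptions: the `CausalRelation` class, its field names, the label values, and the example relations are all hypothetical and are not taken from the dataset or its release format.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class CausalRelation:
    """One disentangled cause-effect relation (hypothetical schema)."""
    cause: str
    effect: str
    correlation: str    # assumed label values, e.g. "positive" / "negative"
    relation_type: str  # assumed label values, e.g. "explicit" / "implicit"
    context: str        # free-text spatiotemporal context


def build_graph(relations):
    """Group relations by cause to form an adjacency-list causal graph."""
    graph = defaultdict(list)
    for rel in relations:
        graph[rel.cause].append(rel)
    return dict(graph)


# Illustrative relations only; they are not drawn from ClimateCause.
relations = [
    CausalRelation("greenhouse gas emissions", "global warming",
                   "positive", "explicit", "global"),
    CausalRelation("global warming", "sea level rise",
                   "positive", "implicit", "global"),
]
graph = build_graph(relations)
```

Keeping the annotation labels on the edge records, rather than on the nodes, lets the same event appear in several relations with different correlation or context values.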

What carries the argument

The ClimateCause dataset of expert-annotated higher-order causal structures from climate reports, including implicit and nested causality, with normalized cause-effect expressions labeled for correlation, relation type, and spatiotemporal context to enable graph-based analysis.

If this is right

The dataset enables more rigorous evaluation of models on complex causality beyond explicit direct relations.
Readability of climate statements can be quantified using the semantic complexity of their underlying causal graphs.
Causal discovery methods can incorporate annotations for correlation, type, and context to build richer graphs from policy documents.
Targeted improvements in language models can focus on multi-step causal chain reasoning for domain-specific texts.
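To make the readability point concrete, one could summarize a statement's causal graph with a few structural counts. This is only an illustrative proxy: the page does not state which complexity measure the paper uses, so the function names, the choice of features (nodes, edges, longest chain), and the example edges below are assumptions. The sketch also assumes the graph is acyclic.

```python
def longest_chain(edges):
    """Length in edges of the longest cause-to-effect chain (acyclic input assumed)."""
    children = {}
    for cause, effect in edges:
        children.setdefault(cause, []).append(effect)

    memo = {}

    def depth(node):
        # Longest path starting at `node`, cached to avoid recomputation.
        if node not in memo:
            memo[node] = max(
                (1 + depth(child) for child in children.get(node, [])),
                default=0,
            )
        return memo[node]

    nodes = {n for edge in edges for n in edge}
    return max((depth(n) for n in nodes), default=0)


def complexity_profile(edges):
    """Hypothetical complexity summary: node count, edge count, longest chain."""
    nodes = {n for edge in edges for n in edge}
    return {
        "nodes": len(nodes),
        "edges": len(edges),
        "longest_chain": longest_chain(edges),
    }


# Illustrative edges only; not taken from the dataset.
edges = [
    ("emissions", "warming"),
    ("warming", "ice melt"),
    ("ice melt", "sea level rise"),
    ("warming", "heatwaves"),
]
profile = complexity_profile(edges)
```

A statement whose graph has a longer chain or more branching would score as more complex under this proxy; whether that tracks human-perceived readability is an empirical question.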

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Similar annotation schemes for implicit causality could be applied to reports in other policy domains to test model limits on nested reasoning.
Integrating causal graph complexity metrics into text analysis tools might aid efforts to make climate information more accessible.
Model training that emphasizes chain reasoning on annotated graphs like these could address gaps in handling real-world causal networks.

Load-bearing premise

Expert annotators can consistently and accurately identify and label implicit, nested, and higher-order causal structures in the source climate reports to create reliable ground truth.

What would settle it

A study finding low agreement among multiple experts annotating the same climate report passages for these complex causal relations, or an LLM achieving comparable performance on causal chain reasoning tasks to correlation inference without using ClimateCause-style data.
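The agreement study mentioned above would typically report a chance-corrected statistic. As a sketch, the block below computes Cohen's kappa for two annotators labeling the same items. Cohen's kappa is a standard choice for two raters, but nothing on this page says which agreement metric the authors use, and the label values are hypothetical.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement under independent labeling with each rater's marginals.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical correlation labels from two annotators.
annotator_1 = ["positive", "positive", "negative", "negative"]
annotator_2 = ["positive", "negative", "negative", "negative"]
kappa = cohen_kappa(annotator_1, annotator_2)
```

Here the annotators agree on three of four items, but half of that agreement is expected by chance given their label distributions, so kappa comes out at 0.5.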

Figures

Figures reproduced from arXiv: 2604.14856 by Andrea Rocci, Liesbeth Allein, Marie-Francine Moens, Nataly Pineda-Castañeda.

**Figure 1.** A sample from the ClimateCause dataset, showcasing the complex causal graphs and fine-grained annotations it contains.

**Figure 2.** Mean readability scores of statements in Po…

Original abstract

Understanding climate change requires reasoning over complex causal networks. Yet, existing causal discovery datasets predominantly capture explicit, direct causal relations. We introduce ClimateCause, a manually expert-annotated dataset of higher-order causal structures from science-for-policy climate reports, including implicit and nested causality. Cause-effect expressions are normalized and disentangled into individual causal relations to facilitate graph construction, with unique annotations for cause-effect correlation, relation type, and spatiotemporal context. We further demonstrate ClimateCause's value for quantifying readability based on the semantic complexity of causal graphs underlying a statement. Finally, large language model benchmarking on correlation inference and causal chain reasoning highlights the latter as a key challenge.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

ClimateCause adds a focused dataset for implicit and nested causality in climate reports, but the missing annotation reliability details limit how far the claims can be trusted.


The main point is that this paper introduces ClimateCause, a manually annotated dataset drawn from climate science-for-policy reports that targets higher-order, implicit, and nested causal structures rather than the usual explicit direct links. They normalize cause-effect expressions, disentangle them into separate relations, add labels for correlation, relation type, and spatiotemporal context, then build graphs from that. They also apply the graphs to measure readability via causal complexity and run LLM tests that flag causal chain reasoning as harder than basic correlation inference.

Referee Report

1 major / 1 minor

Summary. The paper introduces ClimateCause, a manually expert-annotated dataset of higher-order causal structures from science-for-policy climate reports, including implicit and nested causality. Cause-effect expressions are normalized and disentangled into individual relations for graph construction, with annotations for correlation, relation type, and spatiotemporal context. The work shows the dataset's utility for quantifying readability via causal graph complexity and benchmarks LLMs on correlation inference versus causal chain reasoning, identifying the latter as a key challenge.
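The difference between the two benchmark tasks can be illustrated on a toy graph: correlation inference concerns a single edge, whereas causal chain reasoning requires following several edges in sequence. The sketch below finds one shortest cause-to-effect path with breadth-first search. It is an illustration only; the function name and the example edges are hypothetical, and the page does not describe how the paper actually constructs its chain reasoning items.

```python
from collections import deque


def causal_chain(edges, source, target):
    """Return one shortest cause-to-effect path from source to target, or None."""
    children = {}
    for cause, effect in edges:
        children.setdefault(cause, []).append(effect)

    queue = deque([[source]])
    seen = {source}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for nxt in children.get(path[-1], []):
            if nxt not in seen:  # visit each event once
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


# Illustrative edges only; not taken from the dataset.
edges = [
    ("emissions", "warming"),
    ("warming", "ice melt"),
    ("ice melt", "sea level rise"),
]
chain = causal_chain(edges, "emissions", "sea level rise")
```

A multi-hop question such as "do emissions eventually contribute to sea level rise?" corresponds to a path of three edges here, none of which states the answer on its own.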

Significance. If the annotations are reliable, ClimateCause could address a gap in causal discovery datasets by targeting complex, implicit structures in climate policy texts. The readability metric and LLM benchmarking provide concrete demonstrations of the dataset's potential value, particularly in exposing model weaknesses on chain reasoning that could inform targeted improvements in scientific NLP.

major comments (1)

[Dataset Construction / Methods] The validity of the dataset, readability quantification, and all LLM benchmarking results rests on the expert annotations of implicit, nested, and higher-order causal structures. No inter-annotator agreement statistics, number of annotators, adjudication protocol, or disagreement resolution details are provided (see dataset construction description in the abstract and implied methods). For this class of subjective structures, low agreement would mean performance gaps cannot be confidently attributed to model limitations rather than label noise.

minor comments (1)

[Abstract] The abstract refers to 'unique annotations for cause-effect correlation, relation type, and spatiotemporal context' without specifying the exact label inventory, annotation guidelines, or examples of how nested relations are represented in the graphs.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback, which highlights a key aspect of dataset reliability that requires clarification. We address the major comment point by point below.


Referee: [Dataset Construction / Methods] The validity of the dataset, readability quantification, and all LLM benchmarking results rests on the expert annotations of implicit, nested, and higher-order causal structures. No inter-annotator agreement statistics, number of annotators, adjudication protocol, or disagreement resolution details are provided (see dataset construction description in the abstract and implied methods). For this class of subjective structures, low agreement would mean performance gaps cannot be confidently attributed to model limitations rather than label noise.

Authors: We agree that the absence of these details in the current manuscript is a limitation, as they are essential for validating annotations of complex, implicit causal structures. The manuscript describes the dataset as 'manually expert-annotated' but does not provide the requested statistics or protocols. In the revised version, we will expand the Methods section (and update the abstract if needed) to include the number of annotators, inter-annotator agreement metrics, and the full adjudication protocol for resolving disagreements. This will allow readers to assess whether performance gaps in LLM benchmarking can be confidently attributed to model limitations.

Revision: yes
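For the adjudication protocol, the appendix text extracted further down this page mentions consulting a third annotator and then taking a majority vote for unresolved relations. A minimal sketch of that last step follows. The function name, the vote labels, and the choice to return `None` on a tie are assumptions, not details from the paper.

```python
from collections import Counter


def adjudicate(votes):
    """Return the majority label among annotator votes, or None on a tie."""
    if not votes:
        raise ValueError("at least one vote is required")
    ranked = Counter(votes).most_common()
    # A tie between the two most frequent labels means no majority exists.
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]


# Hypothetical votes on whether a candidate relation enters the dataset.
two_annotators = adjudicate(["include", "exclude"])
with_third = adjudicate(["include", "exclude", "include"])
```

With two annotators a disagreement stays unresolved, which is why a third vote is needed before a majority can exist.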

Circularity Check

0 steps flagged

No circularity; dataset introduction with independent empirical evaluation


The paper introduces ClimateCause as a new expert-annotated dataset of causal structures from climate reports and demonstrates its use via readability metrics and LLM benchmarking. No mathematical derivations, parameter fitting, self-referential predictions, or load-bearing self-citations appear in the provided text. All claims rest on newly created annotations and direct empirical comparisons to external LLM performance, satisfying the self-contained criterion with no reduction of outputs to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central contribution rests on the assumption that expert annotations faithfully capture the intended causal structures; no free parameters or new invented entities are introduced.

axioms (1)

Domain assumption: Expert manual annotations can be performed consistently and accurately enough to serve as reliable ground truth for implicit, nested, and higher-order causal structures in the source reports.
The entire dataset and all downstream claims depend on this premise.

pith-pipeline@v0.9.0 · 5414 in / 1172 out tokens · 52779 ms · 2026-05-10T11:15:34.656781+00:00 · methodology


Reference graph

Works this paper leans on

15 extracted references · 15 canonical work pages

[1]

The Berkeley FrameNet Project. In 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, Volume 1, pages 86–90, Montreal, Quebec, Canada. Association for Computational Linguistics. Ralf Barkemeyer, Suraje Dessai, Beatriz Monge-Sanz, Barbara Gabriella Renzi, and Giulio Napolitano

[2]

Linguistic analysis of IPCC summaries for policymakers and associated coverage. Nature Climate Change, 6(3):311–316. Andrew M. Bean, Ryan Othniel Kearns, Angelika Romanou, Franziska Sofia Hafner, Harry Mayne, Jan Batzner, Negar Foroutan, Chris Schmitz, Karolina Korgul, Hunar Batra, Oishi Deb, Emma Beharry, Cornelius Emde, Thomas Foster, Anna Gausen, ...

[3]

Public understanding of climate change terminology. Climatic Change, 167(3):37. Roberto Ceraolo, Dmitrii Kharlapenko, Amélie Reymond, Rada Mihalcea, Bernhard Schölkopf, Mrinmaya Sachan, and Zhijing Jin. 2024. Analyzing Human Questioning Behavior and Causal Curiosity through Natural Queries. In Causality and Large Models @ NeurIPS 2024. Jeanne Ste...

[4]

e-CARE: A New Dataset for Exploring Explainable Causal Reasoning. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 432–446, Dublin, Ireland. Association for Computational Linguistics. Jesse Dunietz, Lori Levin, and Jaime Carbonell. 2017. The BECauSE Corpus 2.0: Annotating Causal...

[5]

CauseNet: Towards a Causality Graph Extracted from the Web. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management, pages 3023–3030. Jena D Hwang, Chandra Bhagavatula, Ronan Le Bras, Jeff Da, Keisuke Sakaguchi, Antoine Bosselut, and Yejin Choi. 2021. (Comet-) Atomic 2020: On Symbolic and Neural Commonsense Knowled...

[6]

CaTeRS: Causal and Temporal Relation Scheme for Semantic Annotation of Event Structures. In Proceedings of the Fourth Workshop on Events, pages 51–61, San Diego, California. Association for Computational Linguistics. Judea Pearl. 2009. Causality. Cambridge University Press. Nataly Pineda and Liesbeth Allein. 2025a. PolarIs3CAUS (Version 1.0) [Dataset]. Na...

[7]

CRAB: Assessing the Strength of Causal Relationships Between Real-world Events. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 15198–15216, Singapore. Association for Computational Linguistics. Maarten Sap, Ronan Le Bras, Emily Allaway, Chandra Bhagavatula, Nicholas Lourie, Hannah Rashkin, Brendan R...

[8]

Causal relation extraction: Decide whether the statement expresses at least one causal relation. If so, identify all the cause-effect pairs, specify their context mentioned in the statement, highlight the terms in the statement that
Section Title: 2 Current Status and Trends; 2.1 Observed Changes, Impacts and Attribution; 2.1.1 Observed Warming and its Caus...

[9]

Standardization and characterization: Standardize the phrasing of the cause and effect by formulating the events into noun phrases and characterize the relation type and correlation of the causal relation

[10]

human-caused climate change

Complex causal structures: Identify and label causal structures present in the statement, such as common cause/effect. B.1 Annotation Setting: The annotators were presented with statements from the IPCC reports, where each statement (ranging from single sentences to full paragraphs) was shown on a line in an Excel file. The features they had to annotate...

[11]

Manually identify the causal relations that both annotators retrieved

[12]

Compare annotations for these relations: (a) Mark and resolve incorrect spans. (b) Mark and correct violations against annotation guidelines. (c) Mark disagreement between annotators

[13]

Look at the causal relations that one annotator annotated but the second did not: (a) Include relations that the second annotator did not include but that are very similar to other relations they annotated before, e.g., anthropogenic. They possibly missed these due to the high cognitive load of the task. (b) Include relations that the missed ann...

[14]

For remaining unresolved relations: (a) Consult third annotator, then majority vote.

E Readability of IPCC Reports

IPCC reports are known for low readability (Barkemeyer et al., 2016). We examine whether statements in the ClimateCause dataset are low in readability and whether they are less readable than those in related climate causality datasets (...

[15]

Persistent and region-specific barriers also continue to hamper the economic and political feasibility of deploying AFOLU mitigation options

The experiments were sent to the Batch API, where in total 34,303 requests were made, such that the model handled 9,880,620 completion tokens (Costs: $5.052 input tokens; $11.4 output tokens).

G.3 Evaluation Metrics

Precision:
\[ \text{Precision} = \frac{TP}{TP + FP} \tag{8} \]

Recall:
\[ \text{Recall} = \frac{TP}{TP + FN} \tag{9} \]

F1-score:
\[ F_1 = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}} \tag{10} \]

G.4 Breakdown of Re...
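The precision, recall, and F1 definitions quoted from G.3 can be computed directly from true positive, false positive, and false negative counts. The sketch below does so. Returning 0.0 when a denominator is zero is a common convention but an assumption here, since the extracted text does not say how the paper handles that case, and the example counts are made up.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from raw counts.

    Returns 0.0 for any metric whose denominator is zero (assumed convention).
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts: 8 correct predictions, 2 spurious, 2 missed.
precision, recall, f1 = precision_recall_f1(tp=8, fp=2, fn=2)
```

With equal numbers of false positives and false negatives, precision and recall coincide, and F1 equals both.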

