GHGbench: A Unified Multi-Entity, Multi-Task Benchmark for Carbon Emission Prediction

Chao Xue; Flora Salim; Lihuan Li; Siyuan Zheng; Yifan Duan

arxiv: 2605.13743 · v3 · pith:27UO5YVAnew · submitted 2026-05-13 · 💻 cs.LG

Pith reviewed 2026-05-14 19:45 UTC · model grok-4.3

classification 💻 cs.LG

keywords: carbon emission prediction · greenhouse gas benchmark · out-of-distribution generalization · tabular foundation models · remote sensing embeddings · building emissions · company disclosures · multi-city transfer

The pith

GHGbench shows building carbon emissions are structurally harder to predict than company emissions, with out-of-distribution gaps dominating model differences.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces GHGbench as a unified open benchmark that combines fragmented company and building emission datasets into consistent tracks for prediction tasks. The company track uses over 32,000 records with financial and sectoral signals, while the building track harmonizes nearly 500,000 records from 13 sources across 26 cities with climate and remote-sensing features. Evaluations under in-distribution and cross-region splits reveal that performance drops sharply on new cities or distributions, exceeding the gains from switching model architectures, though a tabular foundation model shows the first significant edge over tuned trees on building data and multimodal embeddings help where tabular methods falter. These patterns matter because accurate entity-level forecasts underpin emission reduction policies and corporate reporting, yet current approaches hit systematic limits on transfer. The benchmark also flags catastrophic city transfer and sector lookup ceilings as recurring failure modes that future work must address.

Core claim

GHGbench establishes that building-level greenhouse gas emission prediction is structurally more difficult than company-level prediction, that the in-distribution to out-of-distribution performance gap substantially exceeds within-model differences across both tracks, that a tabular foundation model is the first baseline to open a paired-bootstrap-significant improvement over tuned gradient-boosted trees on multi-city building tasks, and that multimodal remote-sensing embeddings deliver gains precisely where tabular generalization collapses, while exposing catastrophic city transfer and sector-factor lookup ceilings as systematic limitations.

What carries the argument

The GHGbench benchmark, consisting of a company track with 32,000+ records and a building track with 491,591 harmonized records across 26 metropolitan areas, evaluated on canonical in-distribution and cross-region/city transfer splits using multi-seed paired-bootstrap statistical tests.

Load-bearing premise

Harmonizing 13 heterogeneous building data sources into a single schema produces accurate labels and features without introducing systematic errors that affect the reported generalization gaps.

What would settle it

Re-evaluate the building track on the same splits using independently sourced and harmonized emission labels from additional cities. If the paired-bootstrap significance between the tabular foundation model and tuned trees disappears under that re-evaluation, the central performance claims are falsified.

Figures

Figures reproduced from arXiv: 2605.13743 by Chao Xue, Flora Salim, Lihuan Li, Siyuan Zheng, Yifan Duan.

**Figure 2.** Dataset coverage. Left: company-year rows by region. Right: building-year rows by …

**Figure 3.** Building-track R² on the building-grouped split across the nine feature tiers and three ladders defined in §3.2 (full registry in Appendix H). Shaded bands mark proxy-rich and direct-energy-proxy tiers.

5.2 Analysis and Findings: Sector-factor estimation trails learned models. Predicting emissions by multiplying revenue with the ExioML/EXIOBASE sectoral factor reaches R² = 0.222 on the firm-matched compan…

**Figure 4.** Building-track leave-one-city-out on the 26-city cross-country core tier. Cities sorted by RF …

**Figure 5.** Per-city non-null availability (%) for building-level schema fields. Cells at …

**Figure 6.** Building-track regression feature-tier ladder, grouped-building split, 3-seed mean …

**Figure 7.** Tuned LightGBM permutation ΔR² on the strict-coverage panel (top six features; error bars: std over five repeats).

M Compute and Runtime: All experiments were run on a single workstation with 8 × NVIDIA RTX A5000 (24 GB) GPUs and a multi-core CPU; only TabPFN, MLP, and time-series foundation-model inference made use of GPUs. Tree baselines (RandomForest, XGBoost, LightGBM, HistGradientBoosting) ran exclusi…

**Figure 8.** Left: Task B1 strict temporal hold-out R² on core_all_cities (single run). Right: Task E1 short-horizon forecasting R². Both panels clipped on the negative side; raw Ridge values annotated …

**Figure 9.** Sentinel-2 + Clay multimodal extension. Left: Task A grouped …

**Figure 10.** Paired-bootstrap ΔR² between tree-family pairs per feature tier. Stars: p_R² < 0.05 (∗), < 0.01 (∗∗), < 0.001 (∗∗∗)

Original abstract

Open datasets and benchmarks for entity-level carbon-emission prediction remain fragmented across access, scale, granularity, and evaluation. We introduce GHGbench, an open dataset and benchmark for company- and building-level greenhouse-gas prediction. The company track contains 32,000+ company-year records from 12,000+ firms with Scope 1+2 and Scope 3 disclosures and financial/sectoral signals; the building track harmonises 491,591 building-year records from 13 open sources into a single schema across 26 metropolitan areas (10 U.S., 15 Australian, 1 Singaporean), with climate covariates and multimodal remote-sensing embeddings. GHGbench defines canonical splits with in-distribution and cross-region/city transfer as primary tasks and temporal hold-out plus short-horizon forecasting as supplementary appendix evidence; headline baselines span gradient-boosted trees, a tabular foundation model, MLP, FT-Transformer, and multimodal fusion, with an LLM panel as auxiliary, all evaluated under multi-seed paired-bootstrap tests. Three benchmark-level findings emerge: (i) building emissions are structurally harder than company emissions; (ii) the in-distribution to out-of-distribution gap dwarfs any within-model gap across both the company track and the building track, and a tabular foundation model is, to our knowledge, the first baseline to open a paired-bootstrap-significant gap over tuned trees on a multi-city building-emissions task; (iii) multimodal remote-sensing embeddings help precisely where tabular generalisation breaks. GHGbench also exposes catastrophic city transfer and the sector-factor lookup ceiling as systematic failure modes. Code and reconstruction recipes are available at GHGbench.
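The split families named in the abstract can be made concrete with a small sketch. The Python below is illustrative only: the record fields (`building_id`, `city`, `year`) and both helper functions are hypothetical names chosen for this example, not GHGbench's released code. It shows a grouped split that keeps every year of a building on one side, and a leave-one-city-out loop of the kind a city-transfer evaluation would use.

```python
import random


def grouped_split(records, key, test_frac=0.2, seed=0):
    """Split records so that all rows sharing `key` land on the same side."""
    groups = sorted({r[key] for r in records})
    rng = random.Random(seed)
    rng.shuffle(groups)
    n_test = max(1, round(test_frac * len(groups)))
    test_groups = set(groups[:n_test])
    train = [r for r in records if r[key] not in test_groups]
    test = [r for r in records if r[key] in test_groups]
    return train, test


def leave_one_city_out(records):
    """Yield (held_out_city, train, test), holding out one city per fold."""
    for city in sorted({r["city"] for r in records}):
        train = [r for r in records if r["city"] != city]
        test = [r for r in records if r["city"] == city]
        yield city, train, test


# Toy building-year records (hypothetical schema, not the benchmark's).
records = [
    {"building_id": f"{city}-{b}", "city": city, "year": year}
    for city in ("A", "B", "C")
    for b in range(4)
    for year in (2020, 2021)
]

train, test = grouped_split(records, key="building_id")
folds = list(leave_one_city_out(records))
```

Grouping by building prevents the same building from appearing in both train and test across years, which is the leakage an in-distribution split has to guard against. Holding out a whole city is the stricter setting, because no row from the test city is seen during training.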

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

GHGbench gives the field a usable shared benchmark for company and building emissions with clear splits, but the building-track harmonization lacks the checks needed to trust the reported generalization gaps.

The main thing here is a new open benchmark that pulls company disclosures and building records into one place with explicit in-distribution, cross-region, and temporal splits. That setup is genuinely helpful for people working on entity-level carbon models because it forces consistent evaluation instead of the usual ad-hoc datasets. The authors release the reconstruction recipes and code, which is the right move, and they run multi-seed paired-bootstrap tests so the model comparisons are at least statistically grounded on the surface. The three headline observations (buildings harder than companies, ID-to-OOD gaps dominating model differences, and remote-sensing embeddings helping where tabular features fail) follow directly from those runs and are worth testing further. The tabular foundation model beating tuned trees on the multi-city building task is a concrete data point rather than hand-waving.

The soft spot is the building track. Harmonizing 491k records from 13 sources into one schema is the load-bearing step for all three findings, yet the abstract and stress-test note give no numbers on inter-source label agreement, single-source ablations, or how imputation and emission-factor choices affect the targets. If those steps inject city-specific or reporting-style artifacts, the reported transfer difficulty and multimodal gains could be partly spurious. That concern is not fatal but it is central, and it needs quantitative evidence in the full paper.

This is the kind of work a reading group on environmental ML would discuss for the dataset itself rather than the modeling tricks. It deserves a serious referee because the benchmark infrastructure is new and the evaluation protocol is reproducible; reviewers can push on the harmonization validation without dismissing the effort. I would send it out rather than desk-reject.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces GHGbench, a unified benchmark for company- and building-level greenhouse gas emission prediction. The company track aggregates over 32,000 company-year records from 12,000+ firms with Scope 1+2 and Scope 3 disclosures plus financial/sectoral signals. The building track harmonizes 491,591 building-year records from 13 open sources into a single schema across 26 metropolitan areas (10 U.S., 15 Australian, 1 Singaporean), incorporating climate covariates and multimodal remote-sensing embeddings. Canonical splits emphasize in-distribution versus cross-region/city transfer tasks, with temporal hold-out and short-horizon forecasting as supplementary evidence. Baselines include gradient-boosted trees, a tabular foundation model, MLP, FT-Transformer, multimodal fusion, and an auxiliary LLM panel, all evaluated with multi-seed paired-bootstrap tests. Three headline findings are reported: (i) building emissions are structurally harder than company emissions; (ii) ID-to-OOD gaps dwarf within-model differences, with the tabular foundation model achieving the first paired-bootstrap-significant improvement over tuned trees on the multi-city building task; (iii) multimodal remote-sensing embeddings help precisely where tabular generalization breaks. The work also identifies catastrophic city transfer and sector-factor lookup ceilings as systematic failure modes, with code and reconstruction recipes released.

Significance. If the harmonization steps are validated to preserve unbiased labels and features, GHGbench would constitute a valuable contribution by establishing the first large-scale, multi-entity benchmark that systematically tests generalization across cities, regions, and modalities in carbon-emission prediction. The explicit release of code/recipes, use of paired-bootstrap significance testing, and identification of concrete failure modes (city transfer, lookup ceilings) are strengths that support reproducibility and future work. The reported dominance of distribution shift over model choice, together with the utility of remote-sensing embeddings, could usefully inform model design in this application area.

major comments (1)

[Building track harmonization] Building track (abstract and § on data construction): the central claims (i)–(iii) all rest on the harmonized 491k-record building dataset. The manuscript states that 13 heterogeneous sources were unified but reports no quantitative validation of this step—no inter-source label agreement metrics, no ablation on single-source subsets, and no audit of imputation or aggregation-rule effects. Without such checks, systematic differences in reporting standards, emission-factor assumptions, or city-level aggregation could artifactually inflate the reported ID/OOD gaps and multimodal gains, exactly as flagged by the weakest-assumption analysis.

minor comments (2)

The abstract and methods would benefit from a concise table summarizing the 13 building sources, their original schemas, and the exact harmonization rules applied (even if full recipes are in the released code).
Clarify whether the paired-bootstrap tests correct for multiple comparisons across the many model–split combinations reported.
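To illustrate what a multiple-comparison correction of this kind could look like, here is a minimal sketch of the Holm step-down adjustment. It is offered as one standard option, not as the procedure the paper uses; the p-values are toy numbers invented for the example, and `holm_adjust` is a hypothetical helper name.

```python
def holm_adjust(p_values):
    """Return Holm step-down adjusted p-values, in the original order."""
    m = len(p_values)
    # Visit hypotheses from the smallest to the largest raw p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p-value is scaled by (m - k + 1), capped at 1.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity: adjusted values never decrease with rank.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Toy raw p-values for four model-split comparisons (made up for illustration).
raw = [0.01, 0.04, 0.03, 0.005]
adjusted = holm_adjust(raw)
significant = [p < 0.05 for p in adjusted]
```

In this toy case all four raw p-values fall below 0.05, but only two remain below 0.05 after adjustment, which is the kind of change the referee's question is probing.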

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the potential value of GHGbench for the community. We address the single major comment below and will incorporate the suggested validation steps in the revised manuscript.

Referee: [Building track harmonization] Building track (abstract and § on data construction): the central claims (i)–(iii) all rest on the harmonized 491k-record building dataset. The manuscript states that 13 heterogeneous sources were unified but reports no quantitative validation of this step—no inter-source label agreement metrics, no ablation on single-source subsets, and no audit of imputation or aggregation-rule effects. Without such checks, systematic differences in reporting standards, emission-factor assumptions, or city-level aggregation could artifactually inflate the reported ID/OOD gaps and multimodal gains, exactly as flagged by the weakest-assumption analysis.

Authors: We agree that quantitative validation of the harmonization is necessary to support the central claims. The original manuscript emphasized release of the full reconstruction recipes to enable external audits, but did not include explicit agreement metrics or sensitivity checks. In the revision we will add: (i) pairwise label agreement statistics on the subset of buildings that appear in multiple sources, (ii) performance ablations restricted to single-source city subsets for the largest metropolitan areas, and (iii) sensitivity tables showing how ID/OOD gaps and multimodal gains change under alternative imputation and aggregation rules. These additions will confirm that the reported findings are robust to harmonization choices.

Revision: yes
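A pairwise label agreement check of the kind promised in item (i) might be sketched as follows. This is an assumption about how such a check could be structured, not the authors' method: the source names, the `(building_id, year)` key, the symmetric relative-difference metric, and the function `pairwise_agreement` are all hypothetical.

```python
from itertools import combinations
from statistics import median


def pairwise_agreement(labels_by_source):
    """Compare emission labels on records shared by each pair of sources.

    labels_by_source maps a source name to {(building_id, year): emissions}.
    Returns {(source_a, source_b): {"n_shared", "median_rel_diff"}} for every
    pair of sources with at least one shared record.
    """
    report = {}
    for a, b in combinations(sorted(labels_by_source), 2):
        shared = labels_by_source[a].keys() & labels_by_source[b].keys()
        if not shared:
            continue
        rel_diffs = []
        for key in shared:
            ya, yb = labels_by_source[a][key], labels_by_source[b][key]
            scale = (abs(ya) + abs(yb)) / 2.0
            # Symmetric relative difference; 0.0 when both labels are zero.
            rel_diffs.append(abs(ya - yb) / scale if scale else 0.0)
        report[(a, b)] = {
            "n_shared": len(shared),
            "median_rel_diff": median(rel_diffs),
        }
    return report


# Toy labels from three hypothetical sources.
labels = {
    "source_x": {("b1", 2020): 100.0, ("b2", 2020): 200.0, ("b3", 2020): 50.0},
    "source_y": {("b1", 2020): 110.0, ("b2", 2020): 200.0},
    "source_z": {("b9", 2020): 10.0},
}
report = pairwise_agreement(labels)
```

Pairs with no overlapping records are omitted from the report, which also shows how much of the harmonized data cannot be cross-checked this way.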

Circularity Check

0 steps flagged

No circularity: benchmark relies on external data harmonization and standard evaluation protocols

The paper constructs GHGbench by aggregating and harmonizing 13 external public building datasets plus company disclosures, defines canonical ID/OOD splits, and evaluates off-the-shelf baselines (trees, tabular foundation models, multimodal fusion) under paired-bootstrap tests. No equations, fitted parameters, or self-citations are used to derive the three headline empirical findings; those findings are direct statistical comparisons on the released data. The harmonization step is presented as a preprocessing recipe whose validity is left to external audit rather than being defined in terms of the reported gaps. This is a standard benchmark paper whose derivation chain is self-contained against external sources and does not reduce any claim to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper relies on standard supervised learning assumptions and public data sources without introducing new free parameters, axioms beyond common ML practice, or invented entities.

axioms (1)

[standard math] Standard multi-seed paired-bootstrap statistical tests are appropriate for comparing model performance on this data.
Invoked for all headline baseline comparisons.
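For readers unfamiliar with the test this axiom refers to, a minimal single-seed sketch of a paired bootstrap on the R² difference between two models follows. It is not the paper's implementation: the percentile interval, the two-sided sign-based p-value, the toy data, and the function names are assumptions made for illustration, and the multi-seed aggregation is not shown.

```python
import random


def r2(y, pred):
    """Coefficient of determination for paired lists of targets and predictions."""
    mean_y = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, pred))
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    return 1.0 - ss_res / ss_tot


def paired_bootstrap_delta_r2(y, pred_a, pred_b, n_boot=1000, seed=0):
    """Bootstrap R2(A) - R2(B), resampling the same test rows for both models."""
    rng = random.Random(seed)
    n = len(y)
    deltas = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [y[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # R2 is undefined when the resampled targets are constant
        ra = r2(ys, [pred_a[i] for i in idx])
        rb = r2(ys, [pred_b[i] for i in idx])
        deltas.append(ra - rb)
    if not deltas:
        raise ValueError("no usable bootstrap resamples")
    deltas.sort()
    m = len(deltas)
    frac_le = sum(d <= 0 for d in deltas) / m
    frac_ge = sum(d >= 0 for d in deltas) / m
    return {
        "delta": r2(y, pred_a) - r2(y, pred_b),
        "ci_low": deltas[int(0.025 * m)],
        "ci_high": deltas[min(m - 1, int(0.975 * m))],
        # One common convention for a two-sided bootstrap p-value.
        "p_value": min(1.0, 2.0 * min(frac_le, frac_ge)),
    }


# Toy data: model A has small errors, model B has large errors.
rng = random.Random(42)
y = [rng.uniform(0.0, 10.0) for _ in range(200)]
pred_a = [v + rng.gauss(0.0, 0.1) for v in y]
pred_b = [v + rng.gauss(0.0, 1.0) for v in y]
result = paired_bootstrap_delta_r2(y, pred_a, pred_b)
```

The pairing matters: both models are scored on the same resampled rows in each replicate, so row-level difficulty cancels out of the difference rather than inflating its variance.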

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: The building track harmonises 491,591 building-year records from 13 open sources into a single schema... headline baselines span gradient-boosted trees, a tabular foundation model, MLP, FT-Transformer, and multimodal fusion

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: Three benchmark-level findings emerge: (i) building emissions are structurally harder... (ii) the in-distribution to out-of-distribution gap dwarfs any within-model gap... (iii) multimodal remote-sensing embeddings help precisely where tabular generalisation breaks

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.