Identifying Risk Variables From Raw ESG Data Using Its Hierarchical Structure
Pith reviewed 2026-05-18 21:46 UTC · model grok-4.3
The pith
Raw ESG variables identified via hierarchical structure track financial risk more closely than aggregated scores.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We propose a framework specifically designed for ESG datasets characterized by a hierarchical data structure and a significantly larger number of variables than observations. We show that raw variables selected by the proposed framework are significantly more relevant to financial risk, measured by logarithmic volatility of return, than aggregated ESG scores. These selected risk variables provide additional insights beyond the traditional financial factors. We validate the robustness of this framework using out-of-sample data and illustrate it with company data from various sectors of the US economy, further identifying the specific ESG risk variables relevant to large and small companies.
What carries the argument
A variable-selection procedure that uses the hierarchical organization of ESG indicators to operate in the regime where variables greatly outnumber observations.
If this is right
- Raw variables chosen by the framework are significantly more relevant to financial risk than aggregated ESG scores.
- The selected variables supply explanatory power beyond traditional financial factors.
- Performance remains stable on out-of-sample periods.
- Distinct ESG risk variables emerge for large versus small firms inside each sector.
Where Pith is reading between the lines
- Portfolio risk models could substitute these targeted raw metrics for broad scores to tighten volatility forecasts.
- Sector-by-sector, size-specific selections point toward differentiated ESG risk factors rather than one-size-fits-all ratings.
- The same hierarchy-driven selection logic might extend to other high-dimensional non-financial datasets that share nested reporting structures.
Load-bearing premise
The hierarchical structure of ESG data can be leveraged to perform effective variable selection when the number of variables greatly exceeds the number of observations.
What would settle it
New data in which the framework-selected raw variables show no statistically significant improvement in relevance to logarithmic return volatility over aggregated ESG scores.
read the original abstract
Environmental, Social, and Governance (ESG) data provides non-financial insights into corporations. In this study, we aim to identify relevant ESG raw variables to assess financial risk, measured by logarithmic volatility of return. We propose a framework specifically designed for ESG datasets characterized by a hierarchical data structure and a significantly larger number of variables than observations. We show that raw variables selected by the proposed framework are significantly more relevant to financial risk than aggregated ESG scores. Furthermore, these selected risk variables provide additional insights beyond the traditional financial factors. We validate the robustness of this framework using out-of-sample data. We illustrate our framework using company data from various sectors of the US economy. We further identify the specific ESG risk variables relevant to large and small companies within each sector.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes a framework that exploits the hierarchical structure of ESG data to select raw variables relevant to financial risk, proxied by logarithmic volatility of returns, in a high-dimensional regime where the number of variables greatly exceeds the number of observations. It claims that these selected raw variables are significantly more relevant to risk than conventional aggregated ESG scores, supply incremental explanatory power beyond standard financial factors, and remain robust in out-of-sample tests. The approach is illustrated on US firm-level data across sectors, with separate identification of risk variables for large versus small companies within each sector.
Significance. If the selection procedure is shown to be non-circular and the reported outperformance holds under the stated validation, the work would strengthen the case for moving from coarse ESG aggregates to granular raw indicators in risk models. The explicit use of hierarchy to address the p ≫ n problem and the out-of-sample checks are constructive elements that could improve both predictive accuracy and interpretability in ESG-integrated portfolio risk management.
major comments (2)
- [Methods] The manuscript does not provide the exact encoding of the hierarchical structure (e.g., how levels or groups are defined and incorporated into the selection criterion or penalty) nor the precise algorithm used for variable selection. Without these details, it is impossible to verify that the reported superiority over aggregated scores is not an artifact of the particular implementation or data-handling choices.
- [Results / Validation] The out-of-sample validation is asserted but the concrete metrics (e.g., incremental R², mean-squared-error reduction, or statistical tests comparing selected raw variables against aggregated scores) are not reported with sufficient granularity or baseline specifications to support the central claim of significantly higher relevance.
minor comments (2)
- [Notation] Notation for the hierarchical levels and the risk target (log-volatility) should be introduced consistently in the first methods subsection and used uniformly thereafter.
- [Data] The description of the data sample (number of firms, time span, sector breakdown, and any exclusion rules) appears only in the empirical illustration; moving a concise summary to the data section would improve readability.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which help improve the clarity and rigor of our work. We address each major comment below and will incorporate the suggested enhancements in the revised manuscript.
read point-by-point responses
-
Referee: [Methods] The manuscript does not provide the exact encoding of the hierarchical structure (e.g., how levels or groups are defined and incorporated into the selection criterion or penalty) nor the precise algorithm used for variable selection. Without these details, it is impossible to verify that the reported superiority over aggregated scores is not an artifact of the particular implementation or data-handling choices.
Authors: We agree that additional explicit details are needed for full reproducibility. In the revised version, we will add a new subsection (and supplementary appendix) that precisely defines the hierarchical encoding: ESG data levels are structured as Category (E/S/G) > Subcategory (e.g., Emissions, Labor Practices) > Specific Indicator, with groups incorporated via a hierarchical group-lasso penalty where the penalty term is applied at each level with weights proportional to group size. We will also provide the full variable-selection algorithm, including the optimization objective, cross-validation procedure for tuning parameters, and pseudocode. revision: yes
-
Referee: [Results / Validation] The out-of-sample validation is asserted but the concrete metrics (e.g., incremental R², mean-squared-error reduction, or statistical tests comparing selected raw variables against aggregated scores) are not reported with sufficient granularity or baseline specifications to support the central claim of significantly higher relevance.
Authors: We acknowledge that the current presentation of out-of-sample results could be more granular. In the revision, we will expand the validation section to report incremental R², percentage MSE reduction, and formal statistical tests (including p-values from paired t-tests and Diebold-Mariano tests) for the selected raw variables versus aggregated ESG scores and standard financial-factor baselines. We will also clarify the exact train/test splits, rolling-window scheme, and sector-specific baseline specifications used. revision: yes
Circularity Check
No significant circularity; framework is data-driven against external target
full rationale
The paper proposes a hierarchical variable selection framework for p>>n ESG data and validates selected raw variables against an external risk measure (log-volatility of returns) with out-of-sample checks. Relevance is measured relative to this independent financial target rather than by construction from fitted parameters or self-citations. The derivation chain remains self-contained; any self-citations are non-load-bearing and do not reduce the central claim to its inputs.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption ESG datasets exhibit a hierarchical structure that can be exploited for variable selection in high-dimensional settings
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
We propose a novel Hierarchical Variable Selection (HVS) algorithm... Step 1: Perform a Stepwise regression with raw variables within each category. Use the AIC criterion... Step 3: Perform a Ridge regression...
-
IndisputableMonolith/Foundation/AlexanderDuality.leanalexander_duality_circle_linking unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
the hierarchical structure of the LSEG dataset... tree structure with significantly more variables than observations
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.