arxiv: 2605.09310 · v1 · submitted 2026-05-10 · 💻 cs.AI · q-fin.PM

Beyond ESG Scores: Learning Dynamic Constraints for Sequential Portfolio Optimization

Xin Li , Yan Ke , Longbing Cao

Pith reviewed 2026-05-12 04:19 UTC · model grok-4.3

keywords: ESG-aware portfolio optimization, dynamic constraints, multimodal learning, sequential decision making, constrained optimization, portfolio management, sustainable investing, action-conditioned models

The pith

Dynamic ESG constraints learned from multimodal evidence reduce tail budget pressure in sequential portfolio optimization without harming returns.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that appending static ESG scores to policy observations or rewards fails for sequential portfolio decisions because such scores are noisy, provider-dependent, low-frequency, and misaligned with trading timing. It instead treats ESG as a set of learnable constraints imposed separately from the financial policy. A Multimodal Action-Conditioned Constraint Field (MACF) is trained on point-in-time multimodal evidence and contemplated portfolio transitions to produce mechanism-specific ESG cost functions. These costs are then converted by MACF-X adapters into native constrained-optimization interfaces using a slack- and uncertainty-aware pressure layer. Experiments across multiple interfaces show lower tail ESG budget pressure than static-score or noise baselines while financial performance remains competitive, with ablations confirming that dynamic evidence and three-head decomposition are required for the gains.

Core claim

The central claim is that ESG can be operationalized as dynamic, mechanism-specific constraints learned by a Multimodal Action-Conditioned Constraint Field (MACF) from point-in-time multimodal evidence and action-conditioned transitions, then adapted via MACF-X into standard optimizer interfaces through a shared slack- and uncertainty-aware pressure layer. This separation leaves the underlying financial policy unchanged yet produces materially lower tail ESG budget pressure than static ESG-score proxies, which perform indistinguishably from score-shuffled noise baselines.
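One way to read this claim formally is as a constrained Markov decision process in the standard sense. The notation below is an assumption made for illustration and is not taken from the paper: $r$ is the financial reward, $c_k$ is a learned cost for ESG mechanism $k$ that depends on the state $s_t$, the contemplated action $a_t$, and the point-in-time evidence $e_t$, and $d_k$ is the budget for that mechanism.

```latex
\max_{\pi}\; \mathbb{E}_{\pi}\Big[\sum_{t=0}^{T} \gamma^{t}\, r(s_t, a_t)\Big]
\quad \text{s.t.} \quad
\mathbb{E}_{\pi}\Big[\sum_{t=0}^{T} \gamma^{t}\, c_k(s_t, a_t, e_t)\Big] \le d_k,
\qquad k = 1, \dots, K .
```

Under this reading the reward and the policy's observation carry no ESG term; ESG enters only through the costs $c_k$ and the budgets $d_k$.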

What carries the argument

The Multimodal Action-Conditioned Constraint Field (MACF) that learns mechanism-specific ESG costs from multimodal evidence and contemplated transitions, paired with MACF-X adapters that map those costs and uncertainties into native constrained-optimization interfaces via a slack- and uncertainty-aware pressure layer.
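The abstract does not give the functional form of the pressure layer, so the following is only a minimal sketch of what a slack- and uncertainty-aware pressure could look like. Every name here (`CostEstimate`, `violation`, `pressure`, `is_feasible`, `dual_update`, `kappa`) is hypothetical, and the pessimistic "mean plus kappa times uncertainty" rule is an assumed choice, not the authors' method.

```python
from dataclasses import dataclass


@dataclass
class CostEstimate:
    """Hypothetical MACF output for one ESG mechanism at one decision step."""
    mean: float         # predicted ESG cost of the contemplated transition
    uncertainty: float  # predictive standard deviation of that cost


def violation(est: CostEstimate, budget: float, slack: float, kappa: float = 1.0) -> float:
    """Signed budget violation (assumed form, not the paper's).

    The cost is inflated by kappa * uncertainty and compared with the
    budget relaxed by a slack allowance. Negative values mean headroom.
    """
    pessimistic_cost = est.mean + kappa * est.uncertainty
    return pessimistic_cost - (budget + slack)


def pressure(est: CostEstimate, budget: float, slack: float, kappa: float = 1.0) -> float:
    """Non-negative pressure: only the excess over the relaxed budget counts."""
    return max(0.0, violation(est, budget, slack, kappa))


def is_feasible(pressures: list[float]) -> bool:
    """Projection-style interface: accept an action only if no mechanism is under pressure."""
    return all(p == 0.0 for p in pressures)


def dual_update(multiplier: float, mean_violation: float, lr: float) -> float:
    """Lagrangian-style interface: projected dual ascent on the signed violation."""
    return max(0.0, multiplier + lr * mean_violation)


# Two illustrative mechanisms sharing one budget and slack.
estimates = [CostEstimate(0.30, 0.05), CostEstimate(0.10, 0.02)]
pressures = [pressure(e, budget=0.25, slack=0.02) for e in estimates]
```

The point of the sketch is that both interfaces consume the same cost and uncertainty estimates, which is how a shared layer could feed different optimizers without touching the financial policy.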

If this is right

Dynamic multimodal inputs and three-head decomposition are necessary; static ESG scores alone add no value beyond noise.
The same MACF costs can be routed through multiple constraint-integration interfaces without retraining the financial policy.
Tail ESG budget pressure can be reduced while preserving competitive risk-adjusted returns.
ESG is better handled as an explicit constraint dimension than as an alpha factor inside the reward or observation.
Ablation results indicate that mechanism-specific cost learning, not merely additional data volume, drives the observed improvement.
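The comparison against score-shuffled noise can be illustrated with a small sketch. The abstract does not say how the shuffling is done, so the version below assumes a cross-sectional shuffle that keeps the set of score values and reassigns them to assets at random; `shuffle_scores` and the tickers are hypothetical.

```python
import random


def shuffle_scores(scores: dict[str, float], seed: int = 0) -> dict[str, float]:
    """Reassign the same ESG score values to randomly permuted assets.

    The marginal distribution of scores is preserved, but any link
    between a score and the asset it was issued for is destroyed.
    """
    rng = random.Random(seed)
    assets = sorted(scores)              # fixed order so the seed is reproducible
    values = [scores[a] for a in assets]
    rng.shuffle(values)
    return dict(zip(assets, values))


static_scores = {"AAA": 0.82, "BBB": 0.41, "CCC": 0.67, "DDD": 0.15}
noise_scores = shuffle_scores(static_scores, seed=7)
```

If a constraint built from `static_scores` performs no better than one built from `noise_scores`, the static scores are contributing nothing beyond their marginal distribution, which is the sense of the first implication above.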

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Portfolio systems could incorporate real-time news, regulatory filings, and satellite imagery to update ESG costs intraday rather than at discrete rating dates.
The separation of constraint learning from the financial policy may generalize to other hard-to-quantify objectives such as carbon budgets or liquidity constraints.
If the learned costs prove stable across market regimes, reliance on third-party ESG rating providers could decline in favor of evidence-driven internal models.
The three-head decomposition structure offers a template for learning separate cost, uncertainty, and pressure heads in other sequential constrained-control settings.

Load-bearing premise

Point-in-time multimodal evidence is reliably available, timely, and sufficiently informative to learn mechanism-specific ESG costs that generalize beyond the training distribution.
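A minimal sketch of what "point-in-time" means operationally, assuming each evidence item carries a publication timestamp. The `Evidence` type, its fields, and the `point_in_time` helper are hypothetical; the paper's actual pipeline is not described in the abstract.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Evidence:
    asset: str
    published_at: datetime  # when the item became publicly available
    payload: str


def point_in_time(items: list[Evidence], asset: str, decision_time: datetime) -> list[Evidence]:
    """Return evidence for `asset` published strictly before the decision time."""
    visible = [e for e in items if e.asset == asset and e.published_at < decision_time]
    return sorted(visible, key=lambda e: e.published_at)


items = [
    Evidence("AAA", datetime(2024, 3, 1, 9, 0), "regulatory filing"),
    Evidence("AAA", datetime(2024, 3, 2, 16, 30), "news report"),
    Evidence("BBB", datetime(2024, 3, 1, 12, 0), "news report"),
]
visible = point_in_time(items, "AAA", datetime(2024, 3, 2, 9, 30))
```

The strict inequality is the conservative choice: an item stamped exactly at the decision time is treated as not yet available.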

What would settle it

On a held-out period with fresh multimodal inputs, MACF-X shows no statistically significant reduction in tail ESG budget pressure relative to a static-score baseline or a score-shuffled noise baseline.
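Such a check could be sketched as follows. The abstract does not define the tail metric or name a test, so both are assumptions here: `tail_mean` is a CVaR-style average of the worst observations, and `paired_sign_flip_pvalue` is a generic paired randomization test applied to per-run tail pressures (for example, one pair per seed or evaluation window).

```python
import random


def tail_mean(values: list[float], tail_frac: float = 0.05) -> float:
    """Mean of the largest `tail_frac` share of values (at least one value)."""
    ordered = sorted(values)
    k = max(1, round(len(ordered) * tail_frac))
    tail = ordered[-k:]
    return sum(tail) / len(tail)


def paired_sign_flip_pvalue(a: list[float], b: list[float],
                            n_perm: int = 10_000, seed: int = 0) -> float:
    """One-sided p-value for mean(a - b) < 0 under random sign flips of paired differences."""
    rng = random.Random(seed)
    diffs = [x - y for x, y in zip(a, b)]
    observed = sum(diffs) / len(diffs)
    hits = 0
    for _ in range(n_perm):
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs) / len(diffs)
        if flipped <= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)
```

A large p-value when `a` holds the MACF-X tail pressures and `b` holds those of the static-score or shuffled baseline would be the outcome described above.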

Figures

Figures reproduced from arXiv: 2605.09310 by Longbing Cao, Xin Li, Yan Ke.

**Figure 2.** One-step construction of action-conditioned MACF costs. For each asset, the structured …

**Figure 3.** Point-in-time ESG data construction pipeline. Market data are converted into daily …

Original abstract

ESG-aware portfolio optimization is increasingly important for sustainable capital allocation, yet most learning-based methods still operationalize ESG by appending static scores to the policy observation or reward. This creates a mismatch for sequential control: ESG scores are noisy, provider-dependent, low-frequency, and temporally misaligned with sequential portfolio decisions, while financial evidence suggests that ESG is better treated as a portfolio preference, risk-exposure, or hedge dimension than as a robust alpha factor. We propose to impose ESG constraints without modifying the financial policy's observation or reward, using a Multimodal Action-Conditioned Constraint Field (MACF) that learns mechanism-specific ESG costs from point-in-time multimodal evidence and contemplated portfolio transitions. We then introduce MACF-X, a family of optimizer-specific adapters that converts MACF costs and uncertainties into native constrained-optimization interfaces through a shared slack- and uncertainty-aware pressure layer. Across multiple constraint-integration interfaces, MACF-X reduces tail ESG budget pressure while maintaining competitive financial performance. Ablations show that this improvement depends on dynamic evidence inputs and three-head decomposition, while static ESG-score proxies are nearly indistinguishable from score-shuffled noise baselines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper separates ESG constraint learning from the financial policy using multimodal point-in-time evidence, but the abstract leaves the performance claims and data handling unverified.

The main point is that the authors treat ESG as a learned dynamic constraint rather than a static score added to observations or rewards. They build a Multimodal Action-Conditioned Constraint Field that pulls mechanism-specific costs from contemporaneous multimodal inputs and contemplated transitions, then supply MACF-X adapters to turn those costs into usable forms for different constrained optimizers. Across interfaces the setup reportedly cuts tail ESG budget pressure without much loss in financial performance, and the ablations tie the gains to the dynamic inputs and three-head structure rather than static proxies or noise baselines.

That separation and the emphasis on constraint interfaces rather than policy changes is the clearest practical step forward here. It directly targets the documented mismatch between infrequent, noisy ESG scores and sequential portfolio decisions, and it frames ESG more as a risk or preference dimension than an alpha factor, which aligns with some finance evidence. The architecture keeps the core financial policy untouched, which is a clean design choice.

The soft spots are mostly around missing specifics. The abstract supplies no equations, dataset descriptions, training details, or statistical tests, so the reported improvements cannot be checked for robustness or effect size. The stress-test concern about temporal leakage is reasonable on the given text: if the multimodal evidence pipeline or transition sampling allows any post-decision signals, the apparent advantage over baselines could be an artifact rather than evidence of reliable mechanism-aware learning. The claim that static scores behave like shuffled noise also needs more than an ablation summary to hold weight.

This paper is aimed at researchers working on constrained reinforcement learning for finance or on practical ESG integration in portfolio systems. A reader already thinking about dynamic constraints or multimodal inputs in sequential decisions could pull useful architectural ideas even if the experiments require closer inspection. It deserves peer review. The core problem is real and the proposed separation has enough structure to be tested properly by referees who can examine the full methods and data.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes a Multimodal Action-Conditioned Constraint Field (MACF) to learn mechanism-specific ESG costs from point-in-time multimodal evidence and contemplated portfolio transitions for sequential optimization. It introduces MACF-X adapters that convert these costs into native constrained-optimization interfaces via a slack- and uncertainty-aware pressure layer. The central claim is that this reduces tail ESG budget pressure across multiple interfaces while preserving competitive financial performance, with ablations demonstrating that the gains require dynamic evidence inputs and three-head decomposition (static ESG-score proxies perform similarly to score-shuffled noise baselines).

Significance. If the empirical results hold without temporal leakage, the work offers a meaningful advance in ESG-aware sequential control by decoupling constraint learning from the financial policy's observation and reward. The framework's compatibility with multiple optimizer interfaces and the ablation evidence distinguishing dynamic multimodal inputs from static or noisy baselines are strengths that could influence how ESG factors are operationalized in reinforcement-learning portfolio methods.

major comments (3)

[Abstract] Abstract: the central claims of performance improvement and ablation dependence on dynamic evidence plus three-head decomposition are stated without equations, dataset descriptions, training details, or statistical tests. This prevents verification of the reported reduction in tail ESG budget pressure.
[Methods (MACF and training)] MACF training procedure: the claim that MACF learns mechanism-specific costs from strictly contemporaneous multimodal evidence must be supported by explicit safeguards against lookahead or post-transition signals in the evidence pipeline. Without this, the superiority over static-score and noise baselines could be an in-sample artifact rather than evidence of robust constraint learning.
[Experiments and ablations] Ablation studies: the assertion that static ESG-score proxies are nearly indistinguishable from score-shuffled noise baselines is load-bearing for the argument favoring dynamic inputs. Specific quantitative metrics (e.g., tail-pressure differences, R^{2} values, or statistical significance) from these ablations are required to substantiate the claim.

minor comments (2)

[Abstract] The 'three-head decomposition' is referenced in the ablation discussion but not defined or motivated in the abstract, which reduces clarity for readers.
[MACF-X adapters] Notation for the slack- and uncertainty-aware pressure layer in MACF-X could be introduced with a brief equation or diagram to aid understanding of how costs are converted to native interfaces.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive review and for identifying areas where additional rigor and detail will strengthen the manuscript. We address each major comment below and have revised the paper accordingly where possible.

Referee: [Abstract] Abstract: the central claims of performance improvement and ablation dependence on dynamic evidence plus three-head decomposition are stated without equations, dataset descriptions, training details, or statistical tests. This prevents verification of the reported reduction in tail ESG budget pressure.

Authors: We agree that the abstract is high-level and omits supporting details. Due to length constraints, we cannot include full equations or exhaustive training procedures in the abstract. However, we have revised it to reference the multimodal dataset sources, note the use of statistical significance testing for performance differences, and briefly indicate the ablation structure. Full equations, dataset specifications, and training details remain in the Methods and Experiments sections.

revision: partial

Referee: [Methods (MACF and training)] MACF training procedure: the claim that MACF learns mechanism-specific costs from strictly contemporaneous multimodal evidence must be supported by explicit safeguards against lookahead or post-transition signals in the evidence pipeline. Without this, the superiority over static-score and noise baselines could be an in-sample artifact rather than evidence of robust constraint learning.

Authors: This concern about temporal leakage is valid and central to the validity of the dynamic-input claim. The original pipeline already restricted evidence to strictly point-in-time multimodal inputs available at the portfolio decision timestamp, with no post-transition or future signals. We have added an explicit subsection in Methods that details the timestamp alignment procedure, data filtering rules, and validation checks confirming absence of lookahead. These additions directly address the possibility of in-sample artifacts.

revision: yes

Referee: [Experiments and ablations] Ablation studies: the assertion that static ESG-score proxies are nearly indistinguishable from score-shuffled noise baselines is load-bearing for the argument favoring dynamic inputs. Specific quantitative metrics (e.g., tail-pressure differences, R^{2} values, or statistical significance) from these ablations are required to substantiate the claim.

Authors: We concur that the ablation claim requires quantitative backing. The revised Experiments section now includes a table reporting the specific metrics: tail ESG budget pressure differences (mean and standard deviation across runs), R^{2} values for the static-proxy versus noise baselines, and p-values from paired statistical tests. These numbers confirm the near-indistinguishability and thereby support the necessity of dynamic multimodal inputs.

revision: yes

Circularity Check

0 steps flagged

No circularity: empirical ablation results rest on data comparisons, not definitional reductions

The manuscript introduces MACF and MACF-X as a modeling approach for dynamic ESG constraints, then reports performance via cross-interface experiments and ablations that contrast dynamic multimodal inputs against static-score and shuffled-noise baselines. No equations or derivation steps are presented whose outputs are forced by construction from the inputs (e.g., no fitted parameter renamed as a prediction, no self-citation chain supplying a uniqueness theorem, no ansatz smuggled via prior work). The central claims are therefore falsifiable empirical statements rather than tautological restatements of the method itself, yielding a self-contained analysis with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

The central claim rests on the existence of learnable mechanism-specific ESG costs from multimodal evidence and on the assumption that these costs can be converted into native optimizer constraints without side effects on financial performance. The abstract states no explicit free parameters or axioms; MACF is the only newly introduced entity.

invented entities (1)

MACF (no independent evidence)
purpose: Learns mechanism-specific ESG costs from point-in-time multimodal evidence for use as dynamic constraints
Introduced as the core new component that separates ESG handling from the financial policy

pith-pipeline@v0.9.0 · 5493 in / 1415 out tokens · 51214 ms · 2026-05-12T04:19:49.976165+00:00 · methodology
