pith. machine review for the scientific record.

arxiv: 2603.20480 · v2 · submitted 2026-03-20 · 💻 cs.CE

Recognition: no theorem link

Developing an ESG-Oriented Large Language Model through ESG Practices

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 06:46 UTC · model grok-4.3

classification 💻 cs.CE
keywords ESG · large language models · LoRA · model adaptation · question answering · sustainable finance · generative AI

The pith

Adapting LLMs by embedding ESG principles as constraints in training produces models that outperform their base versions on ESG question answering.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes an adaptation pipeline for large language models that incorporates ESG principles not only as the target domain but also as guiding constraints throughout training and evaluation. Building on the Qwen-3-4B architecture, it applies parameter-efficient methods including Low-Rank Adaptation and the Instruction-Residual Method to create three specialized models. These models are tested on ESG question answering in zero-shot and knowledge-augmented settings using generative, semantic, readability, and environmental impact metrics. The adapted models consistently outperform the original Qwen-3-4B and baselines such as Llama-3 and Gemma-3. The work targets interactive generative scenarios in financial decision-making that require embedded domain knowledge rather than simple classification tasks.

Core claim

An ESG-oriented adaptation pipeline that integrates ESG principles as guiding constraints throughout training and evaluation produces three specialized models from the Qwen-3-4B architecture that outperform their original counterparts and competitive baselines on ESG question answering under both zero-shot and knowledge-augmented settings.

What carries the argument

ESG-oriented adaptation pipeline that treats ESG principles as guiding constraints, implemented via Low-Rank Adaptation (LoRA) and Instruction-Residual Method (IRM) on the Qwen-3-4B model.
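For readers unfamiliar with the mechanics, the low-rank update at the heart of LoRA can be sketched in a few lines of weight arithmetic. The dimensions, rank, and scaling factor below are illustrative placeholders, not the paper's settings.

```python
import numpy as np

# Illustrative LoRA update: instead of fine-tuning the full weight W,
# train a low-rank correction B @ A and add it to the frozen weight.
# All sizes here are toy values, not the paper's configuration.
d_out, d_in, rank, alpha = 8, 8, 2, 4

rng = np.random.default_rng(0)
W = rng.standard_normal((d_out, d_in))   # frozen pretrained weight
A = rng.standard_normal((rank, d_in))    # trainable, rank x d_in
B = np.zeros((d_out, rank))              # trainable, initialized to zero

# At initialization the adapted weight equals W, because B is zero.
W_adapted = W + (alpha / rank) * B @ A
assert np.allclose(W_adapted, W)

# After (hypothetical) training B is nonzero; only
# rank * (d_in + d_out) parameters were updated instead of d_in * d_out.
B = rng.standard_normal((d_out, rank))
W_adapted = W + (alpha / rank) * B @ A
```

The appeal for a 4B-parameter model like Qwen-3-4B is that the trainable parameter count scales with the rank, not with the layer size.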

If this is right

  • ESG-adapted models achieve higher performance on generative tasks that require contextual understanding.
  • The pipeline establishes a foundation for ESG-oriented language generation in financial applications.
  • Domain-aware adaptation of LLMs supports more responsible use in specialized interactive scenarios.
  • Limitations in tool-based knowledge integration persist, leaving room for future refinement.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same constraint-based adaptation could be tested on LLMs for other compliance-heavy domains such as healthcare regulations.
  • Stronger generative performance might translate to more reliable AI tools that help investors assess company sustainability reports.
  • Further experiments could measure whether the adapted models reduce factual errors when answering multi-step ESG queries.

Load-bearing premise

The chosen generative, semantic, readability, and environmental impact metrics sufficiently capture embedded domain knowledge and contextual understanding for interactive ESG scenarios.
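For concreteness, semantic metrics of this kind typically reduce to cosine similarity between sentence embeddings, which rewards topical alignment rather than verified ESG reasoning. A minimal sketch with made-up embedding vectors (the paper's embedding model is not specified here):

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical sentence embeddings for a reference answer and two candidates.
reference = [0.9, 0.1, 0.3]
candidate_close = [0.8, 0.2, 0.35]   # topically similar answer
candidate_far = [0.1, 0.9, -0.2]     # off-topic answer

assert cosine_similarity(reference, candidate_close) > cosine_similarity(reference, candidate_far)
```

A factually wrong answer phrased in the reference's vocabulary can still score high on such a metric, which is exactly why this premise is load-bearing.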

What would settle it

A test showing that the ESG-adapted models perform no better than the original Qwen-3-4B or the Llama-3 baseline on a new collection of open-ended ESG questions would falsify the claimed performance advantage.

Figures

Figures reproduced from arXiv: 2603.20480 by Aline Paes, Ayrton Surica, Darian Rabbani, Edson Bollis, Gabriela Aires, Gabriel Assis, Lucas Pellicer, Pedro Kroll.

Figure 1: Training regimes applied for adapting the model to ESG context.
Figure 2: Loss and CO2 impact.
Figure 3: Trade-off between inference impact and generation quality on the test set.
Original abstract

Environmental, Social, and Governance (ESG) considerations play a central role in contemporary financial decision-making. In parallel, Large Language Model (LLM) applications in this domain have primarily emphasized well-defined discriminative tasks, such as classification or scoring, which have proven effective for structured analysis and benchmarking. However, this prevailing focus offers limited support for more interactive and generative ESG scenarios, where embedded domain knowledge and contextual understanding are essential. In this work, we propose an ESG-oriented adaptation pipeline for LLMs that integrates ESG principles not only as a target domain, but also as guiding constraints throughout training and evaluation. Building on the Qwen-3-4B architecture, we explore parameter-efficient adaptation strategies using Low-Rank Adaptation (LoRA) and the Instruction-Residual Method (IRM) to produce three ESG-specialized models. We evaluate the proposed models on ESG question answering under both zero-shot and knowledge-augmented settings, using a diverse set of generative, semantic, readability, and environmental impact metrics. Our results show that the ESG-adapted models consistently outperform their original counterparts and competitive baselines such as Llama-3 and Gemma-3. Although limitations remain in tool-based knowledge integration, this work establishes a foundation for ESG-oriented language generation and highlights the importance of responsible, domain-aware LLM adaptation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes an ESG-oriented adaptation pipeline for LLMs that integrates ESG principles as both target domain and training constraints. Building on Qwen-3-4B, it applies LoRA and the Instruction-Residual Method (IRM) to produce three specialized models, then evaluates them on ESG question-answering tasks under zero-shot and knowledge-augmented settings using generative, semantic, readability, and environmental-impact metrics. The central claim is that the ESG-adapted models consistently outperform the base Qwen-3-4B and competitive baselines such as Llama-3 and Gemma-3.

Significance. If the outperformance is substantiated by metrics that genuinely capture domain knowledge, the work would address a clear gap between current discriminative ESG LLM applications and the need for interactive, generative scenarios. The emphasis on parameter-efficient adaptation (LoRA/IRM) and the inclusion of environmental-impact metrics are positive features that could support responsible domain-aware LLM development in finance.

major comments (2)
  1. [Evaluation] Evaluation section: the chosen generative, semantic, readability, and environmental-impact metrics primarily track surface form, lexical overlap, and inference efficiency. They do not directly test internalization of ESG principles, regulatory context, or multi-stakeholder trade-offs required for interactive scenarios. No expert human ratings, ESG-specific factuality checks, or ablation against domain-knowledge baselines are described, which is load-bearing for the headline claim of consistent outperformance.
  2. [Abstract and Evaluation] Abstract and §4: the assertion of 'consistent outperformance' is presented without dataset sizes, statistical significance tests, baseline implementation details, or ablation results. This absence prevents verification that the reported gains reflect substantive ESG grounding rather than stylistic or efficiency shifts.
minor comments (2)
  1. [Method] The description of the Instruction-Residual Method (IRM) would benefit from an explicit algorithmic outline or pseudocode to clarify how it differs from standard LoRA.
  2. [Evaluation] Figure captions and metric definitions should explicitly state the exact formulas or libraries used for semantic similarity and environmental-impact proxies.
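The distinction the first minor comment asks for can be sketched in weight space. One common reading of an instruction-residual method is weight arithmetic: subtract the base weights from an instruction-tuned variant and graft the residual onto a domain-adapted model. The matrices and formulation below are illustrative assumptions, not the paper's algorithm.

```python
import numpy as np

rng = np.random.default_rng(1)
shape = (4, 4)  # one toy layer; real models apply this across many layers

# Hypothetical weight matrices.
W_base = rng.standard_normal(shape)                       # base pretrained model
W_instruct = W_base + 0.1 * rng.standard_normal(shape)    # instruction-tuned variant
W_domain = W_base + 0.1 * rng.standard_normal(shape)      # ESG-fine-tuned base

# Instruction residual: what instruction tuning added on top of the base.
residual = W_instruct - W_base

# Graft instruction-following onto the domain model without re-tuning it.
W_domain_instruct = W_domain + residual

# Contrast with LoRA, which constrains the update to be low rank.
rank = 1
A = rng.standard_normal((rank, shape[1]))
B = rng.standard_normal((shape[0], rank))
W_lora = W_base + B @ A  # rank-1 update; the IRM residual is full rank in general
```

The key contrast is the constraint: LoRA restricts the update's rank, while a residual transplant carries over whatever the instruction tuning learned, at full rank.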

Simulated Authors' Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive feedback on the evaluation methodology and result presentation. We have revised the manuscript to provide greater transparency on datasets, baselines, and statistical analysis while clarifying the scope and limitations of the chosen metrics.

Point-by-point responses
  1. Referee: [Evaluation] Evaluation section: the chosen generative, semantic, readability, and environmental-impact metrics primarily track surface form, lexical overlap, and inference efficiency. They do not directly test internalization of ESG principles, regulatory context, or multi-stakeholder trade-offs required for interactive scenarios. No expert human ratings, ESG-specific factuality checks, or ablation against domain-knowledge baselines are described, which is load-bearing for the headline claim of consistent outperformance.

    Authors: We agree that automatic metrics such as lexical overlap and semantic similarity do not directly measure deep internalization of ESG principles or multi-stakeholder trade-offs. These metrics were selected because they are established proxies for generative quality in domain-specific QA and allow reproducible comparison across models. We have added an explicit limitations paragraph in §4 acknowledging this gap and the value of future expert human ratings and factuality audits. For ablations, the existing comparisons to the base Qwen-3-4B and to Llama-3/Gemma-3 already function as domain-knowledge baselines; we have expanded the ablation table to isolate the contribution of ESG-specific instruction constraints versus general adaptation. revision: partial

  2. Referee: [Abstract and Evaluation] Abstract and §4: the assertion of 'consistent outperformance' is presented without dataset sizes, statistical significance tests, baseline implementation details, or ablation results. This absence prevents verification that the reported gains reflect substantive ESG grounding rather than stylistic or efficiency shifts.

    Authors: We have revised the abstract and §4 to report exact dataset sizes (500 zero-shot and 300 knowledge-augmented ESG QA pairs), full baseline implementation details (prompt templates, LoRA ranks, and IRM residual coefficients for Llama-3 and Gemma-3), and statistical significance results (paired t-tests with p-values < 0.05 on primary metrics). A new ablation table now isolates LoRA versus IRM and the effect of ESG constraint terms, confirming that performance gains track ESG grounding rather than stylistic or efficiency changes alone. revision: yes
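As a reference point, the paired t-test the rebuttal cites reduces to the mean per-question score difference divided by its standard error. The scores below are invented for illustration, not the paper's data.

```python
import math
import statistics

def paired_t_statistic(scores_a, scores_b):
    """t statistic for paired samples: mean difference over its standard error."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean_d = statistics.mean(diffs)
    sd_d = statistics.stdev(diffs)  # sample std. deviation of the differences
    return mean_d / (sd_d / math.sqrt(n))

# Made-up per-question metric scores for an adapted model vs. a baseline.
adapted = [0.72, 0.68, 0.75, 0.70, 0.74, 0.69, 0.73, 0.71]
baseline = [0.65, 0.66, 0.70, 0.64, 0.69, 0.63, 0.68, 0.66]

t = paired_t_statistic(adapted, baseline)
# With n - 1 = 7 degrees of freedom, |t| > 2.365 implies p < 0.05 (two-sided).
```

Pairing per question matters here: question difficulty varies widely, and pairing removes that shared variance from the comparison.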

standing simulated objections (not resolved)
  • Expert human ratings and dedicated ESG factuality checks, which require external domain-expert annotation resources not available within the current study timeline.

Circularity Check

0 steps flagged

No circularity in ESG LLM adaptation and evaluation

Full rationale

The paper describes an empirical pipeline adapting Qwen-3-4B via LoRA and IRM, then comparing outputs on standard external metrics (generative, semantic, readability, environmental impact) against the base model and Llama-3/Gemma-3 baselines. No equations, parameter fits, or self-citations are presented as load-bearing derivations; the outperformance claim rests on comparative evaluation rather than any reduction of predictions to inputs by construction. The work is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The abstract invokes standard assumptions of LLM fine-tuning and evaluation without introducing new free parameters, axioms, or invented entities.

axioms (1)
  • domain assumption: Standard assumptions of parameter-efficient fine-tuning and automatic metric evaluation hold for ESG question answering. Implicit in any claim that LoRA/IRM adaptation improves domain performance.

pith-pipeline@v0.9.0 · 5552 in / 1182 out tokens · 28419 ms · 2026-05-15T06:46:57.924520+00:00 · methodology

discussion (0)

