arxiv: 2604.06826 · v1 · submitted 2026-04-08 · 💻 cs.CL · cs.AI


Environmental, Social and Governance Sentiment Analysis on Slovene News: A Novel Dataset and Models

Paula Dodig, Boshko Koloski, Katarina Sitar Šuštar, Senja Pollak, Matthew Purver


Pith reviewed 2026-05-10 17:24 UTC · model grok-4.3


keywords ESG sentiment analysis · Slovene news dataset · sentiment detection · large language models · SloBERTa · environmental social governance · low-resource language


The pith

The first public Slovene ESG sentiment dataset lets models automatically label news for environmental, social and governance views on companies.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates and releases the first dataset of Slovene news articles annotated for sentiment toward ESG topics. It uses LLM-assisted filtering of the MaCoCu news collection followed by human labeling of company-related content as positive, negative or neutral on each ESG dimension. Models are tested across categories, showing large language models reach the highest scores on environmental and social detection while a fine-tuned monolingual model leads on governance. A case study then applies the strongest model to track selected companies over multiple years. If accurate, this resource makes automated ESG monitoring possible in a low-resource language where manual ratings have been scarce.

Core claim

The authors present the first publicly available Slovene ESG sentiment dataset derived from news, constructed via LLM-assisted filtering and human annotation of company mentions. Benchmarking shows Gemma3-27B achieving 0.61 F1-macro on environmental classification, gpt-oss 20B reaching 0.45 F1-macro on social classification, and fine-tuned SloBERTa attaining 0.54 F1-macro on governance classification. The best-performing model is further demonstrated in a longitudinal case study of ESG aspects for selected companies.

What carries the argument

The Slovene ESG sentiment dataset, built by LLM-assisted selection of relevant news followed by human labeling of environmental, social and governance sentiment for companies.

If this is right

Automated detection of ESG sentiment becomes feasible for Slovene-language corporate news without full manual review.
Large language models can be used directly for environmental and social aspects while a fine-tuned SloBERTa model handles governance.
Long-term tracking of individual companies' ESG coverage in news is now practical using the released classifier.
The dataset supplies training data that can support further model development for other low-resource languages.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Investors or regulators working with smaller Central European firms could integrate the dataset to supplement existing ESG ratings.
Periodic updates to the dataset with fresh news would allow ongoing monitoring of how media sentiment on ESG topics evolves.
Similar LLM-plus-human pipelines might be tested for other languages or for finer-grained subtopics within each ESG category.

Load-bearing premise

The LLM-assisted filtering plus human annotation process produces accurate labels that represent true ESG sentiment in Slovene news without major selection bias or annotation mistakes.

What would settle it

Either of two results would undercut the claim: an independent re-annotation of a random subset of the dataset by multiple Slovene-speaking experts that agrees with the published labels on fewer than 65 percent of items, or a clear drop in model F1 scores when the models are tested on newly collected news from the same sources.
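A minimal sketch of how the re-annotation check could be scored, assuming the published and re-annotated labels are available as two aligned lists. The label values, the toy data, and the function names are illustrative assumptions, not taken from the paper; Cohen's kappa is added here because raw agreement does not correct for chance.

```python
from collections import Counter

def percent_agreement(a, b):
    """Share of items on which two aligned label sequences agree."""
    if len(a) != len(b) or not a:
        raise ValueError("label lists must be non-empty and equally long")
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohen_kappa(a, b):
    """Chance-corrected agreement between two aligned label sequences."""
    n = len(a)
    p_observed = percent_agreement(a, b)
    counts_a, counts_b = Counter(a), Counter(b)
    # Expected agreement if both annotators labelled independently
    p_expected = sum(counts_a[l] * counts_b[l] for l in set(a) | set(b)) / (n * n)
    if p_expected == 1.0:
        return 1.0
    return (p_observed - p_expected) / (1.0 - p_expected)

# Hypothetical labels for six articles (illustration only)
published = ["positive", "negative", "neutral", "neutral", "negative", "positive"]
reannotated = ["positive", "negative", "neutral", "positive", "negative", "neutral"]

agreement = percent_agreement(published, reannotated)
kappa = cohen_kappa(published, reannotated)
below_threshold = agreement < 0.65  # the 65 percent threshold named above
print(f"agreement={agreement:.2f} kappa={kappa:.2f} below_threshold={below_threshold}")
```

With several re-annotators, one option would be to compare the published labels against their majority vote, though that choice is also an assumption.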

Figures

Figures reproduced from arXiv: 2604.06826 by Boshko Koloski, Katarina Sitar Šuštar, Matthew Purver, Paula Dodig, Senja Pollak.

**Figure 2.** Methodology pipeline.

**Figure 3.** Normalized sentiment scores from media mentions (2010-2025).

Original abstract

Environmental, Social, and Governance (ESG) considerations are increasingly integral to assessing corporate performance, reputation, and long-term sustainability. Yet, reliable ESG ratings remain limited for smaller companies and emerging markets. We introduce the first publicly available Slovene ESG sentiment dataset and a suite of models for automatic ESG sentiment detection. The dataset, derived from the MaCoCu Slovene news collection, combines large language model (LLM)-assisted filtering with human annotation of company-related ESG content. We evaluate the performance of monolingual (SloBERTa) and multilingual (XLM-R) models, embedding-based classifiers (TabPFN), hierarchical ensemble architectures, and large language models. Results show that LLMs achieve the strongest performance on Environmental (Gemma3-27B, F1-macro: 0.61) and Social aspects (gpt-oss 20B, F1-macro: 0.45), while fine-tuned SloBERTa is the best model on Governance classification (F1-macro: 0.54). We then show in a small case study how the best-performing classifier (gpt-oss) can be applied to investigate ESG aspects for selected companies across a long time frame.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper releases the first Slovene ESG sentiment dataset, a real gap-filler for low-resource work, but the LLM-filtered annotation comes with no reported size, agreement, or validation numbers, so the F1 scores are hard to interpret.


The core contribution is straightforward: the first public dataset of Slovene news labeled for environmental, social, and governance sentiment, pulled from the MaCoCu collection via LLM-assisted filtering followed by human review. That fills an actual hole (no prior work cited has done this for Slovene), and the authors then benchmark a range of models including SloBERTa, XLM-R, TabPFN, hierarchical ensembles, and several LLMs. The results are unsurprising but informative: LLMs lead on environmental (Gemma3-27B at 0.61 macro F1) and social (gpt-oss 20B at 0.45), while fine-tuned SloBERTa leads on governance (0.54). A short case study applies the best model to company timelines, which is a practical touch.

Referee Report

3 major / 2 minor

Summary. The paper introduces the first publicly available Slovene ESG sentiment dataset derived from the MaCoCu news collection via LLM-assisted filtering followed by human annotation. It benchmarks monolingual (SloBERTa), multilingual (XLM-R), embedding-based (TabPFN), hierarchical-ensemble, and LLM-based models, reporting that LLMs achieve the highest F1-macro on the Environmental (Gemma3-27B: 0.61) and Social (gpt-oss 20B: 0.45) aspects while fine-tuned SloBERTa leads on Governance (0.54); a case study applies the best model to track company-level ESG trends over time.

Significance. If the dataset labels prove reliable, the work fills a clear gap in low-resource language resources for ESG analysis and provides a practical demonstration of model application to longitudinal company data. The empirical benchmarking across model families and the public release of the dataset are strengths that could support follow-on research in multilingual sentiment tasks.

major comments (3)

[Dataset construction] Dataset construction section: the LLM-assisted filtering + human annotation pipeline is described at high level only, with no reported inter-annotator agreement, number of annotators, annotation guidelines, class distribution, or validation of the LLM filter against a held-out gold sample. Because the central claims rest on the accuracy of these labels (e.g., the F1-macro scores of 0.61/0.45/0.54), the absence of these metrics makes the performance differences between models difficult to interpret.
[Results] Results section (model comparison table): the reported F1-macro scores are given without accompanying baselines (random or majority-class), statistical significance tests between models, or per-class precision/recall breakdowns. This is especially relevant for the Social aspect (0.45), where performance is close to chance levels for a three-way classification task.
[Methods] Methods section on annotation: no discussion or quantitative check is provided for possible selection bias introduced by the LLM filtering step before human review. In a low-resource setting this is load-bearing for the claim that the released dataset is representative of Slovene news ESG content.
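The baselines requested in the Results comment can be sketched as follows. This is a minimal illustration, assuming a three-label scheme and a made-up, imbalanced gold distribution; neither the label set nor the class proportions come from the paper, and `macro_f1` is a hand-written helper rather than the authors' evaluation code.

```python
import random
from collections import Counter

def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1; a class with no true positives scores 0."""
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

LABELS = ["negative", "neutral", "positive"]  # assumed label set

# Hypothetical imbalanced gold labels (not the paper's distribution)
y_true = ["neutral"] * 60 + ["negative"] * 25 + ["positive"] * 15

# Majority-class baseline: always predict the most frequent gold label
majority_label = Counter(y_true).most_common(1)[0][0]
majority_f1 = macro_f1(y_true, [majority_label] * len(y_true), LABELS)

# Uniform-random baseline with a fixed seed for repeatability
rng = random.Random(0)
random_f1 = macro_f1(y_true, [rng.choice(LABELS) for _ in y_true], LABELS)

print(f"majority macro-F1={majority_f1:.2f}, random macro-F1={random_f1:.2f}")
```

Because macro-F1 weights every class equally, the majority-class baseline is penalised for the classes it never predicts, so a reported score is easier to read with both baselines next to it.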

minor comments (2)

[Abstract] Abstract does not state dataset size, number of documents, or label distribution, which would help readers immediately assess the scale of the contribution.
[Case study] The case-study section would benefit from clearer description of the time window, number of companies examined, and any quantitative metrics beyond qualitative trends.

Simulated Author's Rebuttal

3 responses · 0 unresolved

Thank you for the constructive and detailed feedback on our manuscript. We appreciate the referee's identification of areas where additional transparency and analysis would strengthen the work. We address each major comment below and will make the corresponding revisions to the manuscript.


Referee: [Dataset construction] Dataset construction section: the LLM-assisted filtering + human annotation pipeline is described at high level only, with no reported inter-annotator agreement, number of annotators, annotation guidelines, class distribution, or validation of the LLM filter against a held-out gold sample. Because the central claims rest on the accuracy of these labels (e.g., the F1-macro scores of 0.61/0.45/0.54), the absence of these metrics makes the performance differences between models difficult to interpret.

Authors: We agree that these details are critical for evaluating label quality and interpreting model performance differences. In the revised manuscript we will expand the Dataset construction section to report the number of annotators, inter-annotator agreement, the annotation guidelines (as an appendix), class distributions for each aspect, and the results of a validation of the LLM filter on a held-out gold sample.
revision: yes

Referee: [Results] Results section (model comparison table): the reported F1-macro scores are given without accompanying baselines (random or majority-class), statistical significance tests between models, or per-class precision/recall breakdowns. This is especially relevant for the Social aspect (0.45), where performance is close to chance levels for a three-way classification task.

Authors: We acknowledge the value of these additions for a rigorous comparison. In revision we will include random and majority-class baselines, report statistical significance tests between models, and add per-class precision, recall, and F1 breakdowns alongside the macro scores, with particular attention to contextualizing the Social aspect results.
revision: yes

Referee: [Methods] Methods section on annotation: no discussion or quantitative check is provided for possible selection bias introduced by the LLM filtering step before human review. In a low-resource setting this is load-bearing for the claim that the released dataset is representative of Slovene news ESG content.

Authors: We agree that potential selection bias from the LLM filtering step requires explicit discussion and quantification. In the revised Methods section we will add a discussion of this issue together with a quantitative check, such as a comparison of ESG content distributions between the filtered set and a random sample from the full MaCoCu collection or reporting of human rejection rates on LLM-filtered items.
revision: yes
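The two quantitative checks named in this response could look roughly like the sketch below. It assumes the rejection counts and per-aspect counts are already tallied; all numbers are invented for illustration, and the choice of a Wilson interval and total variation distance is an assumption of this sketch, not something the rebuttal commits to.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)

def total_variation(counts_a, counts_b):
    """Total variation distance between two categorical count tables."""
    keys = set(counts_a) | set(counts_b)
    n_a, n_b = sum(counts_a.values()), sum(counts_b.values())
    return 0.5 * sum(
        abs(counts_a.get(k, 0) / n_a - counts_b.get(k, 0) / n_b) for k in keys
    )

# Hypothetical: annotators rejected 30 of 200 LLM-filtered items
rejected, reviewed = 30, 200
rejection_rate = rejected / reviewed
low, high = wilson_interval(rejected, reviewed)

# Hypothetical aspect counts in the filtered set vs. a random MaCoCu sample
filtered_counts = {"E": 50, "S": 30, "G": 20}
random_counts = {"E": 20, "S": 30, "G": 50}
tvd = total_variation(filtered_counts, random_counts)

print(f"rejection rate={rejection_rate:.2f} (95% CI {low:.2f} to {high:.2f}), TVD={tvd:.2f}")
```

A distance near 0 would suggest the filter preserves the aspect mix of the underlying collection, while a value near 1 would point to strong selection effects.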

Circularity Check

0 steps flagged

No circularity: purely empirical dataset creation and model benchmarking


The paper introduces a new Slovene ESG sentiment dataset derived from MaCoCu via LLM-assisted filtering plus human annotation, then benchmarks monolingual, multilingual, embedding-based, ensemble, and LLM classifiers on the resulting labels, reporting F1-macro scores (e.g., Gemma3-27B at 0.61 for Environmental). No equations, derivations, fitted parameters, uniqueness theorems, or ansatzes are present. All claims are grounded in the described data-construction pipeline and experimental results rather than any self-referential reduction; the work is self-contained as standard empirical NLP dataset release and evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the quality of human-annotated labels after LLM filtering and on standard assumptions of supervised classification; no free parameters, invented entities, or non-standard axioms are introduced beyond typical NLP practices.

axioms (1)

Domain assumption: Human annotation after LLM filtering yields reliable ground-truth labels for ESG sentiment.
The dataset construction explicitly combines LLM-assisted filtering with human annotation as the basis for all reported results.



Reference graph

Works this paper leans on

20 extracted references · 4 canonical work pages · 2 internal anchors

[1]

Environmental, Social and Governance Sentiment Analysis on Slovene News: A Novel Dataset and Models

Introduction Environmental, Social, and Governance (ESG) considerations have become essential in the evaluation of corporate performance and investment potential (Chen et al., 2023). Increased awareness of corporate sustainability has led to the integration of ESG metrics into financial and public evaluations of businesses. Despite this momentum, a significant...

2023
[2]

Related Work The increasing relevance of ESG topics has driven the development of computational methods for understanding sustainability discourse in text. Prior research on ESG-related text analysis has focused on company reports and financial disclosures, leveraging supervised machine learning to assess sentiment and topic relevance (Nassirtoussi et al., 2015)...

2015
[3]

irrelevant

SloESG-News 1.0 dataset To create an appropriate dataset, we extracted articles from the MaCoCu Slovenian dataset (Bañón et al., 2022), filtering for a curated list of Slovenian companies, defined by an expert in ESG focusing on companies where at least one of the three aspects E, S or G is strongly present. The data was preprocessed to extract a subs...

2022
[4]

Methodology for ESG modelling Our methods used to classify ESG-related sentiment on the proposed dataset focus on two different perspectives: adapting pre-trained machine learning models (such as BERT and TabPFN) and zero-shot querying of LLMs, ranging from the monolingual Slovene model GaMS to the multilingual reasoning model GPT-OSS, as well as building ...

2020
[5]

is a meta-learned classifier that performs approximate Bayesian inference through in-context learning without gradient-based training. Given embedding matrix $X \in \mathbb{R}^{n \times d}$ and labels $y$, TabPFN produces probabilistic predictions $p(y^* \mid X, y, x^*)$ for test instances $x^*$ through a single forward pass, leveraging patterns learned from synthetic tabular datasets during meta-t...

2020
[6]

Probability Extraction: Each base model produces probability distributions $P_a \in \mathbb{R}^{n \times 4}$ for aspect $a \in \{E, S, G\}$
[7]

Logit Transformation: Convert probabilities to logits to handle extreme values and provide unbounded feature space: $L_a = \log(\mathrm{clip}(P_a, \epsilon, 1.0))$ (1), where $\epsilon = 10^{-6}$ prevents numerical instability
[8]

Concatenation: For each base model, concatenate aspect logits: $X_{\text{base}} = [L_E \,\|\, L_S \,\|\, L_G] \in \mathbb{R}^{n \times 12}$ (2). This transformation preserves relative probability magnitudes while providing a more stable feature space for meta-learning, avoiding the compression of probabilities near 0 or 1 that can occur in linear scaling. 4.5. Meta-Level Ensemble Architecture We prop...
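The three steps above (probability extraction, the transformation of Eq. (1), and the concatenation of Eq. (2)) can be sketched in a few lines of plain Python. The probability values are invented, and the function names `to_logits` and `concat_aspects` are hypothetical; only the clipping constant, the four classes per aspect, and the resulting 12 features per document come from the excerpt.

```python
import math

EPS = 1e-6  # epsilon from Eq. (1)

def to_logits(probs):
    """Eq. (1): element-wise log(clip(P, eps, 1.0)) over an n x 4 matrix."""
    return [[math.log(min(max(p, EPS), 1.0)) for p in row] for row in probs]

def concat_aspects(l_e, l_s, l_g):
    """Eq. (2): row-wise [L_E || L_S || L_G], giving n x 12 features."""
    return [e + s + g for e, s, g in zip(l_e, l_s, l_g)]

# Hypothetical base-model outputs for n = 2 documents, 4 classes per aspect
p_e = [[0.7, 0.1, 0.1, 0.1], [0.0, 1.0, 0.0, 0.0]]
p_s = [[0.25, 0.25, 0.25, 0.25], [0.4, 0.3, 0.2, 0.1]]
p_g = [[0.1, 0.2, 0.3, 0.4], [0.9, 0.05, 0.03, 0.02]]

x_base = concat_aspects(to_logits(p_e), to_logits(p_s), to_logits(p_g))
print(len(x_base), len(x_base[0]))  # rows and columns of X_base
```

Note that Eq. (1) as written yields clipped log-probabilities rather than log-odds: a probability of exactly 0 maps to log(10^-6) instead of negative infinity, and a probability of 1 maps to 0.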
[9]

Level 1 (Family-Specific Meta-Models): Each base family $i$ trains an independent meta-MLP: $Z_i = \mathrm{MLP}_{\text{fam}_i}(X_{\text{fam}_i}) \in \mathbb{R}^{n \times 12}$ (4)
[10]

Level 2 (Cross-Family Aggregation): A second meta-MLP combines family-level outputs: $\hat{Y} = \mathrm{MLP}_{\text{final}}([Z_1 \,\|\, Z_2 \,\|\, \cdots \,\|\, Z_k])$ (5). This architecture allows each family to learn specialized combination strategies (e.g., TabPFN families may benefit from uncertainty calibration while transformer families may require confidence rescaling) before global a...
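The data flow of Eqs. (4) and (5) can be illustrated with a shape-only forward pass. This is a sketch under several assumptions that the excerpt does not confirm: the MLPs are untrained with random weights, each has one hidden layer of 16 ReLU units, every family contains two base models (so its input has 24 features), and the final output has 12 columns. It shows only how the two levels connect, not the authors' implementation.

```python
import random

def make_mlp(n_in, n_hidden, n_out, rng):
    """Randomly initialised one-hidden-layer MLP (untrained, for shapes only)."""
    w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden)] for _ in range(n_in)]
    w2 = [[rng.uniform(-0.5, 0.5) for _ in range(n_out)] for _ in range(n_hidden)]
    return w1, w2

def matmul(x, w):
    """Multiply an n x a matrix by an a x b matrix, both as nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]

def forward(mlp, x):
    """Hidden layer with ReLU, then a linear output layer."""
    w1, w2 = mlp
    hidden = [[max(0.0, v) for v in row] for row in matmul(x, w1)]
    return matmul(hidden, w2)

rng = random.Random(0)
n, k = 3, 2  # 3 documents, 2 model families (both values hypothetical)

# Assumed family inputs: two base models per family, 12 features each
x_fam = [[[rng.gauss(0, 1) for _ in range(24)] for _ in range(n)] for _ in range(k)]

# Level 1, Eq. (4): one meta-MLP per family, each mapping to n x 12
level1 = [make_mlp(24, 16, 12, rng) for _ in range(k)]
z = [forward(mlp, x) for mlp, x in zip(level1, x_fam)]

# Level 2, Eq. (5): concatenate Z_1 || ... || Z_k row-wise, then a final meta-MLP
z_cat = [sum((z_i[row] for z_i in z), []) for row in range(n)]
final_mlp = make_mlp(12 * k, 16, 12, rng)
y_hat = forward(final_mlp, z_cat)

print(len(z_cat[0]), len(y_hat), len(y_hat[0]))
```

Keeping every family output at 12 columns means the second-level input grows linearly with the number of families, which keeps the final meta-model small even when many base models are added.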

2011
[11]

Embedding Models + TabPFN: Extract embeddings from $D_{80}$, optionally apply SVD dimensionality reduction, fit TabPFN classifier, predict on $D_{20}$ and $D_{\text{test}}$
[12]

This nested validation promotes generalization and reduces overfitting to base-model biases

Transformer Models: Fine-tune on $D_{80}$ with early stopping based on $D_{20}$ performance, generate final predictions on $D_{20}$ and $D_{\text{test}}$ using best checkpoint. This produces two sets of meta-features per base model:
• $X^{(20)}_{\text{meta}} \in \mathbb{R}^{|D_{20}| \times 12}$: Meta-features for validation samples
• $X^{(\text{test})}_{\text{meta}} \in \mathbb{R}^{|D_{\text{test}}| \times 12}$: Meta-features for test samples
Stage 3: Meta-Model Training. Meta-model...
[13]

Final Tower

Results and Discussion The results presented in Tables 3–5 provide clear evidence that transformer-based architectures, supported by ensemble and multi-task learning strategies, are well suited for ESG sentiment classification in Slovene news. The consistent performance gains achieved by the multi-task fusion models across all three ESG dimensions indic...
[14]

problematic

Case Study: Qualitative Temporal ESG Evaluation After evaluating the proposed models, we select the gpt-oss-20b model to analyze the sentiment distribution over time for four companies of interest by analysing a large news media monitoring dataset for the period 2010-2025. The annual average sentiment score is computed by subtracting the count of negative sentim...

2010
[15]

Conclusions and Further Work This work presents the first publicly available Slovene ESG dataset and uses it as a resource for training LLM-based, transformer-based classification models and hierarchical ensembling. Beyond technical performance, these findings have broader implications for sustainability analytics: automated monitoring of ESG sentim...
[16]

Code Availability The source code is publicly available at https://github.com/bkolosk1/slo-news-esg
[17]

handle.net/11356/2102

Data Availability The dataset is publicly available at http://hdl.handle.net/11356/2102. Acknowledgments This work was supported by the Slovenian Research and Innovation Agency (ARIS) through the projects EMMA (Embeddings-based Techniques for Media Monitoring Applications; L2-50070), Large Language Models for Digital Humanities (LLM4DH; GC-0002), and th...
[18]

TabPFN: A Transformer That Solves Small Tabular Classification Problems in a Second

Bibliographical References S. Angioni et al. 2024. Exploring environmental, social, and governance (esg) discourse in news: An ai-powered investigation through knowledge graph analysis. IEEE Access. Dogu Araci. 2019. FinBERT: Financial sentiment analysis with pre-trained language models. In Proceedings of the 57th Annual Meeting of the Association for Comp...

2024
[19]

In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP)

RoBERTa: A robustly optimized BERT pretraining approach. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics. Srishti Mehra, Robert Louka, and Yixun Zhang

2019
[20]

Arman Khadjeh Nassirtoussi, Saeed Aghabozorgi, Teh Ying Wah, and David C.L

Esgbert: Language model to help with classification tasks related to companies environmental, social, and governance practices. arXiv preprint arXiv:2203.16788. Arman Khadjeh Nassirtoussi, Saeed Aghabozorgi, Teh Ying Wah, and David C.L. Ngo. 2015. Text mining for market prediction: A systematic review. Expert Systems with Applications, 41(16):7653–7670. Open...

2015