Environmental, Social and Governance Sentiment Analysis on Slovene News: A Novel Dataset and Models
Pith reviewed 2026-05-10 17:24 UTC · model grok-4.3
The pith
The first public Slovene ESG sentiment dataset lets models automatically label news for environmental, social and governance views on companies.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors present the first publicly available Slovene ESG sentiment dataset derived from news, constructed via LLM-assisted filtering and human annotation of company mentions. Benchmarking shows Gemma3-27B achieving 0.61 F1-macro on environmental classification, gpt-oss 20B reaching 0.45 F1-macro on social classification, and fine-tuned SloBERTa attaining 0.54 F1-macro on governance classification. The best-performing model is further demonstrated in a longitudinal case study of ESG aspects for selected companies.
What carries the argument
The Slovene ESG sentiment dataset, built by LLM-assisted selection of relevant news followed by human labeling of environmental, social and governance sentiment for companies.
If this is right
- Automated detection of ESG sentiment becomes feasible for Slovene-language corporate news without full manual review.
- Large language models can be used directly for environmental and social aspects while a fine-tuned SloBERTa model handles governance.
- Long-term tracking of individual companies' ESG coverage in news is now practical using the released classifier.
- The dataset supplies training data that can support further model development for other low-resource languages.
Where Pith is reading between the lines
- Investors or regulators working with smaller Central European firms could integrate the dataset to supplement existing ESG ratings.
- Periodic updates to the dataset with fresh news would allow ongoing monitoring of how media sentiment on ESG topics evolves.
- Similar LLM-plus-human pipelines might be tested for other languages or for finer-grained subtopics within each ESG category.
Load-bearing premise
The LLM-assisted filtering plus human annotation process produces accurate labels that represent true ESG sentiment in Slovene news without major selection bias or annotation mistakes.
What would settle it
An independent re-annotation of a random subset of the dataset by multiple Slovene-speaking experts that yields agreement below 65 percent with the published labels or a clear drop in model F1 scores when the models are tested on newly collected news from the same sources.
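The 65 percent threshold above can be checked mechanically once a re-annotated subset exists. The following is a minimal sketch, not the authors' procedure: the label names and the two toy label lists are hypothetical, and Cohen's kappa is added only as a common chance-corrected companion to raw agreement.

```python
from collections import Counter

def percent_agreement(published, reannotated):
    """Fraction of items on which the two label lists agree."""
    if len(published) != len(reannotated) or not published:
        raise ValueError("label lists must be non-empty and equally long")
    matches = sum(p == r for p, r in zip(published, reannotated))
    return matches / len(published)

def cohen_kappa(a, b):
    """Chance-corrected agreement between two label lists."""
    n = len(a)
    p_observed = percent_agreement(a, b)
    count_a, count_b = Counter(a), Counter(b)
    # Expected agreement if both label lists were drawn independently
    # from their own marginal label distributions.
    p_expected = sum(count_a[k] * count_b[k] for k in set(count_a) | set(count_b)) / (n * n)
    if p_expected == 1.0:
        return 1.0
    return (p_observed - p_expected) / (1.0 - p_expected)

# Hypothetical toy data: published labels vs. an independent re-annotation.
published = ["positive", "neutral", "negative", "neutral",
             "positive", "negative", "neutral", "neutral"]
reannotated = ["positive", "neutral", "neutral", "neutral",
               "negative", "negative", "neutral", "positive"]

agreement = percent_agreement(published, reannotated)
kappa = cohen_kappa(published, reannotated)
below_threshold = agreement < 0.65  # the criterion named in the text
```

With several experts, the same functions could be applied pairwise or against a majority vote; which aggregation to use is left open here.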
Original abstract
Environmental, Social, and Governance (ESG) considerations are increasingly integral to assessing corporate performance, reputation, and long-term sustainability. Yet, reliable ESG ratings remain limited for smaller companies and emerging markets. We introduce the first publicly available Slovene ESG sentiment dataset and a suite of models for automatic ESG sentiment detection. The dataset, derived from the MaCoCu Slovene news collection, combines large language model (LLM)-assisted filtering with human annotation of company-related ESG content. We evaluate the performance of monolingual (SloBERTa) and multilingual (XLM-R) models, embedding-based classifiers (TabPFN), hierarchical ensemble architectures, and large language models. Results show that LLMs achieve the strongest performance on Environmental (Gemma3-27B, F1-macro: 0.61) and Social aspects (gpt-oss 20B, F1-macro: 0.45), while fine-tuned SloBERTa is the best model on Governance classification (F1-macro: 0.54). We then show in a small case study how the best-performing classifier (gpt-oss) can be applied to investigate ESG aspects for selected companies across a long time frame.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the first publicly available Slovene ESG sentiment dataset derived from the MaCoCu news collection via LLM-assisted filtering followed by human annotation. It benchmarks monolingual (SloBERTa), multilingual (XLM-R), embedding-based (TabPFN), hierarchical ensembles, and LLM-based models, reporting that LLMs achieve the highest F1-macro on Environmental (Gemma3-27B: 0.61) and Social (gpt-oss 20B: 0.45) aspects while fine-tuned SloBERTa leads on Governance (0.54); a case study applies the best model to track company-level ESG trends over time.
Significance. If the dataset labels prove reliable, the work fills a clear gap in low-resource language resources for ESG analysis and provides a practical demonstration of model application to longitudinal company data. The empirical benchmarking across model families and the public release of the dataset are strengths that could support follow-on research in multilingual sentiment tasks.
major comments (3)
- [Dataset construction] Dataset construction section: the LLM-assisted filtering + human annotation pipeline is described at high level only, with no reported inter-annotator agreement, number of annotators, annotation guidelines, class distribution, or validation of the LLM filter against a held-out gold sample. Because the central claims rest on the accuracy of these labels (e.g., the F1-macro scores of 0.61/0.45/0.54), the absence of these metrics makes the performance differences between models difficult to interpret.
- [Results] Results section (model comparison table): the reported F1-macro scores are given without accompanying baselines (random or majority-class), statistical significance tests between models, or per-class precision/recall breakdowns. This is especially relevant for the Social aspect (0.45), where performance is close to chance levels for a three-way classification task.
- [Methods] Methods section on annotation: no discussion or quantitative check is provided for possible selection bias introduced by the LLM filtering step before human review. In a low-resource setting this is load-bearing for the claim that the released dataset is representative of Slovene news ESG content.
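The baseline request in the Results comment can be illustrated with a small calculation. This is a hedged sketch under stated assumptions: three sentiment classes (as the referee assumes) and an invented class distribution, since the page does not report the real one. It shows that a majority-class predictor can never exceed a macro-F1 of 1/3 on three classes, which is the kind of reference point the reported scores would be compared against.

```python
import random

def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1; a class with no support and no predictions scores 0."""
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Hypothetical label set and class distribution (not taken from the paper).
labels = ["negative", "neutral", "positive"]
y_true = ["neutral"] * 60 + ["positive"] * 25 + ["negative"] * 15

# Majority-class baseline: always predict the most frequent label.
majority_pred = ["neutral"] * len(y_true)
majority_f1 = macro_f1(y_true, majority_pred, labels)

# Uniform random baseline with a fixed seed for repeatability.
rng = random.Random(0)
random_pred = [rng.choice(labels) for _ in y_true]
random_f1 = macro_f1(y_true, random_pred, labels)
```

On the real data the same two baselines would be computed per aspect, using the actual label set and test split.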
minor comments (2)
- [Abstract] Abstract does not state dataset size, number of documents, or label distribution, which would help readers immediately assess the scale of the contribution.
- [Case study] The case-study section would benefit from clearer description of the time window, number of companies examined, and any quantitative metrics beyond qualitative trends.
Simulated Author's Rebuttal
Thank you for the constructive and detailed feedback on our manuscript. We appreciate the referee's identification of areas where additional transparency and analysis would strengthen the work. We address each major comment below and will make the corresponding revisions to the manuscript.
Point-by-point responses
- Referee: [Dataset construction] Dataset construction section: the LLM-assisted filtering + human annotation pipeline is described at high level only, with no reported inter-annotator agreement, number of annotators, annotation guidelines, class distribution, or validation of the LLM filter against a held-out gold sample. Because the central claims rest on the accuracy of these labels (e.g., the F1-macro scores of 0.61/0.45/0.54), the absence of these metrics makes the performance differences between models difficult to interpret.
  Authors: We agree that these details are critical for evaluating label quality and interpreting model performance differences. In the revised manuscript we will expand the Dataset construction section to report the number of annotators, inter-annotator agreement, the annotation guidelines (as an appendix), class distributions for each aspect, and the results of a validation of the LLM filter on a held-out gold sample.
  revision: yes
- Referee: [Results] Results section (model comparison table): the reported F1-macro scores are given without accompanying baselines (random or majority-class), statistical significance tests between models, or per-class precision/recall breakdowns. This is especially relevant for the Social aspect (0.45), where performance is close to chance levels for a three-way classification task.
  Authors: We acknowledge the value of these additions for a rigorous comparison. In revision we will include random and majority-class baselines, report statistical significance tests between models, and add per-class precision, recall, and F1 breakdowns alongside the macro scores, with particular attention to contextualizing the Social aspect results.
  revision: yes
- Referee: [Methods] Methods section on annotation: no discussion or quantitative check is provided for possible selection bias introduced by the LLM filtering step before human review. In a low-resource setting this is load-bearing for the claim that the released dataset is representative of Slovene news ESG content.
  Authors: We agree that potential selection bias from the LLM filtering step requires explicit discussion and quantification. In the revised Methods section we will add a discussion of this issue together with a quantitative check, such as a comparison of ESG content distributions between the filtered set and a random sample from the full MaCoCu collection or reporting of human rejection rates on LLM-filtered items.
  revision: yes
Circularity Check
No circularity: purely empirical dataset creation and model benchmarking
full rationale
The paper introduces a new Slovene ESG sentiment dataset derived from MaCoCu via LLM-assisted filtering plus human annotation, then benchmarks monolingual, multilingual, embedding-based, ensemble, and LLM classifiers on the resulting labels, reporting F1-macro scores (e.g., Gemma3-27B at 0.61 for Environmental). No equations, derivations, fitted parameters, uniqueness theorems, or ansatzes are present. All claims are grounded in the described data-construction pipeline and experimental results rather than any self-referential reduction; the work is self-contained as standard empirical NLP dataset release and evaluation.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Human annotation after LLM filtering yields reliable ground-truth labels for ESG sentiment.
Reference graph
Works this paper leans on
-
[1]
Environmental, Social and Governance Sentiment Analysis on Slovene News: A Novel Dataset and Models
Introduction. Environmental, Social, and Governance (ESG) considerations have become essential in the evaluation of corporate performance and investment potential (Chen et al., 2023). Increased awareness of corporate sustainability has led to the integration of ESG metrics into financial and public evaluations of businesses. Despite this momentum, a significant...
2023
-
[2]
Related Work. The increasing relevance of ESG topics has driven the development of computational methods for understanding sustainability discourse in text. Prior research on ESG-related text analysis has focused on company reports and financial disclosures, leveraging supervised machine learning to assess sentiment and topic relevance (Nassirtoussi et al., 2015)...
2015
-
[3]
irrelevant
SloESG-News 1.0 dataset. To create an appropriate dataset, we extracted articles from the MaCoCu Slovenian dataset (Bañón et al., 2022), filtering for a curated list of Slovenian companies, defined by an expert in ESG focusing on companies where at least one of the three aspects E, S or G is strongly present. The data was preprocessed to extract a subs...
2022
-
[4]
Methodology for ESG modelling. Our methods used to classify ESG-related sentiment on the proposed dataset focus on two different perspectives: adapting pre-trained machine learning models (such as BERT and TabPFN) and zero-shot querying of LLMs, ranging from the monolingual Slovene model GaMS to the multilingual reasoning model GPT-OSS, as well as building ...
2020
-
[5]
is a meta-learned classifier that performs approximate Bayesian inference through in-context learning without gradient-based training. Given embedding matrix $X \in \mathbb{R}^{n \times d}$ and labels $y$, TabPFN produces probabilistic predictions $p(y^* \mid X, y, x^*)$ for test instances $x^*$ through a single forward pass, leveraging patterns learned from synthetic tabular datasets during meta-t...
2020
-
[6]
Probability Extraction: Each base model produces probability distributions $P_a \in \mathbb{R}^{n \times 4}$ for aspect $a \in \{E, S, G\}$.
-
[7]
Logit Transformation: Convert probabilities to logits to handle extreme values and provide unbounded feature space: $L_a = \log(\operatorname{clip}(P_a, \epsilon, 1.0))$ (1), where $\epsilon = 10^{-6}$ prevents numerical instability.
-
[8]
Concatenation: For each base model, concatenate aspect logits: $X_{\mathrm{base}} = [L_E \,\|\, L_S \,\|\, L_G] \in \mathbb{R}^{n \times 12}$ (2). This transformation preserves relative probability magnitudes while providing a more stable feature space for meta-learning, avoiding the compression of probabilities near 0 or 1 that can occur in linear scaling.
4.5. Meta-Level Ensemble Architecture. We prop...
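The three feature-construction steps above (probability extraction, the log transform of equation (1), and the concatenation of equation (2)) can be sketched in a few lines. This is an illustrative reading of the excerpt, not the authors' code: the function names and the toy probabilities are hypothetical. Note that equation (1), as extracted, takes the log of the clipped probability, so that is what the sketch computes.

```python
import math

EPS = 1e-6  # epsilon from equation (1)

def log_clip(probs, eps=EPS):
    """Apply log(clip(p, eps, 1.0)) element-wise to an n x 4 probability matrix."""
    return [[math.log(min(max(p, eps), 1.0)) for p in row] for row in probs]

def build_meta_features(p_env, p_soc, p_gov):
    """Concatenate the transformed E, S and G matrices row-wise into n x 12 features."""
    l_env, l_soc, l_gov = log_clip(p_env), log_clip(p_soc), log_clip(p_gov)
    return [e + s + g for e, s, g in zip(l_env, l_soc, l_gov)]

# Hypothetical outputs of one base model for two documents, four classes per aspect.
p_env = [[0.7, 0.2, 0.1, 0.0], [0.25, 0.25, 0.25, 0.25]]
p_soc = [[0.1, 0.1, 0.1, 0.7], [0.4, 0.3, 0.2, 0.1]]
p_gov = [[1.0, 0.0, 0.0, 0.0], [0.05, 0.05, 0.1, 0.8]]

x_base = build_meta_features(p_env, p_soc, p_gov)
```

The clipping matters for the exact zeros in the toy data: without it, `math.log(0.0)` would raise an error.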
-
[9]
Level 1 (Family-Specific Meta-Models): Each base family $i$ trains an independent meta-MLP: $Z_i = \mathrm{MLP}_{\mathrm{fam}_i}(X_{\mathrm{fam}_i}) \in \mathbb{R}^{n \times 12}$ (4)
-
[10]
Level 2 (Cross-Family Aggregation): A second meta-MLP combines family-level outputs: $\hat{Y} = \mathrm{MLP}_{\mathrm{final}}([Z_1 \,\|\, Z_2 \,\|\, \cdots \,\|\, Z_k])$ (5). This architecture allows each family to learn specialized combination strategies (e.g., TabPFN families may benefit from uncertainty calibration while transformer families may require confidence rescaling) before global a...
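The data flow of the two levels in equations (4) and (5) can be traced with a small forward-pass sketch. Everything concrete here is an assumption made for illustration: the MLPs are untrained with random weights, the hidden size of 16 is arbitrary, each family input is taken to be the concatenation of 12-dimensional features from its base models, and the final output is assumed to be n x 12. Training is deliberately left out.

```python
import random

def make_mlp(in_dim, hidden_dim, out_dim, rng):
    """Return the forward function of an untrained one-hidden-layer ReLU MLP."""
    def init(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]
    w1, w2 = init(in_dim, hidden_dim), init(hidden_dim, out_dim)

    def forward(batch):
        outputs = []
        for x in batch:
            hidden = [max(0.0, sum(x[i] * w1[i][j] for i in range(in_dim)))
                      for j in range(hidden_dim)]
            outputs.append([sum(hidden[j] * w2[j][k] for j in range(hidden_dim))
                            for k in range(out_dim)])
        return outputs
    return forward

def hierarchical_forward(family_inputs, family_mlps, final_mlp):
    # Level 1: each family maps its own features to Z_i (n x 12).
    z = [mlp(x) for mlp, x in zip(family_mlps, family_inputs)]
    # Level 2: concatenate Z_1 .. Z_k per sample and apply the final MLP.
    stacked = [sum(rows, []) for rows in zip(*z)]
    return final_mlp(stacked)

rng = random.Random(0)
n = 3                   # number of samples
family_dims = [24, 36]  # hypothetical: families with 2 and 3 base models
family_inputs = [[[rng.uniform(-14.0, 0.0) for _ in range(d)] for _ in range(n)]
                 for d in family_dims]
family_mlps = [make_mlp(d, 16, 12, rng) for d in family_dims]
final_mlp = make_mlp(12 * len(family_dims), 16, 12, rng)

y_hat = hierarchical_forward(family_inputs, family_mlps, final_mlp)
```

Keeping the family-level models separate means each one only ever sees features from its own family, which is the property the excerpt highlights.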
2011
-
[11]
Embedding Models + TabPFN: Extract embeddings from $D_{80}$, optionally apply SVD dimensionality reduction, fit TabPFN classifier, predict on $D_{20}$ and $D_{\mathrm{test}}$
-
[12]
This nested validation promotes generalization and reduces overfitting to base-model biases
Transformer Models: Fine-tune on $D_{80}$ with early stopping based on $D_{20}$ performance, generate final predictions on $D_{20}$ and $D_{\mathrm{test}}$ using best checkpoint. This produces two sets of meta-features per base model:
• $X^{(20)}_{\mathrm{meta}} \in \mathbb{R}^{|D_{20}| \times 12}$: Meta-features for validation samples
• $X^{(\mathrm{test})}_{\mathrm{meta}} \in \mathbb{R}^{|D_{\mathrm{test}}| \times 12}$: Meta-features for test samples
Stage 3: Meta-Model Training. Meta-model...
-
[13]
Final Tower
Results and Discussion. The results presented in Tables 3–5 provide clear evidence that transformer-based architectures, supported by ensemble and multi-task learning strategies, are well suited for ESG sentiment classification in Slovene news. The consistent performance gains achieved by the multi-task fusion models across all three ESG dimensions indic...
-
[14]
Case Study: Qualitative Temporal ESG Evaluation. After evaluating the proposed models, we select the gpt-oss-20b model to analyze the sentiment distribution over time for four companies of interest by analysing a large news media monitoring dataset for the period 2010-2025. The annual average sentiment score is computed by subtracting the count of negative sentim...
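The excerpt is cut off before the score is fully defined, so the following is only one plausible reading, offered as a sketch: it assumes the negative count is subtracted from the positive count and that the difference is divided by the number of articles in that year. The label names, the normalisation, and the toy records are all assumptions, not details confirmed by the source.

```python
from collections import defaultdict

def annual_sentiment_scores(records):
    """Map year -> (positives - negatives) / total articles, under the stated assumptions."""
    counts = defaultdict(lambda: {"positive": 0, "negative": 0, "total": 0})
    for year, label in records:
        bucket = counts[year]
        bucket["total"] += 1
        if label in ("positive", "negative"):
            bucket[label] += 1
    return {
        year: (c["positive"] - c["negative"]) / c["total"]
        for year, c in sorted(counts.items())
    }

# Hypothetical (year, predicted label) pairs for one company and one ESG aspect.
records = [
    (2010, "positive"), (2010, "negative"), (2010, "positive"), (2010, "neutral"),
    (2011, "negative"), (2011, "neutral"),
]

scores = annual_sentiment_scores(records)
```

Under this reading a score near +1 means mostly positive coverage in that year and a score near -1 mostly negative coverage.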
-
[15]
Conclusions and Further Work This work presents the first publicly available Slovene ESG dataset and uses it as a resource for training LLM-based, transformer-based classi- fication models and hierarchical ensembling. Be- yond technical performance, these findings have broader implications for sustainability analytics: au- tomated monitoring of ESG sentim...
-
[16]
Code Availability. The source code is publicly available at https://github.com/bkolosk1/slo-news-esg
-
[17]
handle.net/11356/2102
Data Availability. The dataset is publicly available at http://hdl.handle.net/11356/2102. Acknowledgments. This work was supported by the Slovenian Research and Innovation Agency (ARIS) through the projects EMMA (Embeddings-based Techniques for Media Monitoring Applications; L2-50070), Large Language Models for Digital Humanities (LLM4DH; GC-0002), and th...
-
[18]
TabPFN: A Transformer That Solves Small Tabular Classification Problems in a Second
Bibliographical References. S. Angioni et al. 2024. Exploring environmental, social, and governance (esg) discourse in news: An ai-powered investigation through knowledge graph analysis. IEEE Access. Dogu Araci. 2019. FinBERT: Financial sentiment analysis with pre-trained language models. In Proceedings of the 57th Annual Meeting of the Association for Comp...
2024
-
[19]
In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP)
RoBERTa: A robustly optimized bert pretraining approach. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics. Srishti Mehra, Robert Louka, and Yixun Zhang
2019
-
[20]
Arman Khadjeh Nassirtoussi, Saeed Aghabozorgi, Teh Ying Wah, and David C.L
Esgbert: Language model to help with classification tasks related to companies environmental, social, and governance practices. arXiv preprint arXiv:2203.16788. Arman Khadjeh Nassirtoussi, Saeed Aghabozorgi, Teh Ying Wah, and David C.L. Ngo. 2015. Text mining for market prediction: A systematic review. Expert Systems with Applications, 41(16):7653–7670. Open...