LLM-Based Financial Sentiment Analysis in Arabic: Evidence from Saudi Markets
Pith reviewed 2026-05-20 05:24 UTC · model grok-4.3
The pith
A multi-stage pipeline builds an 84,000-sample Arabic financial sentiment dataset supporting company-level analysis on the Saudi Exchange.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors integrate official financial news and social media through a multi-stage pipeline: data collection, cleaning, deduplication, entity linking (transformer-based NER plus a curated company lexicon), and five-class sentiment annotation. The resulting dataset of 84K samples supports company-level sentiment aggregation and analysis of sentiment dynamics relative to stock market behavior on the Saudi Exchange. The paper presents its experiments as demonstrating reliable and scalable Arabic financial sentiment analysis.
What carries the argument
The multi-stage pipeline for Arabic financial corpus construction, with transformer-based NER for entity linking to canonical company identifiers combined with five-class sentiment labeling.
If this is right
- Sentiment aggregation becomes possible at the level of individual companies listed on the Saudi Exchange.
- Sentiment dynamics can be tracked over time in relation to actual stock market movements.
- The framework provides a scalable method for financial sentiment analysis in Arabic without relying solely on English resources.
- Both institutional investor sentiment from news and public sentiment from social media can be captured and compared.
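A minimal sketch of the company-level aggregation these points rely on, assuming each sample carries a canonical company identifier, a date string, and one of five labels. The label names and their mapping to integer scores are assumptions made for illustration; the paper's own class names are not given on this page.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical five-class scheme mapped onto a signed scale.
SCORE = {
    "very_negative": -2,
    "negative": -1,
    "neutral": 0,
    "positive": 1,
    "very_positive": 2,
}


def daily_company_sentiment(samples):
    """Average the sentiment score per (company_id, day).

    `samples` is an iterable of (company_id, day, label) tuples.
    """
    buckets = defaultdict(list)
    for company_id, day, label in samples:
        buckets[(company_id, day)].append(SCORE[label])
    return {key: mean(values) for key, values in buckets.items()}


samples = [
    ("2222", "2024-01-02", "positive"),
    ("2222", "2024-01-02", "very_positive"),
    ("1120", "2024-01-02", "negative"),
]
series = daily_company_sentiment(samples)
```

The resulting per-day series is the kind of signal that could be lined up against price or return data; how the paper itself weights news against social media is not specified here.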
Where Pith is reading between the lines
- Such a dataset could enable the development of Arabic-specific predictive models for stock price movements based on sentiment signals.
- Similar pipelines could be applied to other Arabic financial markets to build comparable resources.
- The work underscores the importance of language-specific entity linking and annotation for accurate sentiment in financial texts.
Load-bearing premise
The multi-stage pipeline including automated entity linking and sentiment annotation produces labels that truly represent investor sentiment in Arabic financial texts.
What would settle it
If a random sample of the dataset is manually labeled by Arabic-speaking financial experts and shows substantial disagreement with the automated five-class labels, that would undermine the claim of reliable analysis.
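One way to quantify that disagreement is Cohen's kappa between the expert labels and the automated labels. The sketch below is a plain-Python illustration under the assumption of two label sequences of equal length; it is not a procedure taken from the paper.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two label sequences."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equally long")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if both raters labeled independently
    # according to their own label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


expert = ["pos", "pos", "neg", "neg"]
automated = ["pos", "neg", "neg", "neg"]
kappa = cohen_kappa(expert, automated)
```

For the toy input above, observed agreement is 0.75 and expected agreement is 0.5, so kappa is 0.5. Values near zero on a real expert-labeled sample would indicate the kind of substantial disagreement described above.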
Original abstract
Investor sentiment shapes financial markets, yet modeling sentiment in Arabic financial contexts remains challenging due to linguistic complexity and limited resources. We present an Arabic NLP framework for large-scale financial sentiment analysis tailored to the Saudi market, integrating official financial news and social media to capture institutional and public investor sentiment. The framework constructs a large Arabic financial corpus through a multi-stage pipeline encompassing data collection, cleaning, deduplication, entity linking, and sentiment annotation. Transformer-based NER combined with a curated company lexicon links textual mentions to canonical company identifiers, with sentiment labels assigned using a five-class scheme. The resulting dataset of 84K samples supports company-level sentiment aggregation and analysis of sentiment dynamics relative to stock market behavior on the Saudi Exchange. Experimental results demonstrate reliable and scalable Arabic financial sentiment analysis.
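The abstract does not say how deduplication is performed. As a hedged illustration only, the sketch below shows exact-match deduplication on a cleaned fingerprint; the cleaning rules (dropping URLs, collapsing whitespace, lowercasing) and the function names are assumptions, and near-duplicate detection would need a different technique.

```python
import hashlib
import re

_URL = re.compile(r"https?://\S+")
_WHITESPACE = re.compile(r"\s+")


def fingerprint(text):
    """Hash a lightly cleaned version of the text."""
    cleaned = _URL.sub(" ", text)
    cleaned = _WHITESPACE.sub(" ", cleaned).strip().lower()
    return hashlib.sha256(cleaned.encode("utf-8")).hexdigest()


def deduplicate(texts):
    """Keep the first occurrence of each fingerprint, preserving order."""
    seen = set()
    unique = []
    for text in texts:
        key = fingerprint(text)
        if key not in seen:
            seen.add(key)
            unique.append(text)
    return unique


posts = [
    "Aramco profit rises  https://x.example/a",
    "aramco profit rises",
    "SABIC output falls",
]
kept = deduplicate(posts)
```

Here the second post is dropped because it matches the first once the URL and case differences are removed.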
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents an Arabic NLP framework for large-scale financial sentiment analysis tailored to Saudi markets. It describes a multi-stage pipeline for constructing an 84K-sample dataset from official financial news and social media, using transformer-based NER combined with a company lexicon for entity linking and a five-class scheme for sentiment annotation. The resulting dataset is positioned to enable company-level sentiment aggregation and analysis of sentiment dynamics relative to stock market behavior on the Saudi Exchange, with experimental results claimed to demonstrate reliable and scalable Arabic financial sentiment analysis.
Significance. If the label quality were demonstrated, the work would offer a substantial empirical contribution by filling a resource gap in Arabic financial NLP and enabling new analyses of investor sentiment in an emerging market. The scale of the 84K dataset and the integration of institutional and public sources represent a clear strength in data-construction efforts.
major comments (2)
- Abstract and §3 (Methodology): The central claim that the 84K-sample dataset 'supports company-level sentiment aggregation and analysis of sentiment dynamics' and yields 'reliable' results rests on the unverified accuracy of the five-class sentiment annotation step. No accuracy, F1-score, inter-annotator agreement, or expert validation metrics are reported for this component, leaving the downstream aggregation and correlation analyses without grounding.
- §4 (Experiments): The assertion of 'reliable and scalable' performance is stated without any baseline comparisons, error analysis, or quantitative evaluation of the full pipeline on held-out data, which is load-bearing for the claim that the framework advances Arabic financial sentiment analysis.
minor comments (1)
- §3: The description of the five-class sentiment scheme would benefit from an explicit definition or example labels in the text or a table.
Simulated Author's Rebuttal
We are grateful to the referee for their thorough review and constructive feedback on our manuscript. We address each major comment below and outline the revisions we plan to make to strengthen the presentation of our work.
- Referee: Abstract and §3 (Methodology): The central claim that the 84K-sample dataset 'supports company-level sentiment aggregation and analysis of sentiment dynamics' and yields 'reliable' results rests on the unverified accuracy of the five-class sentiment annotation step. No accuracy, F1-score, inter-annotator agreement, or expert validation metrics are reported for this component, leaving the downstream aggregation and correlation analyses without grounding.
  Authors: We agree that quantitative validation of the sentiment annotation step is necessary to ground the downstream claims. The manuscript describes the five-class scheme and its integration into the pipeline but does not include accuracy, F1, or inter-annotator agreement figures. In the revised version we will add a dedicated subsection reporting inter-annotator agreement computed on a stratified sample of annotations, together with expert validation results on a held-out subset, thereby providing the required empirical support for the company-level aggregation analyses.
  Revision: yes
- Referee: §4 (Experiments): The assertion of 'reliable and scalable' performance is stated without any baseline comparisons, error analysis, or quantitative evaluation of the full pipeline on held-out data, which is load-bearing for the claim that the framework advances Arabic financial sentiment analysis.
  Authors: The current §4 presents the results of applying the pipeline at scale and initial sentiment-market correlations, yet we acknowledge the absence of explicit baselines, error analysis, and held-out quantitative evaluation. We will revise the section to include (i) comparisons against existing Arabic sentiment baselines, (ii) a detailed error analysis of the full pipeline, and (iii) performance metrics on a held-out test partition, thereby more rigorously substantiating the claims of reliability and scalability.
  Revision: yes
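The held-out evaluation promised in the rebuttal could take a form like the sketch below: macro-averaged F1 on a test partition, compared against a trivial majority-class baseline. The label names and data are placeholders, and this is an illustration of the metric rather than the authors' evaluation code.

```python
from collections import Counter


def macro_f1(gold, pred, labels):
    """Unweighted mean of per-class F1; a class with no support and no predictions scores 0."""
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def majority_baseline(train_gold, n_test):
    """Predict the most frequent training label for every test item."""
    top_label = Counter(train_gold).most_common(1)[0][0]
    return [top_label] * n_test


labels = ["pos", "neg", "neu"]
gold = ["pos", "pos", "neg", "neu"]
pred = ["pos", "neg", "neg", "neu"]

model_score = macro_f1(gold, pred, labels)
baseline_score = macro_f1(gold, majority_baseline(gold, len(gold)), labels)
```

Macro averaging weights all classes equally, which matters for a five-class scheme where extreme sentiment classes are likely to be rare; a model that barely beats the majority baseline on this metric would not support a claim of reliability.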
Circularity Check
No circularity: empirical dataset construction without derivational reduction
The paper presents an empirical multi-stage pipeline for data collection, cleaning, entity linking via transformer NER plus lexicon, and five-class sentiment annotation to produce an 84K-sample Arabic financial corpus. No equations, mathematical derivations, fitted parameters, or predictions are described that could reduce to inputs by construction. Claims about supporting company-level aggregation and sentiment dynamics analysis rest on the pipeline output and experimental results rather than on any self-referential loop or load-bearing self-citation. This is self-contained empirical work with no load-bearing steps that match the enumerated circularity patterns.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Transformer-based NER combined with a curated lexicon can reliably link Arabic textual mentions to canonical company identifiers.
- domain assumption: Five-class sentiment annotation on the collected corpus accurately reflects investor sentiment.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction
  Tag: unclear (relation between the paper passage and the cited Recognition theorem)
  Passage: "multi-stage pipeline encompassing data collection, cleaning, deduplication, entity linking, and sentiment annotation... five-class scheme"
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel
  Tag: unclear (relation between the paper passage and the cited Recognition theorem)
  Passage: "Transformer-based NER combined with a curated company lexicon"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
A. Reproducibility
A.1. Model Configurations
All models were configured with deterministic sampling (temperature = 0.0) to ensure reproducibil...
... were not evaluated due to API availability constraints during the evaluation period.
A.2. Production Deployment Requirements
Beyond benchmark metrics, models must satisfy the following requirements for production integration:
- Taxonomy Compliance: Output exactly five sentiment classes without category collapse
- Structured Output: Return JSON format with sentiment labels and confidence scores
- Reproducibility: Generate identical predictions with deterministic sampling (temperature = 0)
- Latency: Complete inference within 5 minutes per 1,000 samples
- Cost Efficiency: Maintain inference cost below $0.0012 per sample
A.3. Dataset Availability
The Arabic Financial Sentiment Corpus (AFSC) comprising 84,431 labeled samples will be released under Creative Commons Attribution 4.0 International License upon acceptance. The dataset includes preprocessed Arabic text, five-class sentiment labels with conf...
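The production requirements listed in A.2 lend themselves to simple automated checks. The sketch below is illustrative only: the JSON field names `sentiment` and `confidence`, the five label names, and the helper functions are assumptions, since the excerpt does not give the actual output schema. The numeric limits are the ones stated in A.2.

```python
import json

# Hypothetical label names; the source only says there are five classes.
ALLOWED_LABELS = {
    "very_negative", "negative", "neutral", "positive", "very_positive",
}

MAX_COST_PER_SAMPLE_USD = 0.0012      # stated cost ceiling per sample
MAX_SECONDS_PER_1000_SAMPLES = 300.0  # 5 minutes per 1,000 samples


def is_valid_prediction(raw):
    """Check one model response for taxonomy and structured-output compliance."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(obj, dict):
        return False
    label = obj.get("sentiment")
    confidence = obj.get("confidence")
    if not isinstance(label, str) or label not in ALLOWED_LABELS:
        return False
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        return False
    return 0.0 <= confidence <= 1.0


def cost_ceiling_usd(n_samples):
    """Upper bound on total inference cost implied by the per-sample limit."""
    return n_samples * MAX_COST_PER_SAMPLE_USD


def latency_ceiling_seconds(n_samples):
    """Upper bound on total inference time implied by the latency limit."""
    return n_samples / 1000 * MAX_SECONDS_PER_1000_SAMPLES
```

Applied to the full 84,431-sample corpus, the stated per-sample limit implies a cost ceiling of roughly $101.32, and the latency limit implies about 25,329 seconds (a little over seven hours) of sequential inference.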