arxiv: 2605.19714 · v1 · pith:PAYYMIOVnew · submitted 2026-05-19 · 💻 cs.CL

LLM-Based Financial Sentiment Analysis in Arabic: Evidence from Saudi Markets

Pith reviewed 2026-05-20 05:24 UTC · model grok-4.3

classification 💻 cs.CL
keywords Arabic NLP · financial sentiment analysis · Saudi stock market · entity linking · sentiment annotation · investor sentiment · social media analysis · news corpus

The pith

A multi-stage pipeline builds an 84,000-sample Arabic financial sentiment dataset supporting company-level analysis on the Saudi Exchange.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops an Arabic NLP framework for financial sentiment analysis tailored to Saudi markets by combining official financial news and social media data. It uses a multi-stage process of collection, cleaning, deduplication, entity linking via transformer NER and lexicon, and five-class sentiment annotation to create a large corpus. The resulting 84K samples enable aggregation of sentiment at the company level and examination of how sentiment relates to stock market behavior. Experimental results indicate that this approach delivers reliable and scalable sentiment analysis in Arabic financial contexts.

Core claim

The authors integrate official financial news and social media through a multi-stage pipeline: data collection, cleaning, deduplication, entity linking (transformer-based NER plus a curated company lexicon), and five-class sentiment annotation. The resulting 84K-sample dataset supports company-level sentiment aggregation and analysis of sentiment dynamics relative to stock market behavior on the Saudi Exchange, and the experiments are reported to demonstrate reliable and scalable Arabic financial sentiment analysis.
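
The cleaning and deduplication stages are only named in the paper's summary, not specified. As a minimal sketch of what such a stage might look like, the following removes exact duplicates after light normalization. The normalization rules (stripping Arabic diacritics and tatweel, collapsing whitespace) and the function names are assumptions for illustration, not the authors' method.

```python
import hashlib
import re

# Assumed cleaning rule: drop Arabic diacritics (U+064B..U+0652) and tatweel (U+0640).
_DIACRITICS = re.compile(r"[\u064B-\u0652\u0640]")


def normalize(text: str) -> str:
    """Strip diacritics and tatweel, then collapse runs of whitespace."""
    text = _DIACRITICS.sub("", text)
    return re.sub(r"\s+", " ", text).strip()


def deduplicate(samples: list[str]) -> list[str]:
    """Keep the first occurrence of each sample, comparing normalized text."""
    seen: set[str] = set()
    unique: list[str] = []
    for sample in samples:
        # Hash the normalized form so the 'seen' set stays small for long texts.
        key = hashlib.sha256(normalize(sample).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(sample)
    return unique
```

This only catches duplicates that become identical after normalization; near-duplicate detection (for example, reposts with small edits) would need a similarity-based method instead.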

What carries the argument

The multi-stage pipeline for Arabic financial corpus construction, with transformer-based NER for entity linking to canonical company identifiers combined with five-class sentiment labeling.
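
To make the lexicon half of the entity-linking step concrete, here is a minimal sketch that maps company aliases to canonical identifiers by normalized substring match. The lexicon entries, the identifiers, and the normalization rules are hypothetical. The transformer-based NER component is not reproduced here; in a fuller pipeline the NER spans, rather than the whole text, would presumably be the candidates looked up in the lexicon.

```python
import re

# Assumed normalization: drop diacritics/tatweel and unify alef variants.
_DIACRITICS = re.compile(r"[\u064B-\u0652\u0640]")
_ALEF_VARIANTS = re.compile("[\u0622\u0623\u0625]")


def normalize(text: str) -> str:
    """Normalize Arabic text so that spelling variants of an alias compare equal."""
    text = _DIACRITICS.sub("", text)
    return _ALEF_VARIANTS.sub("\u0627", text)


# Hypothetical curated lexicon: alias -> canonical company identifier.
LEXICON = {
    "أرامكو": "2222",
    "أرامكو السعودية": "2222",
    "مصرف الراجحي": "1120",
    "الراجحي": "1120",
}
_NORMALIZED_LEXICON = {normalize(alias): cid for alias, cid in LEXICON.items()}


def link_companies(text: str) -> set[str]:
    """Return the canonical identifiers of every lexicon alias found in the text."""
    normalized = normalize(text)
    return {cid for alias, cid in _NORMALIZED_LEXICON.items() if alias in normalized}
```

Plain substring matching can over-match when an alias is also a common word, which is one reason a lexicon is usually paired with a model-based mention detector.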

If this is right

  • Sentiment aggregation becomes possible at the level of individual companies listed on the Saudi Exchange.
  • Sentiment dynamics can be tracked over time in relation to actual stock market movements.
  • The framework provides a scalable method for financial sentiment analysis in Arabic without relying solely on English resources.
  • Both institutional investor sentiment from news and public sentiment from social media can be captured and compared.
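
The company-level aggregation in the first two points can be sketched as a per-company, per-day mean of numeric sentiment scores. The label names, the -2 to +2 scoring, and the record layout below are assumptions; the summary does not state how the paper maps its five classes to numbers or which aggregation window it uses.

```python
from collections import defaultdict
from statistics import mean

# Assumed five-class label names and an assumed symmetric numeric scale.
SCORES = {
    "very_negative": -2,
    "negative": -1,
    "neutral": 0,
    "positive": 1,
    "very_positive": 2,
}


def daily_company_sentiment(records):
    """Average sentiment per (company_id, date).

    records: iterable of (company_id, date_string, label) tuples.
    Returns a dict mapping (company_id, date_string) to the mean score.
    """
    buckets = defaultdict(list)
    for company_id, date, label in records:
        buckets[(company_id, date)].append(SCORES[label])
    return {key: mean(values) for key, values in buckets.items()}
```

The resulting daily series could then be aligned with price or return data for the same company and date, which is the kind of comparison the second point describes.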

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Such a dataset could enable the development of Arabic-specific predictive models for stock price movements based on sentiment signals.
  • Similar pipelines could be applied to other Arabic financial markets to build comparable resources.
  • The work underscores the importance of language-specific entity linking and annotation for accurate sentiment in financial texts.

Load-bearing premise

The multi-stage pipeline including automated entity linking and sentiment annotation produces labels that truly represent investor sentiment in Arabic financial texts.
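
The figure excerpts on this page mention a multi-stage consensus labeling process based on model comparison and agreement, without giving the rule. One simple rule of that kind is a strict majority vote across models, sketched below. The vote threshold and the choice to return `None` on ties are assumptions, not the paper's documented procedure.

```python
from collections import Counter


def consensus_label(votes, min_agreement=2):
    """Return the label chosen by most models, or None when there is no clear winner.

    votes: non-empty list of labels, one per model.
    min_agreement: minimum number of models that must agree (assumed threshold).
    """
    counts = Counter(votes).most_common()
    top_label, top_count = counts[0]
    # A tie for first place means the models disagree too much to pick a label.
    tied = len(counts) > 1 and counts[1][1] == top_count
    if top_count < min_agreement or tied:
        return None
    return top_label
```

Samples that come back as `None` are the natural candidates for a further stage, such as a stronger model or manual review.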

What would settle it

If a random sample of the dataset is manually labeled by Arabic-speaking financial experts and shows substantial disagreement with the automated five-class labels, that would undermine the claim of reliable analysis.
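
A standard way to quantify that disagreement is Cohen's kappa between the expert labels and the automated labels on the same sample. The sketch below implements the usual definition, kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and p_e is agreement expected by chance. The function name and the example labels are illustrative.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length label sequences over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items where both raters give the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)
```

For ordered classes such as a five-point sentiment scale, a weighted kappa that penalizes distant disagreements more than adjacent ones may be more informative than this unweighted form.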

Figures

Figures reproduced from arXiv: 2605.19714 by Eman M. Albalkhi, Enrico Lopedoto, Joud A. Albaiti, Mona H. Albaqawi.

Figure 1. [PITH_FULL_IMAGE:figures/full_fig_p002_1.png]

Figure 3. Hallucination distributions across datasets: News (left) and Social Media (right).
    … [5]. Given the dataset scale (84K samples), manual annotation was impractical. To mitigate bias and enhance reliability, a multi-stage automated labeling framework was employed, progressively refining label quality through model comparison and agreement analysis. The final labeling strategy relied on multiple high-capacit…

Figure 4. Correlation matrix of sentiment outputs across evaluated models on the News dataset, highlighting inter-model agreement patterns. [PITH_FULL_IMAGE:figures/full_fig_p004_4.png]

Figure 5. [PITH_FULL_IMAGE:figures/full_fig_p005_5.png]
    Figure 5 illustrates the multi-stage consensus labeling process adopted in this study. 6. Results and Discussion. 6.1. Benchmark Results. Models are evaluated using Accuracy and Macro-F1 for class-balanced robustness

Figure 6.
    Figure 6 summarizes the cost–quality trade-off across the evaluated models using Macro-F1 as a class-balanced performance metric. Statistical Significance Testing. Paired t-tests confirm performance differences are statistically significant: GPT-5’s Macro-F1 (0.829) exceeds DeepSeek R1 Reasoner (0.739) at p < 0.01 (t = 3.47), and DeepSeek R1 Reasoner outperforms DeepSeek R1 Chat (0.360) at p < 0.001 (t = 8.92), conf…
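
The two quantities named in the Figure 6 excerpt, Macro-F1 and the paired t statistic, can be computed as in the sketch below. It is a plain-Python illustration of the standard definitions. The excerpt does not say what the paired observations are (for example, per-fold or per-subset scores), so the inputs to `paired_t` are an assumption.

```python
from math import sqrt
from statistics import mean, stdev


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label seen in either sequence."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    f1_scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        # F1 = 2TP / (2TP + FP + FN); the denominator is non-zero for seen labels.
        f1_scores.append(2 * tp / (2 * tp + fp + fn))
    return mean(f1_scores)


def paired_t(scores_a, scores_b):
    """Paired t statistic: mean difference divided by its standard error."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))
```

Because Macro-F1 weights every class equally, a model that collapses rare classes is penalized even when its accuracy looks acceptable. Converting the t statistic to a p-value additionally requires the t distribution with n - 1 degrees of freedom, which is not in the Python standard library.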
original abstract

Investor sentiment shapes financial markets, yet modeling sentiment in Arabic financial contexts remains challenging due to linguistic complexity and limited resources. We present an Arabic NLP framework for large-scale financial sentiment analysis tailored to the Saudi market, integrating official financial news and social media to capture institutional and public investor sentiment. The framework constructs a large Arabic financial corpus through a multi-stage pipeline encompassing data collection, cleaning, deduplication, entity linking, and sentiment annotation. Transformer-based NER combined with a curated company lexicon links textual mentions to canonical company identifiers, with sentiment labels assigned using a five-class scheme. The resulting dataset of 84K samples supports company-level sentiment aggregation and analysis of sentiment dynamics relative to stock market behavior on the Saudi Exchange. Experimental results demonstrate reliable and scalable Arabic financial sentiment analysis.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper presents an Arabic NLP framework for large-scale financial sentiment analysis tailored to Saudi markets. It describes a multi-stage pipeline for constructing an 84K-sample dataset from official financial news and social media, using transformer-based NER combined with a company lexicon for entity linking and a five-class scheme for sentiment annotation. The resulting dataset is positioned to enable company-level sentiment aggregation and analysis of sentiment dynamics relative to stock market behavior on the Saudi Exchange, with experimental results claimed to demonstrate reliable and scalable Arabic financial sentiment analysis.

Significance. If the label quality were demonstrated, the work would offer a substantial empirical contribution by filling a resource gap in Arabic financial NLP and enabling new analyses of investor sentiment in an emerging market. The scale of the 84K dataset and the integration of institutional and public sources represent a clear strength in data-construction efforts.

major comments (2)
  1. Abstract and §3 (Methodology): The central claim that the 84K-sample dataset 'supports company-level sentiment aggregation and analysis of sentiment dynamics' and yields 'reliable' results rests on the unverified accuracy of the five-class sentiment annotation step. No accuracy, F1-score, inter-annotator agreement, or expert validation metrics are reported for this component, leaving the downstream aggregation and correlation analyses without grounding.
  2. §4 (Experiments): The assertion of 'reliable and scalable' performance is stated without any baseline comparisons, error analysis, or quantitative evaluation of the full pipeline on held-out data, which is load-bearing for the claim that the framework advances Arabic financial sentiment analysis.
minor comments (1)
  1. [§3] The description of the five-class sentiment scheme would benefit from an explicit definition or example labels in the text or a table.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We are grateful to the referee for their thorough review and constructive feedback on our manuscript. We address each major comment below and outline the revisions we plan to make to strengthen the presentation of our work.

point-by-point responses
  1. Referee: [Abstract and §3] Abstract and §3 (Methodology): The central claim that the 84K-sample dataset 'supports company-level sentiment aggregation and analysis of sentiment dynamics' and yields 'reliable' results rests on the unverified accuracy of the five-class sentiment annotation step. No accuracy, F1-score, inter-annotator agreement, or expert validation metrics are reported for this component, leaving the downstream aggregation and correlation analyses without grounding.

    Authors: We agree that quantitative validation of the sentiment annotation step is necessary to ground the downstream claims. The manuscript describes the five-class scheme and its integration into the pipeline but does not include accuracy, F1, or inter-annotator agreement figures. In the revised version we will add a dedicated subsection reporting inter-annotator agreement computed on a stratified sample of annotations, together with expert validation results on a held-out subset, thereby providing the required empirical support for the company-level aggregation analyses.
    revision: yes

  2. Referee: [§4] §4 (Experiments): The assertion of 'reliable and scalable' performance is stated without any baseline comparisons, error analysis, or quantitative evaluation of the full pipeline on held-out data, which is load-bearing for the claim that the framework advances Arabic financial sentiment analysis.

    Authors: The current §4 presents the results of applying the pipeline at scale and initial sentiment-market correlations, yet we acknowledge the absence of explicit baselines, error analysis, and held-out quantitative evaluation. We will revise the section to include (i) comparisons against existing Arabic sentiment baselines, (ii) a detailed error analysis of the full pipeline, and (iii) performance metrics on a held-out test partition, thereby more rigorously substantiating the claims of reliability and scalability.
    revision: yes

Circularity Check

0 steps flagged

No circularity: empirical dataset construction without derivational reduction

full rationale

The paper presents an empirical multi-stage pipeline for data collection, cleaning, entity linking via transformer NER plus lexicon, and five-class sentiment annotation to produce an 84K-sample Arabic financial corpus. No equations, mathematical derivations, fitted parameters, or predictions are described that could reduce to inputs by construction. Claims about supporting company-level aggregation and sentiment dynamics analysis rest on the pipeline output and experimental results rather than any self-referential loop or self-citation load-bearing premise. This is self-contained empirical work with no load-bearing steps that match the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Only the abstract is available, so the full ledger cannot be audited. The approach implicitly relies on standard NLP assumptions about data representativeness and model accuracy for Arabic text.

axioms (2)
  • domain assumption: Transformer-based NER combined with a curated lexicon can reliably link Arabic textual mentions to canonical company identifiers.
    Invoked in the entity-linking stage of the pipeline.
  • domain assumption: Five-class sentiment annotation on the collected corpus accurately reflects investor sentiment.
    Central to labeling and downstream aggregation.

pith-pipeline@v0.9.0 · 5668 in / 1334 out tokens · 59995 ms · 2026-05-20T05:24:34.516021+00:00 · methodology

Reference graph

Works this paper leans on

27 extracted references · 27 canonical work pages · 1 internal anchor

  1. [1]

    ALLaM: Large language models for arabic and english

    Ahmed Abdelali, Maram Hasanain, Hamdy Mubarak, Laura Kallmeyer, Hassan Sajjad, Fahim Dalvi, et al. ALLaM: Large language models for arabic and english. arXiv preprint arXiv:2407.15390, 2024. SDAIA Arabic foundation model

  2. [2]

    Sentiment analysis of financial textual data using machine learning and deep learning models

    Hero O. Ahmad and Shahla U. Umar. Sentiment analysis of financial textual data using machine learning and deep learning models. Informatica, 47(5):153–158, 2023

  3. [3]

    Arahallueval: A fine-grained hallucination evaluation framework for arabic llms

    Aisha Alansari and Hamzah Luqman. Arahallueval: A fine-grained hallucination evaluation framework for arabic llms. arXiv preprint, 2025

  4. [4]

    Borsah: A disruptive framework for the stock market predictions

    Saad M Alshahrani, Said A Salloum, and Khaled Shaalan. Borsah: A disruptive framework for the stock market predictions. International Journal of Information Management, 41:117–129, 2018

  5. [5]

    Sentiment analysis in financial news: Enhancing predictive models for stock market behavior

    Martins Amola. Sentiment analysis in financial news: Enhancing predictive models for stock market behavior. Preprint, 2025. Available at ResearchGate

  6. [6]

    AraBERT: Transformer-based model for arabic language understanding

    Wissam Antoun, Fady Baly, and Hazem Hajj. AraBERT: Transformer-based model for arabic language understanding. In Proceedings of the 4th Workshop on Open-Source Arabic Corpora and Processing Tools (OSACT), pages 9–15. European Language Resources Association (ELRA), 2020

  7. [7]

    Finbert: Financial sentiment analysis with pre-trained language models

    Dogu Tan Araci. Finbert: Financial sentiment analysis with pre-trained language models. arXiv preprint, 2019

  8. [8]

    A light lexicon-based mobile application for sentiment mining of Arabic tweets

    Gilbert Badaro, Ramy Baly, Rana Akel, Linda Fayad, Jeffrey Khairallah, Hazem Hajj, Khaled Shaban, and Wassim El-Hajj. A light lexicon-based mobile application for sentiment mining of Arabic tweets. In Nizar Habash, Stephan Vogel, and Kareem Darwish, editors, Proceedings of the Second Workshop on Arabic Natural Language Processing, pages 18–25, Beiji...

  9. [9]

    Association for Computational Linguistics

  10. [10]

    A model of investor sentiment

    Nicholas Barberis, Andrei Shleifer, and Robert Vishny. A model of investor sentiment. Journal of financial economics, 49(3):307–343, 1997

  11. [11]

    Large language models as annotators: A preliminary evaluation for annotating low-resource language content

    Savita Bhat and Vasudeva Varma. Large language models as annotators: A preliminary evaluation for annotating low-resource language content. In Proceedings of the 4th Workshop on Evaluation and Comparison of NLP Systems. Association for Computational Linguistics, 2023

  12. [12]

    Financial sentiment analysis: Techniques and applications

    Kelvin Du, Frank Xing, Rui Mao, and Erik Cambria. Financial sentiment analysis: Techniques and applications. ACM Computing Surveys, 56(9):220, 2024

  13. [13]

    Arabic named entity recognition using deep learning approach

    Ismail El Bazi and Nabil Laachfoubi. Arabic named entity recognition using deep learning approach. International Journal of Electrical and Computer Engineering, 9(3):2025–2032, 2019

  14. [14]

    AceGPT: Localizing large language models in arabic

    Huang Huang, Fei Zhu, Jianfeng Qin, Yulei Tang, Xuebai Lin, Guo Liu, and Wei Wang. AceGPT: Localizing large language models in arabic. arXiv preprint arXiv:2309.12053,

  15. [15]

    Arabic-specialized instruction-tuned model

  16. [16]

    The interplay of variant, size, and task type in arabic pre-trained language models

    Go Inoue, Bashar Alhafni, Nurpeiis Baimukan, Houda Bouamor, and Nizar Habash. The interplay of variant, size, and task type in arabic pre-trained language models. In Proceedings of the Sixth Arabic Natural Language Processing Workshop (WANLP), pages 92–104. Association for Computational Linguistics, 2021. CAMeLBERT model family

  17. [17]

    Llms-as-judges: A comprehensive survey on llm-based evaluation methods, 2024

    Haitao Li, Qian Dong, Junjie Chen, Huixue Su, Yujia Zhou, Qingyao Ai, Ziyi Ye, and Yiqun Liu. Llms-as-judges: A comprehensive survey on llm-based evaluation methods, 2024

  18. [18]

    Jais and jais-chat: Arabic-centric foundation and instruction-tuned open generative large language models

    Neha Sengupta, Sunil Kumar Sharma, Muhammed Masoud, Abbas Akkasi, Karthik Kamur, Shivani Bhatia, Ebtesam Almazrouei, et al. Jais and jais-chat: Arabic-centric foundation and instruction-tuned open generative large language models. arXiv preprint arXiv:2308.16149, 2023. 13B parameter Arabic-centric LLM from Inception/G42

  19. [19]

    Big data: Deep learning for financial sentiment analysis

    Sahar Sohangir, Dingding Wang, Anna Pomeranets, and Taghi M Khoshgoftaar. Big data: Deep learning for financial sentiment analysis. Journal of Big Data, 5(1):1–25, 2018

  20. [20]

    Self-Consistency Improves Chain of Thought Reasoning in Language Models

    Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain-of-thought reasoning in language models. arXiv preprint arXiv:2203.11171, 2023

  21. [21]

    Bloomberggpt: A large language model for finance

    Shijie Wu, Ozan Irsoy, Steven Lu, Vadim Dabravolski, Mark Dredze, Sebastian Gehrmann, Prabhanjan Kambadur, David Rosenberg, and Gideon Mann. Bloomberggpt: A large language model for finance. arXiv preprint, 2023.
    A. Reproducibility. A.1. Model Configurations. All models were configured with deterministic sampling (temperature = 0.0) to ensure reproducibil...

  22. [22]

    were not evaluated due to API availability constraints during the evaluation period. A.2. Production Deployment Requirements. Beyond benchmark metrics, models must satisfy the following requirements for production integration:

  23. [23]

    Taxonomy Compliance: Output exactly five sentiment classes without category collapse

  24. [24]

    Structured Output: Return JSON format with sentiment labels and confidence scores

  25. [25]

    Reproducibility: Generate identical predictions with deterministic sampling (temperature = 0)

  26. [26]

    Latency: Complete inference within 5 minutes per 1,000 samples

  27. [27]

    the stock is experiencing technical correction

    Cost Efficiency: Maintain inference cost below $0.0012 per sample. A.3. Dataset Availability. The Arabic Financial Sentiment Corpus (AFSC) comprising 84,431 labeled samples will be released under Creative Commons Attribution 4.0 International License upon acceptance. The dataset includes preprocessed Arabic text, five-class sentiment labels with conf...
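
The production requirements excerpted in items 23 to 27 (five-class taxonomy, JSON output with a confidence score, 5 minutes per 1,000 samples, under $0.0012 per sample) can be checked mechanically. The sketch below shows one way to do that. The JSON field names `sentiment` and `confidence`, the label names, and the 0 to 1 confidence range are assumptions, since the excerpt does not give the actual schema.

```python
import json

# Assumed names for the five sentiment classes.
ALLOWED_LABELS = {"very_negative", "negative", "neutral", "positive", "very_positive"}


def parse_prediction(raw: str):
    """Validate one model response and return (label, confidence).

    Raises ValueError if the response is not JSON, uses a label outside the
    five-class taxonomy, or lacks a numeric confidence in [0, 1].
    """
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(obj, dict):
        raise ValueError("expected a JSON object")
    label = obj.get("sentiment")
    confidence = obj.get("confidence")
    if not isinstance(label, str) or label not in ALLOWED_LABELS:
        raise ValueError(f"label outside taxonomy: {label!r}")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        raise ValueError("confidence must be a number")
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be in [0, 1]")
    return label, float(confidence)


def within_budget(total_cost_usd: float, n_samples: int, seconds: float) -> bool:
    """Check the quoted cost (< $0.0012/sample) and latency (<= 300 s per 1,000) limits."""
    cost_ok = total_cost_usd < 0.0012 * n_samples
    latency_ok = seconds * 1000 <= 300 * n_samples
    return cost_ok and latency_ok
```

For example, a run of 1,000 samples that costs $1.00 and takes 240 seconds passes both limits, while the same run at $1.50 fails the cost limit.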