arxiv: 2605.01323 · v2 · submitted 2026-05-02 · 💻 cs.CL · cs.AI

SiNFluD: Creating and Evaluating Figurative Language Dataset for Sindhi

Wazir Ali, Adeeb Noor, Saifullah Tumrani

Pith reviewed 2026-05-12 01:41 UTC · model grok-4.3

keywords: Sindhi, figurative language, dataset, text classification, multilingual models, annotation, low-resource NLP

The pith

The paper introduces SiNFluD as a benchmark dataset for classifying figurative language in Sindhi texts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces SiNFluD, a new dataset to support classification of figurative language in Sindhi. The authors gather raw texts from blogs, social media platforms, and literary sources, then have two native speakers annotate the material with the Doccano annotation tool. The annotations reach an inter-annotator agreement of 0.81, and the authors run 5-fold and 10-fold cross-validation to set baselines. They test mBERT, XLM-RoBERTa, and XLM-RoBERTa-XL, along with SetFit for few-shot fine-tuning, and find that XLM-RoBERTa-XL performs best on the task. The resource fills a gap for a low-resource language where figurative expressions are common but lack dedicated benchmarks.

Core claim

We introduce SiNFluD, a novel benchmark dataset for Sindhi figurative language classification. We first collect raw text from various blogs, social media platforms, and literary sources, and subsequently prepare the corpus for annotation. Two native annotators label the data using the Doccano text annotation tool, achieving an inter-annotator agreement of 0.81. We then establish baseline results using 5-fold and 10-fold cross-validation. Finally, we evaluate mBERT, XLM-RoBERTa, and XLM-RoBERTa-XL models, along with SetFit for few-shot fine-tuning of sentence transformers. Among these, the pretrained XLM-RoBERTa-XL achieves the best performance.
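
The inter-annotator agreement figure can be made concrete with a small sketch. The text here does not name the agreement statistic behind the 0.81, so Cohen's kappa for two annotators is an assumption, and the label lists are toy values rather than SiNFluD data. The function name `cohen_kappa` is illustrative only.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: fraction of items given identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a.keys() | freq_b.keys()) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)

# Toy labels (hypothetical, not from SiNFluD): 0 = literal, 1 = figurative.
annotator_1 = [1, 1, 0, 0, 1, 0, 1, 0, 1, 0]
annotator_2 = [1, 1, 0, 0, 1, 0, 1, 0, 0, 0]
kappa = cohen_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.2f}")
```

In this toy case the annotators agree on 9 of 10 items and chance agreement is 0.5, so kappa comes out at 0.80. Kappa corrects raw agreement for the label balance, which is why the class distribution matters when reading a figure such as 0.81.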

What carries the argument

The SiNFluD dataset of annotated Sindhi texts, paired with cross-validation baselines and evaluations of pretrained multilingual transformer models.

If this is right

Future Sindhi NLP systems can measure progress against the reported cross-validation baselines for figurative language detection.
Pretrained multilingual models such as XLM-RoBERTa-XL provide a strong starting point for low-resource figurative language tasks.
The 0.81 inter-annotator agreement supplies evidence that native-speaker labeling can produce usable training data for this language.
The few-shot SetFit results indicate viable paths when only small amounts of labeled Sindhi data are available.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same collection and annotation approach could extend to other figurative phenomena such as idioms or sarcasm in Sindhi.
Improved figurative language classifiers might support downstream applications like literary analysis or social-media monitoring in Sindhi.
The pattern of larger multilingual models outperforming smaller ones suggests similar gains could appear in related low-resource language tasks.

Load-bearing premise

The collected texts from blogs, social media, and literary sources, together with the two-annotator labels, sufficiently represent typical Sindhi figurative language use.

What would settle it

A fresh collection of Sindhi texts annotated independently by new native speakers that produces agreement below 0.7 or reverses the model ranking would undermine the dataset as a stable benchmark.

Original abstract

In this article, we introduce SiNFluD, a novel benchmark dataset for Sindhi figurative language classification. We first collect raw text from various blogs, social media platforms, and literary sources, and subsequently prepare the corpus for annotation. Two native annotators label the data using the Doccano text annotation tool, achieving an inter-annotator agreement of 0.81. We then establish baseline results using 5-fold and 10-fold cross-validation. Finally, we evaluate mBERT, XLM-RoBERTa, and XLM-RoBERTa-XL models, along with SetFit for few-shot fine-tuning of sentence transformers. Among these, the pretrained XLM-RoBERTa-XL achieves the best performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces SiNFluD, a novel benchmark dataset for Sindhi figurative language classification. Raw texts are collected from blogs, social media, and literary sources, annotated by two native speakers via Doccano yielding IAA=0.81, and used to establish baselines via 5-fold/10-fold cross-validation plus evaluations of mBERT, XLM-RoBERTa, XLM-RoBERTa-XL, and SetFit, with XLM-RoBERTa-XL reported as the strongest model.

Significance. A well-documented Sindhi figurative-language dataset with reliable annotations would be a useful addition to low-resource NLP resources, particularly for non-literal language understanding. The reported IAA of 0.81 provides moderate evidence of annotation quality, but the absence of dataset statistics and concrete metrics prevents any assessment of whether the benchmark is practically usable or whether the model ranking is robust.

major comments (2)

[Abstract] Abstract: no total number of annotated instances, no class distribution or label set (binary figurative/literal vs. multi-class figures of speech), and no numerical performance scores (accuracy or F1) from the cross-validation or model runs are supplied. These quantities are required to evaluate the central claim that SiNFluD constitutes a usable benchmark and that XLM-RoBERTa-XL is the best model.
[Evaluation] Evaluation and results sections: without reported dataset size, split details, or concrete metrics, the statements that 5-fold/10-fold CV and SetFit baselines were established and that XLM-RoBERTa-XL outperforms the other models cannot be verified for statistical reliability or sensitivity to imbalance, which is especially critical for a low-resource language.
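
The concern about imbalance sensitivity can be illustrated with a minimal sketch of stratified k-fold evaluation that reports both accuracy and macro-F1. Everything here is assumed for illustration: the 70/30 class split, the majority-class baseline standing in for a real classifier, and the helper names `stratified_folds`, `accuracy`, and `macro_f1`. None of it reproduces the paper's setup or numbers.

```python
import random
from collections import defaultdict

def stratified_folds(labels, k, seed=0):
    """Assign each index to one of k folds, keeping class ratios roughly equal."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal indices round-robin so each fold gets a near-equal share of the class.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds

def accuracy(gold, pred):
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for cls in sorted(set(gold) | set(pred)):
        tp = sum(g == cls and p == cls for g, p in zip(gold, pred))
        fp = sum(g != cls and p == cls for g, p in zip(gold, pred))
        fn = sum(g == cls and p != cls for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Toy imbalanced labels (hypothetical): 0 = literal, 1 = figurative.
labels = [0] * 70 + [1] * 30
folds = stratified_folds(labels, k=5)

fold_scores = []
for test_idx in folds:
    held_out = set(test_idx)
    train = [labels[i] for i in range(len(labels)) if i not in held_out]
    # Majority-class baseline stands in for a real classifier.
    majority = max(set(train), key=train.count)
    gold = [labels[i] for i in test_idx]
    pred = [majority] * len(gold)
    fold_scores.append((accuracy(gold, pred), macro_f1(gold, pred)))

mean_acc = sum(a for a, _ in fold_scores) / len(fold_scores)
mean_f1 = sum(f for _, f in fold_scores) / len(fold_scores)
print(f"accuracy = {mean_acc:.3f}, macro-F1 = {mean_f1:.3f}")
```

On this toy split the trivial baseline reaches 0.700 accuracy but only about 0.412 macro-F1. An accuracy figure reported without the class distribution and a per-class metric therefore says little about how well the minority class is handled.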

minor comments (1)

[Data collection] The data-collection description lists sources but does not quantify how many texts came from each source or how duplicates and noise were handled; adding these details would improve reproducibility.
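
One way such reporting could look is sketched here: exact-duplicate removal after light normalization, followed by a per-source count of the texts that remain. This is not the authors' cleaning procedure, which is not described on this page. The source tags, the placeholder records, and the helper names `normalize` and `deduplicate` are all hypothetical.

```python
import unicodedata
from collections import Counter

def normalize(text):
    """Canonical form used only as a key for duplicate detection."""
    # NFKC folds compatibility variants; split/join collapses whitespace runs.
    return " ".join(unicodedata.normalize("NFKC", text).split())

def deduplicate(records):
    """Keep the first occurrence of each normalized text; count kept items per source."""
    seen = set()
    kept = []
    for source, text in records:
        key = normalize(text)
        if not key or key in seen:
            continue  # skip empty strings and repeats
        seen.add(key)
        kept.append((source, text))
    per_source = Counter(source for source, _ in kept)
    return kept, per_source

# Placeholder records (hypothetical, not SiNFluD data).
records = [
    ("blog", "sample sentence one"),
    ("social", "sample  sentence one "),  # duplicate up to whitespace
    ("literary", "sample sentence two"),
    ("social", "   "),                    # empty after normalization
]
kept, per_source = deduplicate(records)
print(len(kept), dict(per_source))
```

The original text is kept unchanged and only the lookup key is normalized, so the cleaning step does not alter what annotators later see.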

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We agree that the abstract and evaluation sections would benefit from additional quantitative details to better support the claims regarding the benchmark's usability and the model comparisons. We will revise the manuscript to address these points directly.

Point-by-point responses

Referee: [Abstract] Abstract: no total number of annotated instances, no class distribution or label set (binary figurative/literal vs. multi-class figures of speech), and no numerical performance scores (accuracy or F1) from the cross-validation or model runs are supplied. These quantities are required to evaluate the central claim that SiNFluD constitutes a usable benchmark and that XLM-RoBERTa-XL is the best model.

Authors: We accept this observation. The abstract as currently written does not contain these specific quantities. In the revised version, we will expand the abstract to report the total number of annotated instances, the label set, the class distribution, and the key numerical performance scores (accuracy and F1) obtained from the cross-validation experiments and model evaluations.
Revision: yes

Referee: [Evaluation] Evaluation and results sections: without reported dataset size, split details, or concrete metrics, the statements that 5-fold/10-fold CV and SetFit baselines were established and that XLM-RoBERTa-XL outperforms the other models cannot be verified for statistical reliability or sensitivity to imbalance, which is especially critical for a low-resource language.

Authors: We agree that the evaluation and results sections require more explicit reporting to enable verification. We will revise these sections to include the dataset size, details on the cross-validation splits, the concrete accuracy and F1 scores for each model (mBERT, XLM-RoBERTa, XLM-RoBERTa-XL, and SetFit), and discussion of any class imbalance effects to substantiate the reported model ranking and baseline establishment.
Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical dataset creation and standard model baselines

The paper introduces SiNFluD via text collection from blogs, social media, and literary sources, annotation by two native speakers using Doccano (IAA=0.81), 5-fold and 10-fold CV baselines, and evaluation of mBERT and XLM-RoBERTa variants plus SetFit. It contains no equations, derivations, fitted parameters renamed as predictions, or load-bearing self-citations supporting a uniqueness claim. All steps are externally verifiable data collection and off-the-shelf model runs; the central claim does not reduce to its inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical dataset creation and evaluation paper with no mathematical derivations, free parameters, axioms, or invented entities.

pith-pipeline@v0.9.0 · 5425 in / 1005 out tokens · 32215 ms · 2026-05-12T01:41:15.332349+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: We introduce SiNFluD, a novel benchmark dataset for Sindhi figurative language classification... achieving an inter-annotator agreement of 0.81... the pretrained XLM-RoBERTa-XL achieves the best performance.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: The annotated dataset consists of two main categories: literal (labeled as 0) and figurative (labeled as 1).

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
