Addressing Data Scarcity in Bangla Fake News Detection: An LLM-Based Dataset Augmentation Approach

2026 · cs.CL · arXiv 2605.01292

2 Pith papers cite this work.

abstract

The growing spread of misinformation in digital media highlights the need for reliable fake news detection systems, yet progress in under-resourced languages such as Bangla is limited by small and imbalanced datasets. This study investigates whether Large Language Model (LLM)-based augmentation can effectively address this limitation and improve Bangla fake news classification. Existing datasets remain valuable but are highly imbalanced, which limits model performance, and LLM-based augmentation for Bangla has scarcely been explored. To fill this gap, we propose a systematic augmentation framework that generates synthetic Bangla news articles using the instruction-tuned Gemma 3 27B IT model, supported by semantic filtering and controlled subsampling to preserve label consistency and diversity. We compare zero-shot and few-shot prompting, evaluate multiple augmentation rates, and examine random versus similarity-based selection strategies. Our experiments show that augmenting only the minority class with a high augmentation rate and random subsampling yields the strongest gains, raising the Fake News F1 score from 0.85 to 0.88. To support reproducibility and further research in this low-resource domain, we publicly release 4,545 synthetically generated Bangla fake news samples along with our full implementation. These findings demonstrate that well-designed LLM-driven augmentation can significantly improve fake news detection in low-resource settings and provide a practical foundation for advancing multilingual misinformation research.
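The filtering and subsampling steps named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. The function names, the thresholds `low` and `high`, and the reading of "augmentation rate" as a synthetic-to-real ratio are all assumptions made for illustration. A token-overlap (Jaccard) score stands in for whatever semantic similarity measure the paper uses, and the LLM generation step is omitted entirely.

```python
import random


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity; an assumed stand-in for a semantic score."""
    ta, tb = set(a.split()), set(b.split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def semantic_filter(candidates, references, low=0.1, high=0.9, sim=jaccard):
    """Keep candidates whose best match against real samples is on-topic
    (>= low) but not a near copy (<= high). Thresholds are hypothetical."""
    kept = []
    for text in candidates:
        best = max((sim(text, ref) for ref in references), default=0.0)
        if low <= best <= high:
            kept.append(text)
    return kept


def augment_minority(real_minority, synthetic, rate, seed=0):
    """Return the real minority samples plus a random subsample of the
    filtered synthetic pool. `rate` is assumed to mean the number of
    synthetic samples added per real minority sample."""
    pool = semantic_filter(synthetic, real_minority)
    k = min(len(pool), round(rate * len(real_minority)))
    rng = random.Random(seed)  # fixed seed keeps the subsample reproducible
    return list(real_minority) + rng.sample(pool, k)
```

Calling `augment_minority(real_fake_articles, generated_articles, rate=1.0)` would add up to one filtered synthetic article per real one, leaving the majority class untouched. Swapping `rng.sample` for a ranking by similarity score would give a similarity-based selection variant to compare against the random one.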

representative citing papers

BenHalluEval: A Multi-Task Hallucination Evaluation Framework for Large Language Models on Bengali

cs.CL · 2026-05-29 · unverdicted · novelty 7.0 · 2 refs

BenHalluEval is the first dedicated hallucination benchmark for Bengali LLMs, using dual-track evaluation and the BenHalluScore metric to reveal variation in model calibration across seven LLMs.

LinguIUTics at PsyDefDetect: Iterative Imbalance-Aware Fine-tuning of Qwen3-8B for Psychological Defense Mechanism Classification

cs.CL · 2026-05-30 · unverdicted · novelty 2.0

The LinguIUTics team applies QLoRA fine-tuning of Qwen3-8B with stratified cross-validation, minority lexical augmentation, logit bias tuning, and ensemble blending to achieve 0.3917 macro F1 (7.7 points above the Ministral-8B baseline) on PsyDefDetect 2026.
