HalluScore is a curated Arabic QA dataset with 827 questions, ground-truth evidence, and human annotations used to measure hallucination rates across 17 LLMs.
Jais and jais-chat: Arabic-centric foundation and instruction-tuned open gener- ative large language models
7 Pith papers cite this work. Polarity classification is still indexing.
fields
cs.CL 7representative citing papers
XL-SafetyBench is a new cross-cultural benchmark showing frontier LLMs decouple jailbreak robustness from cultural sensitivity while local models trade off attack success against neutral-safe rates in a near-linear pattern indicating generation failure rather than alignment.
LQM introduces a six-level linguistically motivated error taxonomy for MT evaluation and applies it via expert annotation to LLM outputs on a new 3,850-sentence multi-dialect Arabic corpus.
ArabCulture-Dialogue dataset shows LLMs perform worse on dialectal Arabic than Modern Standard Arabic across cultural reasoning, translation, and generation tasks.
Translating unsafe inputs to low-resource languages jailbreaks GPT-4 at rates on par with or exceeding state-of-the-art attacks.
Introduces a multi-stage Arabic financial sentiment pipeline that produces an 84K-sample corpus for company-level analysis tied to Saudi stock market behavior.
Residual-stream noise injection raises narrative diversity in Arabic educational stories while preserving reading-grade level, outperforming high-temperature sampling across five 7-9B models.
citing papers explorer
-
HalluScore: Large Language Model Hallucination Question Answering Benchmark
HalluScore is a curated Arabic QA dataset with 827 questions, ground-truth evidence, and human annotations used to measure hallucination rates across 17 LLMs.
-
XL-SafetyBench: A Country-Grounded Cross-Cultural Benchmark for LLM Safety and Cultural Sensitivity
XL-SafetyBench is a new cross-cultural benchmark showing frontier LLMs decouple jailbreak robustness from cultural sensitivity while local models trade off attack success against neutral-safe rates in a near-linear pattern indicating generation failure rather than alignment.
-
LQM: Linguistically Motivated Multidimensional Quality Metrics for Machine Translation
LQM introduces a six-level linguistically motivated error taxonomy for MT evaluation and applies it via expert annotation to LLM outputs on a new 3,850-sentence multi-dialect Arabic corpus.
-
Cultural Benchmarking of LLMs in Standard and Dialectal Arabic Dialogues
ArabCulture-Dialogue dataset shows LLMs perform worse on dialectal Arabic than Modern Standard Arabic across cultural reasoning, translation, and generation tasks.
-
Low-Resource Languages Jailbreak GPT-4
Translating unsafe inputs to low-resource languages jailbreaks GPT-4 at rates on par with or exceeding state-of-the-art attacks.
-
LLM-Based Financial Sentiment Analysis in Arabic: Evidence from Saudi Markets
Introduces a multi-stage Arabic financial sentiment pipeline that produces an 84K-sample corpus for company-level analysis tied to Saudi stock market behavior.
-
Noise Steering for Controlled Text Generation: Improving Diversity and Reading-Level Fidelity in Arabic Educational Story Generation
Residual-stream noise injection raises narrative diversity in Arabic educational stories while preserving reading-grade level, outperforming high-temperature sampling across five 7-9B models.