pith. machine review for the scientific record.

arxiv: 2604.21421 · v1 · submitted 2026-04-23 · 💻 cs.CR · cs.AI · cs.CL

Recognition: unknown

Differentially Private De-identification of Dutch Clinical Notes: A Comparative Evaluation

Authors on Pith no claims yet

Pith reviewed 2026-05-09 21:42 UTC · model grok-4.3

classification 💻 cs.CR · cs.AI · cs.CL
keywords differential privacy · de-identification · clinical notes · Dutch · large language models · NER · privacy-utility trade-off

The pith

Combining differential privacy with LLM redaction improves the privacy-utility trade-off for Dutch clinical notes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper compares pure differential privacy, named entity recognition, large language models, and hybrid pipelines for removing identifying details from Dutch medical notes. Pure DP adds noise that substantially reduces how well the notes support later tasks such as recognizing medical entities or relations. Preprocessing the text first with NER or LLM redaction before the DP step preserves more of that usefulness while still providing privacy protection. The LLM preprocessing route shows the clearest gain in balancing the two goals, making automated de-identification more practical for sharing data under rules like GDPR and HIPAA.
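The leakage side of that trade-off can be made concrete with a toy metric (an illustrative simplification, not the paper's exact leakage definition): count how many gold-standard identifiers from the manual de-identification still appear verbatim in a pipeline's output.

```python
# Toy PII-leakage score: the fraction of gold identifiers (taken from the
# manually de-identified reference) that still appear verbatim in a
# pipeline's output. Real leakage metrics are fuzzier; this is a sketch.

def pii_leakage(deidentified_text: str, gold_pii: list[str]) -> float:
    if not gold_pii:
        return 0.0
    text = deidentified_text.lower()
    leaked = sum(1 for pii in gold_pii if pii.lower() in text)
    return leaked / len(gold_pii)

# Hypothetical Dutch note where a name was caught but a date and city were not.
note = "Patient [NAAM] was seen on 03-02-2024 in Amsterdam."
gold = ["Jan de Vries", "03-02-2024", "Amsterdam"]
print(round(pii_leakage(note, gold), 3))  # 2 of 3 identifiers survive: 0.667
```

Lower is better; a pure-DP pipeline can score well here while destroying the clinical content the downstream tasks need, which is why the paper pairs leakage with extrinsic utility.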

Core claim

The authors show that differential privacy mechanisms applied alone to Dutch clinical text cause large drops in utility on entity and relation classification tasks, while hybrid strategies that first redact protected information using NER or especially LLM-based methods before applying DP deliver markedly better privacy-utility trade-offs as measured by both leakage metrics and downstream task performance.

What carries the argument

Hybrid pipelines that apply linguistic preprocessing (NER or LLM redaction) before differential privacy mechanisms.
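A minimal sketch of that hybrid ordering, with all rules, vocabularies, and the noise step invented as stand-ins for the paper's NER/LLM modules and DP mechanisms (such as Metric-DP): a redaction pass removes obvious identifiers first, then a word-level randomized step perturbs what remains, with more noise as the privacy budget epsilon shrinks.

```python
import math
import random
import re

# Hypothetical rule-based "NER" cues standing in for a trained tagger or LLM.
DATE = re.compile(r"\b\d{2}-\d{2}-\d{4}\b")
NAME = re.compile(r"\b(?:dhr|mevr)\.\s+\w+\b")  # Dutch honorifics as a toy cue

# Toy synonym sets standing in for an embedding neighbourhood.
SYNONYMS = {"pijn": ["klachten", "ongemak"], "hoofd": ["schedel"]}

def redact(text: str) -> str:
    """Step 1: redact identifiers before any noise is added."""
    return NAME.sub("[NAAM]", DATE.sub("[DATUM]", text))

def dp_perturb(text: str, epsilon: float, rng: random.Random) -> str:
    """Step 2: exponential-mechanism flavoured word swap; smaller
    epsilon means a lower keep probability, hence more noise."""
    keep_p = math.exp(epsilon) / (math.exp(epsilon) + 1)
    out = []
    for word in text.split():
        alternatives = SYNONYMS.get(word)
        if alternatives and rng.random() > keep_p:
            word = rng.choice(alternatives)
        out.append(word)
    return " ".join(out)

raw = "dhr. Jansen meldt pijn aan het hoofd sinds 03-02-2024"
print(dp_perturb(redact(raw), epsilon=1.0, rng=random.Random(0)))
```

The ordering is the point: identifiers never reach the noisy step, so the DP budget is spent on the residual text rather than on masking names and dates.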

Load-bearing premise

That performance on entity and relation classification tasks accurately reflects the real-world usefulness of the de-identified notes for secondary healthcare research.

What would settle it

A follow-up evaluation that applies the same de-identified notes to an actual secondary research task such as outcome prediction and finds no utility advantage for the hybrid methods over pure DP.
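The extrinsic-utility side of such a test reduces to a familiar computation; the entity predictions below are invented for illustration, the idea being that a pipeline preserving more clinical content lets a downstream tagger recover more of the gold entities.

```python
# Micro-F1 over (span, label) entity predictions, the figure-of-merit the
# paper uses for its extrinsic tasks. Training and tagging are abstracted
# away; only the scoring is shown.

def micro_f1(gold: set, pred: set) -> float:
    if not gold or not pred:
        return 0.0
    tp = len(gold & pred)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(gold)
    return 2 * precision * recall / (precision + recall)

gold = {("aspirine", "DRUG"), ("hoofdpijn", "SYMPTOM"), ("1x daags", "DOSE")}
pred_hybrid = {("aspirine", "DRUG"), ("hoofdpijn", "SYMPTOM")}   # hypothetical
pred_pure_dp = {("aspirine", "DRUG"), ("rugpijn", "SYMPTOM")}    # hypothetical
print(micro_f1(gold, pred_hybrid), micro_f1(gold, pred_pure_dp))  # 0.8 0.4
```

A genuinely distant task such as outcome prediction would swap the entity sets for patient-level labels, but the comparison logic stays the same.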

Figures

Figures reproduced from arXiv: 2604.21421 by Ameen Abu-Hanna, Iacer Calixto, Michele Miranda, Nishant Mishra, Rachel Murphy, Sébastien Bratières, Xinlan Yan.

Figure 1
Figure 1: Overview of our comparative analysis. A raw document D_raw is de-identified using 5 different pipelines, which are evaluated against a manually de-identified version of the same document D_manual. We use a range of open-source and proprietary LLMs that vary in architecture and size in our experiments.
Figure 2
Figure 2: Comparison of privacy leakage across different de-identification pipelines and DP budgets (ϵ). This figure includes two DP mechanisms: RANTEXT and Metric-DP, each applied to three pipelines: P_DP, P_NER→DP, and P_LLM→DP. For P_LLM→DP, we use Deepseek-70B as the de-identification module as it performs the best in terms of privacy. Horizontal lines indicate non-DP baselines, including one NER-based pipeline (G…
Figure 3
Figure 3: PII leakage by pipeline and privacy budget.
Figure 4
Figure 4: Comparison of utility F1-score for the Entity Classification (EC) task across different de-identification pipelines and DP budgets (ϵ) (see …
Figure 6
Figure 6: Comparison of evaluation metrics includ…
read the original abstract

Protecting patient privacy in clinical narratives is essential for enabling secondary use of healthcare data under regulations such as GDPR and HIPAA. While manual de-identification remains the gold standard, it is costly and slow, motivating the need for automated methods that combine privacy guarantees with high utility. Most automated text de-identification pipelines employ named entity recognition (NER) to identify protected entities for redaction. Methods based on differential privacy (DP) provide formal privacy guarantees, and more recently large language models (LLMs) are also increasingly used for text de-identification in the clinical domain. In this work, we present the first comparative study of DP, NER, and LLMs for Dutch clinical text de-identification. We investigate these methods separately as well as hybrid strategies that apply NER or LLM preprocessing prior to DP, and assess performance in terms of privacy leakage and extrinsic evaluation (entity and relation classification). We show that DP mechanisms alone degrade utility substantially, but combining them with linguistic preprocessing, especially LLM-based redaction, significantly improves the privacy-utility trade-off.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 2 minor

Summary. This paper presents the first comparative evaluation of differentially private (DP) mechanisms, named entity recognition (NER), and large language model (LLM)-based methods for de-identifying Dutch clinical notes. It examines standalone approaches as well as hybrid pipelines that apply NER or LLM preprocessing before DP, measuring performance via privacy leakage metrics and extrinsic utility on entity and relation classification tasks. The central finding is that DP alone substantially degrades utility, whereas hybrid strategies—particularly LLM-based redaction followed by DP—yield a meaningfully better privacy-utility trade-off.

Significance. If the empirical results hold under scrutiny, the work supplies timely, language-specific evidence on practical de-identification strategies for Dutch clinical text, an under-studied setting relative to English. The hybrid LLM-DP approach is shown to mitigate the utility penalty of pure DP while retaining formal privacy guarantees, which could directly inform GDPR-compliant secondary-use pipelines in healthcare. The inclusion of extrinsic downstream tasks adds relevance beyond intrinsic privacy metrics, though the paper's own evaluation design limits the strength of claims about broader clinical utility.

major comments (1)
  1. Evaluation section (extrinsic tasks): The central claim that LLM preprocessing improves the privacy-utility trade-off rests on entity and relation classification performance. These tasks are semantically close to the NER/LLM redaction step itself, so measured gains may reflect task alignment rather than preserved semantic content for secondary clinical uses (e.g., cohort studies or outcome modeling). No results are reported on more distant tasks such as diagnosis prediction or temporal event extraction, leaving the generalizability of the improvement untested and weakening support for the headline conclusion.
minor comments (2)
  1. Abstract: The abstract states the evaluation approach and main finding but provides no quantitative results (e.g., specific privacy leakage rates, F1 scores, or DP parameters such as ε), making it difficult for readers to gauge the magnitude of the reported improvements without reading the full results section.
  2. Dataset and experimental details: The manuscript would benefit from an explicit table or subsection listing the Dutch clinical corpus size, number of notes, protected entity types, and the exact DP mechanisms and privacy budgets (ε, δ) used in each condition.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the single major comment below.

read point-by-point responses
  1. Referee: Evaluation section (extrinsic tasks): The central claim that LLM preprocessing improves the privacy-utility trade-off rests on entity and relation classification performance. These tasks are semantically close to the NER/LLM redaction step itself, so measured gains may reflect task alignment rather than preserved semantic content for secondary clinical uses (e.g., cohort studies or outcome modeling). No results are reported on more distant tasks such as diagnosis prediction or temporal event extraction, leaving the generalizability of the improvement untested and weakening support for the headline conclusion.

    Authors: We thank the referee for highlighting this important consideration. Entity and relation classification were chosen as extrinsic tasks because they are standard benchmarks in clinical NLP literature for assessing de-identification utility and because Dutch-annotated datasets are available for them, enabling direct comparison across methods. The relation classification task requires contextual inference and semantic linking beyond entity detection alone, offering evidence that hybrid LLM-DP approaches preserve more than surface-level information. We agree, however, that more distant tasks such as diagnosis prediction or temporal event extraction would better demonstrate generalizability to broader secondary uses. Such evaluations would require additional annotated data and resources beyond the current study scope. In the revised manuscript we will add an explicit limitations paragraph in the Discussion section that acknowledges the scope of the chosen tasks, qualifies the headline claims accordingly, and identifies these more distant tasks as valuable directions for future work. revision: partial

Circularity Check

0 steps flagged

No circularity: purely empirical comparative evaluation with independent benchmarks

full rationale

The paper conducts a comparative study of DP, NER, and LLM-based de-identification methods on Dutch clinical notes, measuring privacy leakage and utility via standard extrinsic tasks (entity and relation classification). No mathematical derivations, equations, fitted parameters, or predictions are present. Central claims rest on direct experimental results rather than any self-referential reduction, self-citation chains, or ansatz smuggling. Evaluations use established metrics and tasks that do not reduce to the preprocessing steps by construction. This is a standard empirical setup with no load-bearing circular elements.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Empirical evaluation relying on established privacy and NLP techniques without new theoretical constructs or fitted parameters.

axioms (2)
  • domain assumption Differential privacy mechanisms provide formal privacy guarantees when correctly implemented.
    The paper invokes standard DP theory for privacy claims.
  • domain assumption NER and LLM models can reliably identify protected health information in clinical text.
    Preprocessing effectiveness is assumed for hybrid strategies.
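The first axiom can be illustrated in miniature with textbook randomized response (not one of the paper's mechanisms): an ε-DP release changes its output distribution by at most a factor e^ε when the input changes.

```python
import math

# Randomized response for one private bit: report the true bit with
# probability e^eps / (e^eps + 1), the flipped bit otherwise. This is the
# classic epsilon-DP mechanism; we verify the defining probability-ratio
# bound directly on the output distributions.

def randomized_response(bit: int, epsilon: float) -> dict:
    """Output distribution over {0, 1} for the given true bit."""
    p_true = math.exp(epsilon) / (math.exp(epsilon) + 1)
    return {bit: p_true, 1 - bit: 1 - p_true}

eps = 1.0
dist0, dist1 = randomized_response(0, eps), randomized_response(1, eps)
for outcome in (0, 1):
    ratio = dist0[outcome] / dist1[outcome]
    # epsilon-DP: the ratio must lie in [e^-eps, e^eps] for every outcome.
    assert math.exp(-eps) - 1e-9 <= ratio <= math.exp(eps) + 1e-9
print("epsilon-DP bound holds for eps =", eps)
```

Text mechanisms such as Metric-DP generalize this bound by letting the permitted ratio scale with a distance between inputs, which is what makes word-level perturbation feasible.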

pith-pipeline@v0.9.0 · 5508 in / 1197 out tokens · 52740 ms · 2026-05-09T21:42:33.446641+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

299 extracted references · 25 canonical work pages · 3 internal anchors

  1. [1]

    Catalan Speecon database

    Speecon Consortium. Catalan Speecon database. 2011

  2. [2]

    The EMILLE/CIIL Corpus

    Anthony McEnery and others. The EMILLE/CIIL Corpus. 2004

  3. [3]

    The OrienTel Moroccan MCA (Modern Colloquial Arabic) database

    Khalid Choukri and Niklas Paullson. The OrienTel Moroccan MCA (Modern Colloquial Arabic) database. 2004

  4. [4]

    ItalWordNet v.2

    Roventini, Adriana and Marinelli, Rita and Bertagna, Francesca. ItalWordNet v.2

  5. [5]

    Differential Privacy

    Dwork, Cynthia. Differential Privacy. Proceedings of the 33rd International Colloquium on Automata, Languages and Programming (ICALP 2006), Part II. 2006. doi:10.1007/11787006_1

  6. [6]

    The Algorithmic Foundations of Differential Privacy

    Dwork, Cynthia and Roth, Aaron. The Algorithmic Foundations of Differential Privacy. Found. Trends Theor. Comput. Sci. 2014. doi:10.1561/0400000042

  7. [7]

    Automatic de-identification of textual documents in the electronic health record: a review of recent research

    Meystre, Stephane M. and Friedlin, F. Jeffrey and South, Brett R. and Shen, Shuying and Samore, Matthew H. Automatic de-identification of textual documents in the electronic health record: a review of recent research. BMC Medical Research Methodology. 2010. doi:10.1186/1471-2288-10-70

  8. [8]

    Broadening the Scope of Differential Privacy Using Metrics

    Broadening the Scope of Differential Privacy Using Metrics. International Symposium on Privacy Enhancing Technologies. 2013

  9. [9]

    Leveraging Hierarchical Representations for Preserving Privacy and Utility in Text

    Leveraging Hierarchical Representations for Preserving Privacy and Utility in Text. 2019 IEEE International Conference on Data Mining (ICDM). 2019

  10. [10]

    BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension

    Lewis, Mike and Liu, Yinhan and Goyal, Naman and Ghazvininejad, Marjan and Mohamed, Abdelrahman and Levy, Omer and Stoyanov, Veselin and Zettlemoyer, Luke. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2020

  11. [11]

    ADePT: Auto-encoder based Differentially Private Text Transformation

    ADePT: Auto-encoder based Differentially Private Text Transformation. Conference of the European Chapter of the Association for Computational Linguistics. 2021

  12. [12]

    InferDPT: Privacy-Preserving Inference for Black-box Large Language Model

    InferDPT: Privacy-Preserving Inference for Black-box Large Language Model. arXiv preprint. 2025

  13. [13]

    Differential Privacy for Text Analytics via Natural Text Sanitization

    Yue, Xiang and Du, Minxin and Wang, Tianhao and Li, Yaliang and Sun, Huan and Chow, Sherman S. M. Differential Privacy for Text Analytics via Natural Text Sanitization. Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021. 2021. doi:10.18653/v1/2021.findings-acl.337

  14. [14]

    A Customized Text Sanitization Mechanism with Differential Privacy

    Chen, Sai and Mo, Fengran and Wang, Yanhao and Chen, Cen and Nie, Jian-Yun and Wang, Chengyu and Cui, Jamie. A Customized Text Sanitization Mechanism with Differential Privacy. Findings of the Association for Computational Linguistics: ACL 2023. 2023. doi:10.18653/v1/2023.findings-acl.355

  15. [15]

    DeID-GPT: Zero-shot Medical Text De-Identification by GPT-4

    DeID-GPT: Zero-shot Medical Text De-Identification by GPT-4. arXiv preprint. 2023

  16. [16]

    Language Models are Few-Shot Learners

    Brown, Tom B. and others. Language Models are Few-Shot Learners. arXiv preprint. 2020

  17. [17]

    Annotating longitudinal clinical narratives for de-identification: The 2014 i2b2/UTHealth corpus

    Stubbs, Amber and Uzuner, Özlem. Annotating longitudinal clinical narratives for de-identification: The 2014 i2b2/UTHealth corpus. J Biomed Inform. 2015

  18. [18]

    Department of Health and Human Services

    U.S. Department of Health and Human Services. 45 CFR § 164.514 – de-identification of health information

  19. [19]

    European Parliament and Council of the European Union. Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data (General Data Protection Regulation). Official Journal of the European Union

  20. [20]

    Privacy- and Utility-Preserving Textual Analysis via Calibrated Multivariate Perturbations

    Privacy- and Utility-Preserving Textual Analysis via Calibrated Multivariate Perturbations. arXiv preprint. 2019

  21. [21]

    Creation of a gold standard Dutch corpus of clinical notes for adverse drug event detection: the Dutch ADE corpus

    Creation of a gold standard Dutch corpus of clinical notes for adverse drug event detection: the Dutch ADE corpus. Language Resources and Evaluation. 2025

  22. [22]

    An easy-to-use and robust approach for the differentially private de-identification of clinical textual documents

    An easy-to-use and robust approach for the differentially private de-identification of clinical textual documents. arXiv preprint arXiv:2211.01147. 2022

  23. [23]

    Data privacy in healthcare: Global challenges and solutions

    Data privacy in healthcare: Global challenges and solutions. Digital Health. 2025

  24. [24]

    Use and understanding of anonymization and de-identification in the biomedical literature: scoping review

    Use and understanding of anonymization and de-identification in the biomedical literature: scoping review. Journal of Medical Internet Research. 2019

  25. [25]

    BERTje: A Dutch BERT model

    BERTje: A Dutch BERT model. arXiv preprint arXiv:1912.09582. 2019

  26. [26]

    MedRoBERTa.nl: a language model for Dutch electronic health records

    MedRoBERTa.nl: a language model for Dutch electronic health records. Computational Linguistics in the Netherlands. 2021

  27. [27]

    As good as new. How to successfully recycle English GPT-2 to make models for other languages

    As good as new. How to successfully recycle English GPT-2 to make models for other languages. arXiv preprint. 2020

  28. [28]

    De-identification of patient notes with recurrent neural networks

    Dernoncourt, Franck and Lee, Ji Young and Uzuner, Ozlem and Szolovits, Peter. De-identification of patient notes with recurrent neural networks. Journal of the American Medical Informatics Association (JAMIA)

  29. [29]

    De-identification of Emergency Medical Records in French: Survey and Comparison of State-of-the-Art Automated Systems

    De-identification of Emergency Medical Records in French: Survey and Comparison of State-of-the-Art Automated Systems. The International FLAIRS Conference Proceedings. 2021. doi:10.32473/flairs.v34i1.128480

  30. [30]

    An Efficient Method for Deidentifying Protected Health Information in Chinese Electronic Health Records: Algorithm Development and Validation

    Wang, Peng and Li, Yong and Yang, Liang and Li, Simin and Li, Linfeng and Zhao, Zehan and Long, Shaopei and Wang, Fei and Wang, Hongqian and Li, Ying and Wang, Chengliang. An Efficient Method for Deidentifying Protected Health Information in Chinese Electronic Health Records: Algorithm Development and Validation. JMIR Med Inform

  31. [32]

    Enhancing text anonymization via re-identification risk-based explainability

    Manzanares-Salor, Benet and Sánchez, David. Enhancing text anonymization via re-identification risk-based explainability. Knowledge-Based Systems. 2025. doi:10.1016/j.knosys.2024.112945

  32. [33]

    C-sanitized: A privacy model for document redaction and sanitization

    Sánchez, David and Batet, Montserrat. C-sanitized: A privacy model for document redaction and sanitization. J. Assoc. Inf. Sci. Technol. 2016. doi:10.1002/asi.23363

  33. [34]

    Synthetic Text Generation with Differential Privacy: A Simple and Practical Recipe

    Yue, Xiang and Inan, Huseyin and Li, Xuechen and Kumar, Girish and McAnallen, Julia and Shajari, Hoda and Sun, Huan and Levitan, David and Sim, Robert. Synthetic Text Generation with Differential Privacy: A Simple and Practical Recipe. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2023

  34. [35]

    Differentially Private Fine-tuning of Language Models

    Differentially Private Fine-tuning of Language Models. International Conference on Learning Representations. 2022

  35. [36]

    LoRA: Low-Rank Adaptation of Large Language Models

    Hu, Edward J. and Shen, Yelong and Wallis, Phillip and Allen-Zhu, Zeyuan and Li, Yuanzhi and Wang, Shean and Wang, Lu and Chen, Weizhu. LoRA: Low-Rank Adaptation of Large Language Models. International Conference on Learning Representations. 2022

  36. [37]

    ACM Trans. Manage. Inf. Syst.

    Liu, Xiao-Yang and Zhu, Rongyi and Zha, Daochen and Gao, Jiechao and Zhong, Shan and White, Matt and Qiu, Meikang. ACM Trans. Manage. Inf. Syst. 2024. doi:10.1145/3682068

  37. [38]

    De-identification of French unstructured clinical notes for machine learning tasks

    De-identification of French unstructured clinical notes for machine learning tasks. arXiv preprint arXiv:2209.09631. 2022

  38. [39]

    Detection of Adverse Drug Events in Dutch clinical free text documents using Transformer Models: benchmark study

    Detection of Adverse Drug Events in Dutch clinical free text documents using Transformer Models: benchmark study. arXiv preprint arXiv:2507.19396. 2025

  39. [40]

    Robust Utility-Preserving Text Anonymization Based on Large Language Models

    Yang, Tianyu and Zhu, Xiaodan and Gurevych, Iryna. Robust Utility-Preserving Text Anonymization Based on Large Language Models. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2025. doi:10.18653/v1/2025.acl-long.1404

  40. [41]

    GPT-4 Technical Report

    OpenAI. GPT-4 Technical Report. arXiv preprint arXiv:2303.08774. 2023

  41. [42]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv preprint arXiv:2501.12948. 2025

  42. [43]

    The Llama 3 Herd of Models

    The Llama 3 Herd of Models. arXiv e-prints. 2024

  43. [44]

    MedGemma Technical Report

    MedGemma Technical Report. arXiv preprint arXiv:2507.05201. 2025

  44. [45]

    De-identification of Personal Information

    De-identification of Personal Information. 2015

  45. [46]

    Simple demographics often identify people uniquely

    Sweeney, Latanya. Simple demographics often identify people uniquely. Health (San Francisco). 2000

  46. [47]

    Mamma Mia! Where's My Name? De-Identifying Italian Clinical Notes with Large Language Models

    Miranda, Michele and Bratières, Sébastien and Patarnello, Stefano and Lilli, Livia. Mamma Mia! Where's My Name? De-Identifying Italian Clinical Notes with Large Language Models. Proceedings of the Eleventh Italian Conference on Computational Linguistics (CLiC-it 2025). 2025

  47. [48]

    Proceedings of the First Workshop on Writing Aids at the Crossroads of AI, Cognitive Science and NLP (WRAICOGS 2025). 2025

  48. [49]

    Chain-of-MetaWriting: Linguistic and Textual Analysis of How Small Language Models Write Young Students Texts

    Buhnila, Ioana and Cislaru, Georgeta and Todirascu, Amalia. Chain-of-MetaWriting: Linguistic and Textual Analysis of How Small Language Models Write Young Students Texts. 2025

  49. [50]

    Semantic Masking in a Needle-in-a-haystack Test for Evaluating Large Language Model Long-Text Capabilities

    Shi, Ken and Penn, Gerald. Semantic Masking in a Needle-in-a-haystack Test for Evaluating Large Language Model Long-Text Capabilities. 2025

  50. [51]

    Reading Between the Lines: A dataset and a study on why some texts are tougher than others

    Khallaf, Nouran and Eugeni, Carlo and Sharoff, Serge. Reading Between the Lines: A dataset and a study on why some texts are tougher than others. 2025

  51. [52]

    ParaRev: Building a dataset for Scientific Paragraph Revision annotated with revision instruction

    Jourdan, Léane and Boudin, Florian and Dufour, Richard and Hernandez, Nicolas and Aizawa, Akiko. ParaRev: Building a dataset for Scientific Paragraph Revision annotated with revision instruction. 2025

  52. [53]

    Towards an operative definition of creative writing: a preliminary assessment of creativeness in AI and human texts

    Maggi, Chiara and Vitaletti, Andrea. Towards an operative definition of creative writing: a preliminary assessment of creativeness in AI and human texts. 2025

  53. [54]

    Decoding Semantic Representations in the Brain Under Language Stimuli with Large Language Models

    Sato, Anna and Kobayashi, Ichiro. Decoding Semantic Representations in the Brain Under Language Stimuli with Large Language Models. 2025

  54. [55]

    Proceedings of the 4th Workshop on Arabic Corpus Linguistics (WACL-4). 2025

  55. [56]

    ArabicSense: A Benchmark for Evaluating Commonsense Reasoning in Arabic with Large Language Models

    Lamsiyah, Salima and Zeinalipour, Kamyar and El amrany, Samir and Brust, Matthias and Maggini, Marco and Bouvry, Pascal and Schommer, Christoph. ArabicSense: A Benchmark for Evaluating Commonsense Reasoning in Arabic with Large Language Models. 2025

  56. [57]

    Lahjawi: Arabic Cross-Dialect Translator

    Hamed, Mohamed Motasim and Hreden, Muhammad and Hennara, Khalil and Aldallal, Zeina and Chrouf, Sara and AlModhayan, Safwan. Lahjawi: Arabic Cross-Dialect Translator. 2025

  57. [58]

    Lost in Variation: An Unsupervised Methodology for Mining Lexico-syntactic Patterns in Middle Arabic Texts

    Bezan. Lost in Variation: An Unsupervised Methodology for Mining Lexico-syntactic Patterns in Middle Arabic Texts. 2025

  58. [59]

    SADSLyC: A Corpus for Saudi Arabian Multi-dialect Identification through Song Lyrics

    Alahmari, Salwa Saad. SADSLyC: A Corpus for Saudi Arabian Multi-dialect Identification through Song Lyrics. 2025

  59. [60]

    Enhancing Dialectal Arabic Intent Detection through Cross-Dialect Multilingual Input Augmentation

    Hossain, Shehenaz and Shammary, Fouad and Shammary, Bahaulddin and Afli, Haithem. Enhancing Dialectal Arabic Intent Detection through Cross-Dialect Multilingual Input Augmentation. 2025

  60. [61]

    Dial2MSA-Verified: A Multi-Dialect Arabic Social Media Dataset for Neural Machine Translation to Modern Standard Arabic

    Khered, Abdullah and Benkhedda, Youcef and Batista-Navarro, Riza. Dial2MSA-Verified: A Multi-Dialect Arabic Social Media Dataset for Neural Machine Translation to Modern Standard Arabic. 2025

  61. [62]

    Web-Based Corpus Compilation of the Emirati Arabic Dialect

    El-Ghawi, Yousra A. Web-Based Corpus Compilation of the Emirati Arabic Dialect. 2025

  62. [63]

    Evaluating Calibration of Arabic Pre-trained Language Models on Dialectal Text

    Al-Laith, Ali and Kebdani, Rachida. Evaluating Calibration of Arabic Pre-trained Language Models on Dialectal Text. 2025

  63. [64]

    Empirical Evaluation of Pre-trained Language Models for Summarizing Moroccan Darija News Articles

    Aftiss, Azzedine and Lamsiyah, Salima and Schommer, Christoph and El Alaoui, Said Ouatik. Empirical Evaluation of Pre-trained Language Models for Summarizing Moroccan Darija News Articles. 2025

  64. [65]

    Dialect2SQL: A Novel Text-to-SQL Dataset for Arabic Dialects with a Focus on Moroccan Darija

    Chafik, Salmane and Ezzini, Saad and Berrada, Ismail. Dialect2SQL: A Novel Text-to-SQL Dataset for Arabic Dialects with a Focus on Moroccan Darija. 2025

  65. [66]

    AraSim: Optimizing Arabic Dialect Translation in Children's Literature with LLMs and Similarity Scores

    Bouomar, Alaa Hassan and Abbas, Noorhan. AraSim: Optimizing Arabic Dialect Translation in Children's Literature with LLMs and Similarity Scores. 2025

  66. [67]

    Navigating Dialectal Bias and Ethical Complexities in Levantine Arabic Hate Speech Detection

    Haj Ahmed, Ahmed and Yew, Rui-Jie and Minocher, Xerxes and Venkatasubramanian, Suresh. Navigating Dialectal Bias and Ethical Complexities in Levantine Arabic Hate Speech Detection. 2025

  67. [68]

    Proceedings of the 12th Workshop on NLP for Similar Languages, Varieties and Dialects. 2025

  68. [69]

    Findings of the VarDial Evaluation Campaign 2025: The NorSID Shared Task on Norwegian Slot, Intent and Dialect Identification

    Scherrer, Yves and van der Goot, Rob and Mæhlum, Petter. Findings of the VarDial Evaluation Campaign 2025: The NorSID Shared Task on Norwegian Slot, Intent and Dialect Identification. 2025

  69. [70]

    Information Theory and Linguistic Variation: A Study of Brazilian and European Portuguese

    Alves, Diego. Information Theory and Linguistic Variation: A Study of Brazilian and European Portuguese. 2025

  70. [71]

    Leveraging Open-Source Large Language Models for Native Language Identification

    Ng, Yee Man and Markov, Ilia. Leveraging Open-Source Large Language Models for Native Language Identification. 2025

  71. [72]

    Adapting Whisper for Regional Dialects: Enhancing Public Services for Vulnerable Populations in the United Kingdom

    Torgbi, Melissa and Clayman, Andrew and Speight, Jordan J. and Tayyar Madabushi, Harish. Adapting Whisper for Regional Dialects: Enhancing Public Services for Vulnerable Populations in the United Kingdom. 2025

  72. [73]

    Large Language Models as a Normalizer for Transliteration and Dialectal Translation

    Alam, Md Mahfuz Ibn and Anastasopoulos, Antonios. Large Language Models as a Normalizer for Transliteration and Dialectal Translation. 2025

  73. [74]

    Testing the Boundaries of LLMs: Dialectal and Language-Variety Tasks

    Faisal, Fahim and Anastasopoulos, Antonios. Testing the Boundaries of LLMs: Dialectal and Language-Variety Tasks. 2025

  74. [75]

    Text Generation Models for Luxembourgish with Limited Data: A Balanced Multilingual Strategy

    Plum, Alistair and Ranasinghe, Tharindu and Purschke, Christoph. Text Generation Models for Luxembourgish with Limited Data: A Balanced Multilingual Strategy. 2025

  75. [76]

    Retrieval of Parallelizable Texts Across Church Slavic Variants

    Lendvai, Piroska and Reichel, Uwe and Jouravel, Anna and Rabus, Achim and Renje, Elena. Retrieval of Parallelizable Texts Across Church Slavic Variants. 2025

  76. [77]

    Neural Text Normalization for Luxembourgish Using Real-Life Variation Data

    Lutgen, Anne-Marie and Plum, Alistair and Purschke, Christoph and Plank, Barbara. Neural Text Normalization for Luxembourgish Using Real-Life Variation Data. 2025

  77. [78]

    Improving Dialectal Slot and Intent Detection with Auxiliary Tasks: A Multi-Dialectal Bavarian Case Study

    Kr. Improving Dialectal Slot and Intent Detection with Auxiliary Tasks: A Multi-Dialectal Bavarian Case Study. 2025

  78. [79]

    Regional Distribution of the /el/-/æl/ Merger in Australian English

    Coats, Steven and Diskin-Holdaway, Chloé and Loakes, Debbie. Regional Distribution of the /el/-/æl/ Merger in Australian English. 2025

  79. [80]

    Learning Cross-Dialectal Morphophonology with Syllable Structure Constraints

    Khalifa, Salam and Qaddoumi, Abdelrahim and Kodner, Jordan and Rambow, Owen. Learning Cross-Dialectal Morphophonology with Syllable Structure Constraints. 2025

  80. [81]

    Common Ground, Diverse Roots: The Difficulty of Classifying Common Examples in Spanish Varieties

    Lopetegui, Javier A. and Riabi, Arij and Seddah, Djamé. Common Ground, Diverse Roots: The Difficulty of Classifying Common Examples in Spanish Varieties. 2025

Showing first 80 references.