GRUFF: LLM Pronoun Fidelity, Reasoning, and Biases in German

Anne Lauscher; Fabian Mewes; Vagrant Gautam

arXiv: 2605.30214 · v1 · pith:R2TMPTIYnew · submitted 2026-05-28 · 💻 cs.CL

Pith reviewed 2026-06-29 07:37 UTC · model grok-4.3

keywords: pronoun fidelity · German language models · grammatical gender · neopronouns · referential reasoning · gender bias · LLM evaluation

The pith

LLMs maintain strong grammatical agreement for masculine and feminine pronouns in German but fail with neopronouns xier and en, and most models lose robustness when distractors appear.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the GRUFF dataset to test pronoun fidelity, which measures whether models reuse a previously specified pronoun for an entity even when other entities intervene. It finds that models follow masculine and feminine pronouns according to German's grammatical gender rules without extra context, yet do not do so for the neopronouns xier and en. Models generally lose track when distractors are added, though encoder-only models hold up better in German than they do in English. Occupational stereotypes show little consistency across grammatical cases and across different models. These results matter because German requires more gender agreement than English, so English-only studies of bias and reference may miss language-specific patterns in reasoning and inclusion.

Core claim

We present GRUFF, a large-scale dataset covering four gender agreement systems in nouns and four pronoun sets to measure pronoun fidelity in German. Using this dataset we show that LLMs exhibit strong grammatical agreement for masculine and feminine entities in the absence of explicit context, but not for neopronouns xier and en. Models are generally not robust to distractors, but encoder-only models are more robust in German than in English, reflecting the importance of grammatical gender. Occupational stereotypes in this context are poorly correlated across grammatical cases and across most models, except ones with closely related architectures.

What carries the argument

The GRUFF dataset, which isolates the task of correctly reusing a specified pronoun for a discourse entity despite intervening distractors across different noun gender classes and pronoun types including neopronouns.
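
To make the task shape concrete, here is a minimal Python sketch of how one fidelity item could be assembled and scored. The sentences, the `FidelityItem` class, and the `is_faithful` helper are hypothetical illustrations and are not taken from GRUFF; only the structure (an introduction that fixes the pronoun, intervening distractor sentences about another entity, and a task sentence with a blank) follows the description above.

```python
from dataclasses import dataclass, field


@dataclass
class FidelityItem:
    """One hypothetical pronoun-fidelity item (not an actual GRUFF item)."""
    intro: str      # establishes the target entity's pronoun
    task: str       # mentions the target again, with a blank to fill
    expected: str   # pronoun that should fill the blank
    distractors: list[str] = field(default_factory=list)  # about another entity

    def prompt(self) -> str:
        # Order matters: introduction, then distractors, then the task sentence.
        return " ".join([self.intro, *self.distractors, self.task])


def is_faithful(item: FidelityItem, prediction: str) -> bool:
    """True if the prediction reuses the pronoun fixed in the introduction."""
    return prediction.strip().lower() == item.expected.lower()


# Invented example sentences, for illustration only.
item = FidelityItem(
    intro="Die Buchhalterin hatte lange gearbeitet, deshalb war sie müde.",
    distractors=["Der Koch hatte nichts gegessen, deshalb war er hungrig."],
    task="Die Buchhalterin wurde gefragt, ob ___ morgen Zeit hat.",
    expected="sie",
)

print(item.prompt())
print(is_faithful(item, "sie"))  # target pronoun reused
print(is_faithful(item, "er"))   # distractor's pronoun copied instead
```

Leaving `distractors` empty gives the no-distractor condition, so the same item can be scored with and without intervening entities.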

If this is right

Models will correctly reuse masculine and feminine pronouns in German text without distractors.
Neopronouns xier and en will not be maintained reliably by current LLMs in German discourse.
Encoder-only models will show greater robustness to distractors in German than they do in English.
Occupational stereotypes will show low correlation across different grammatical cases for most models.
Bias patterns measured in German will differ from those in English because of richer grammatical gender.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The greater robustness of encoder-only models in German suggests that explicit grammatical gender marking during training improves handling of referential distractors.
The dataset could support development of fine-tuning methods that raise neopronoun fidelity while preserving agreement on traditional pronouns.
Because stereotypes correlate poorly across cases, bias audits in German should test multiple grammatical forms rather than nominative alone.
Similar fidelity gaps may appear in other languages with grammatical gender such as French or Spanish, warranting parallel datasets.
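
The suggestion to audit several grammatical cases can be sketched as a rank-correlation check between per-occupation stereotype scores in two cases. The code below is a plain-Python Spearman correlation (Pearson correlation of average ranks); the score definition and all numbers are made up for illustration and are not from the paper.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation; assumes neither input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Spearman's rho: Pearson correlation computed on average ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Invented stereotype scores per occupation (for example, feminine minus
# masculine resolution rate), one list per case, same occupation order.
nominative = [0.30, -0.10, 0.05, 0.22, -0.40]
accusative = [-0.05, 0.12, 0.28, -0.20, 0.01]

print(f"rho(nominative, accusative) = {spearman(nominative, accusative):.2f}")
```

With real data, the same function would be run for each pair of cases and each model; a low coefficient would mean an audit in one case says little about the others.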

Load-bearing premise

That the GRUFF items isolate pronoun fidelity from other discourse factors, and that the chosen occupational stereotypes and distractors are representative of real usage patterns in German.

What would settle it

A controlled test in which current LLMs reuse xier and en at rates matching er and sie in sentences without distractors would falsify the claim of differential agreement for neopronouns.
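
A check of this kind reduces to comparing reuse rates per pronoun set. The sketch below shows the bookkeeping, assuming model outputs are available as (pronoun set, expected pronoun, predicted pronoun) triples; the record format, the `reuse_rates` function, and the toy records are assumptions for illustration and contain no results from the paper.

```python
from collections import defaultdict


def reuse_rates(records):
    """Fraction of items per pronoun set where the expected pronoun was reused.

    records: iterable of (pronoun_set, expected, predicted) string triples.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for pronoun_set, expected, predicted in records:
        totals[pronoun_set] += 1
        if predicted.strip().lower() == expected.lower():
            hits[pronoun_set] += 1
    return {name: hits[name] / totals[name] for name in totals}


# Toy records, invented for illustration only.
records = [
    ("er", "er", "er"),
    ("er", "er", "er"),
    ("sie", "sie", "sie"),
    ("sie", "sie", "er"),
    ("xier", "xier", "xier"),
    ("xier", "xier", "sie"),
    ("en", "en", "er"),
    ("en", "en", "er"),
]

rates = reuse_rates(records)
traditional = (rates["er"] + rates["sie"]) / 2
neopronoun = (rates["xier"] + rates["en"]) / 2
print(rates)
print(f"gap between er/sie and xier/en: {traditional - neopronoun:.2f}")
```

A gap near zero on distractor-free items would count against the differential-agreement claim. A real comparison would also need uncertainty estimates, which this sketch omits.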

Figures

Figures reproduced from arXiv: 2605.30214 by Anne Lauscher, Fabian Mewes, Vagrant Gautam.

**Figure 1.** Examples from our proposed dataset, GRUFF (§4), summarizing the main contributions of this paper. In all cases, the blank should be filled with sie (she).

**Figure 2.** Overview of the creation of each task instance in …

**Figure 3.** Model accuracy on grammatical gender agreement with and without context. Without context, models …

**Figure 4.** Accuracy at pronoun fidelity with an introductory sentence and …

**Figure 5.** Pronoun fidelity split by pronoun set and by grammatical case. Overall models are best at reusing …

**Figure 6.** Top 10 occupations over-resolved to accusative pronouns sie (feminine) on the right, versus ihn (masculine) on the left, in a pattern indicative of stereotyping. Results with other grammatical cases are shown in Appendix E.

[Plot labels recovered from the extracted text: N-A, N-D, N-G, A-D, A-G, D-G; GBERT-base, GBERT-large, mBERT, XLM-RoBERTa, SauerkrautLM-8B, SauerkrautLM-70B, Llama-70B. Cell values are truncated in the source.]

**Figure 7.** Spearman’s correlation between the stereotyp…

**Figure 9.** Top 10 occupations over-resolved to nominative pronouns …

**Figure 10.** Top 10 occupations over-resolved to dative pronouns …

**Figure 11.** Top 10 occupations over-resolved to genitive pronouns …

**Figure 12.** Spearman’s correlation between the stereo…

**Figure 13.** Spearman’s correlation between the stereo…

Original abstract

Third-person singular pronouns have long been used to study stereotypical biases in language models and to test their abilities to reason about reference. More recently, the interplay between reasoning and bias has been investigated with the task of pronoun fidelity, which assesses models' abilities to correctly reuse a previously-specified pronoun for a discourse entity, independent of other potentially distracting discourse entities mentioned in between. However, such research focuses on English, which is a language with limited grammatical gender and almost no gender agreement. In this paper we contribute a novel, large-scale dataset, GRUFF, to measure pronoun fidelity in German, covering four different gender agreement systems in nouns, and four sets of pronouns. With this dataset, we show that LLMs show strong grammatical agreement for masculine and feminine entities in the absence of explicit context, but not for neopronouns xier and en. Models are generally not robust to distractors, but encoder-only models are more robust in German than in English, reflecting the importance of grammatical gender. Finally, we show that occupational stereotypes in this context are poorly correlated across grammatical cases, and across most models, except ones with closely related architectures. We release all code and data to encourage further work on gender-inclusive language and referential reasoning in German.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

GRUFF gives a usable new dataset for German pronoun tests but the robustness and stereotype claims rest on unshown details of item construction.

The paper's core move is releasing GRUFF, a large dataset that applies pronoun fidelity testing to German across four noun gender agreement systems and four pronoun sets, including the neopronouns xier and en. That extension from English-only work is the actual addition.

It does a few things cleanly. The setup lets them compare grammatical agreement strength for masculine and feminine referents against neopronouns, and it checks robustness when distractors are inserted. The finding that encoder-only models hold up better in German than prior English results is worth noting, since German has richer agreement. Releasing code and data is the right step for follow-up work.

The soft spots are in the experimental controls. The abstract states that items isolate fidelity from other discourse factors, but without the actual item templates or distractor selection criteria it is hard to judge whether occupational stereotypes and case variations are representative or just convenient. The claim that stereotypes correlate poorly across grammatical cases is reported, yet it could shift with different occupation lists or model families. These are not fatal, but they mean the patterns are tied to the specific dataset choices.

This paper is for people already working on multilingual LLM evaluation and gender-inclusive language tasks. The dataset itself is the part that could get used. It is coherent enough on its own terms to go to referees rather than desk reject, though any review would need to press on the item construction details and whether the German-English model comparison controls for architecture differences.

Referee Report

0 major / 3 minor

Summary. The paper introduces the GRUFF dataset to measure the pronoun fidelity of LLMs in German across four gender agreement systems in nouns and four pronoun sets. It claims LLMs exhibit strong grammatical agreement for masculine and feminine entities without explicit context but not for neopronouns xier and en; models are generally not robust to distractors, though encoder-only models show greater robustness in German than in English; and occupational stereotypes are poorly correlated across grammatical cases and across most models except those with closely related architectures. All code and data are released.

Significance. If the empirical results hold, this extends English-centric pronoun fidelity and bias research to a language with rich grammatical gender, providing evidence on how grammatical features affect referential reasoning and neopronoun handling in LLMs. The release of code and data is a clear strength supporting reproducibility and further work on gender-inclusive language. The model-family contrast offers a falsifiable benchmark for multilingual evaluation.

minor comments (3)

Abstract: the four pronoun sets are referenced but not named; listing them (or adding a short table) would improve immediate clarity for readers.
Dataset description: include one or two concrete GRUFF example items in the main text (rather than only in supplementary material) to illustrate how distractors and occupational stereotypes are instantiated.
Results: the claim of 'poorly correlated' stereotypes across cases would be strengthened by reporting the exact correlation coefficients and any statistical tests used.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive summary of the GRUFF dataset and its contributions to multilingual pronoun fidelity research, as well as for recommending minor revision. No specific major comments were raised in the report.

Circularity Check

0 steps flagged

No significant circularity

The paper is a purely empirical study that introduces the GRUFF dataset and reports LLM evaluation results on pronoun fidelity across German gender systems. No derivations, equations, fitted parameters, or first-principles predictions are present that could reduce to inputs by construction. Claims about model behavior follow directly from the stated experimental contrasts on the released data, with no self-citation chains or ansatzes invoked as load-bearing support. This is a standard falsifiable empirical evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

No mathematical model; the work rests on the domain assumption that pronoun fidelity tasks validly separate reasoning from bias.

axioms (1)

Domain assumption: pronoun fidelity is a valid measure of referential reasoning independent of stereotypical bias.
Role: central to the task definition and evaluation in the abstract.
