Topics as Proxies for Sociodemographics: How Conversational Context Affects LLM Answers

Arianna Bisazza; Gabriele Sarti; Raquel Fernández; Vera Neplenbroek

arxiv: 2606.02776 · v3 · pith:GQEFZSI5new · submitted 2026-06-01 · 💻 cs.CL

Vera Neplenbroek, Gabriele Sarti, Arianna Bisazza, Raquel Fernández

Pith reviewed 2026-06-28 14:36 UTC · model grok-4.3

classification 💻 cs.CL

keywords LLM advice · sociodemographics · conversational context · topic proxies · advice disparities · linguistic features · high-stakes scenarios

The pith

Conversation topics predict LLM advice better than user sociodemographics.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether differences in LLM advice across users stem mainly from inferred sociodemographic traits or from other aspects of the conversation. It shows that models infer those traits poorly from a single history and that group-level disparities in advice remain small. Instead, the topics raised in the conversation turn out to be the strongest predictor of what advice the model produces. These topics often stand in for demographic groups and shift the advice in ways that are hard to anticipate. The comparison is made by measuring how well sociodemographics versus linguistic features such as topic, emotion, and readability explain the outputs in high-stakes domains like legal and medical advice.

Core claim

Although disparities between sociodemographic groups exist in LLM advice, they are minimal in magnitude, and LLMs struggle to infer user sociodemographics from a single conversation history. Within a conversational context, conversation topics are the most predictive of LLM-generated advice; to some extent they function as proxies for sociodemographic groups, and they often affect advice in unpredictable ways.

What carries the argument

Predictive comparison of user sociodemographics against (psycho)linguistic features of the conversation (topic, emotions, readability) to determine which best accounts for variation in LLM advice.
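
One way to read such a comparison is sketched below, under assumptions: once a sparse linear model has been fit on all features, its coefficients can be grouped by feature family and summarized by absolute magnitude. The coefficient values are invented for illustration, the prefix-to-family mapping is an assumption modeled on the naming style of the paper's ElasticNet feature plots, and `family_of` and `summarize` are hypothetical helpers rather than the authors' code.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical coefficients from an already fitted linear model (values are made up).
coefs = {
    'topic_"Job Search"': 4.0,
    'topic_"Travel Recommendations"': -2.0,
    "gender_Male": 0.5,
    "education_Completed Secondary School": -0.25,
    "model_response_liwc_Social": 1.0,
    "user_prompt_liwc_WPS": -0.5,
}

# Assumed mapping from feature-name prefix to feature family.
PREFIX_TO_FAMILY = [
    ("topic_", "topic"),
    ("gender_", "demographics"),
    ("education_", "demographics"),
    ("model_response_liwc_", "liwc"),
    ("user_prompt_liwc_", "liwc"),
]


def family_of(name):
    """Return the feature family for a feature name, or 'other' if no prefix matches."""
    for prefix, family in PREFIX_TO_FAMILY:
        if name.startswith(prefix):
            return family
    return "other"


def summarize(coefficients):
    """Max and mean absolute coefficient per feature family."""
    grouped = defaultdict(list)
    for name, value in coefficients.items():
        grouped[family_of(name)].append(abs(value))
    return {fam: {"max": max(vals), "mean": mean(vals)} for fam, vals in grouped.items()}


summary = summarize(coefs)
# Print families from largest to smallest maximum coefficient magnitude.
for fam, stats in sorted(summary.items(), key=lambda kv: -kv[1]["max"]):
    print(f"{fam:13s} max={stats['max']:.2f} mean={stats['mean']:.2f}")
```

Coefficient magnitudes are only comparable across families when the features share a scale, so a real analysis would standardize inputs before fitting.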

If this is right

Disparities in LLM advice between sociodemographic groups are minimal in magnitude.
LLMs struggle to infer user sociodemographics from a single conversation history.
Conversation topics affect LLM advice in unpredictable ways.
Research is needed to understand and mitigate the effect of conversational context on LLM outputs in high-stakes scenarios.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Developers testing for demographic fairness in LLMs may need to control for topic to avoid mistaking topic effects for demographic bias.
Users who raise different topics could receive inconsistent advice even when they share the same sociodemographic profile.
Mitigation efforts might focus on making models less sensitive to topic shifts rather than on demographic balancing alone.

Load-bearing premise

That measuring sociodemographics against the chosen set of linguistic features is sufficient to identify the main driver of any disparities in LLM advice.

What would settle it

A controlled test in which sociodemographic groups produce large differences in advice even after conversation topic is held fixed across groups.
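
A minimal sketch of such a test, assuming numeric advice (for example a recommended salary) and exactly two groups labeled `"A"` and `"B"`: compare the raw gap between groups with the gap computed inside each topic and then averaged. The records, labels, and function names are hypothetical and chosen so that the two groups simply raise different topics.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (topic, group, numeric advice). Values are made up.
records = [
    ("job_search", "A", 80.0),
    ("job_search", "B", 80.0), ("job_search", "B", 80.0), ("job_search", "B", 80.0),
    ("travel", "A", 60.0), ("travel", "A", 60.0), ("travel", "A", 60.0),
    ("travel", "B", 60.0),
]


def group_gap(rows):
    """Mean advice for group A minus mean advice for group B."""
    by_group = defaultdict(list)
    for _, group, value in rows:
        by_group[group].append(value)
    return mean(by_group["A"]) - mean(by_group["B"])


def stratified_gap(rows):
    """Size-weighted average of within-topic gaps (topic held fixed).

    Assumes every topic contains at least one record from each group.
    """
    by_topic = defaultdict(list)
    for row in rows:
        by_topic[row[0]].append(row)
    total = len(rows)
    return sum(group_gap(r) * len(r) for r in by_topic.values()) / total


raw = group_gap(records)
within_topic = stratified_gap(records)
print(f"raw gap: {raw:+.1f}, gap with topic held fixed: {within_topic:+.1f}")
```

In this toy data the raw gap comes entirely from topic mix. A large gap that survived stratification would be the kind of evidence that counts against the paper's claim.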

Figures

Figures reproduced from arXiv: 2606.02776 by Arianna Bisazza, Gabriele Sarti, Raquel Fernández, Vera Neplenbroek.

**Figure 1.** Conversation histories from the PRISM dataset, followed by a high-stakes question from the salary domain of SBB and responses by Qwen 3.6 27B. The main predictors of differences in salary are whether the conversation is about job search or travel, not the user’s age or gender.

**Figure 2.** Significant differences in each model’s av…

**Figure 3.** Linear probing macro F1 scores for Gemma on the Community Alignment Dataset for unbalanced classes.

**Figure 4.** Average difference in Gemma’s predictions between two users from the same / a different sociodemo…

**Figure 5.** Top 20 ElasticNet features by coefficient magnitude for Llama’s salary predictions on the Community…

**Figure 6.** Model behavior for conversations from the Community Alignment Dataset and questions about government…

**Figure 7.** Model behavior for conversations from the Community Alignment Dataset and questions about legal…

**Figure 8.** Model behavior for conversations from the Community Alignment Dataset and questions about medical…

**Figure 9.** Model behavior for conversations from the Community Alignment Dataset and questions about political…

**Figure 10.** Model behavior for conversations from the Community Alignment Dataset and questions about salary…

**Figure 11.** Model behavior for conversations from PRISM and questions about government benefits.

**Figure 12.** Model behavior for conversations from PRISM and questions about legal advice.

**Figure 13.** Model behavior for conversations from PRISM and questions about medical advice.

**Figure 14.** Model behavior for conversations from PRISM and questions about political topics.

**Figure 15.** Model behavior for conversations from PRISM and questions about salary recommendations.

**Figure 16.** Confusion matrices for Kimi’s predictions. Kimi tends to overpredict the majority class: It often predicts…

**Figure 17.** Linear probing macro F1 scores for Gemma on the Community Alignment Dataset for balanced classes.

**Figure 18.** Linear probing macro F1 scores for Gemma on PRISM for unbalanced classes. A blue circle indicates…

**Figure 19.** Linear probing macro F1 scores for Gemma on PRISM for balanced classes. A blue circle indicates the…

**Figure 20.** Linear probing macro F1 scores for Llama on the Community Alignment Dataset for unbalanced classes.

**Figure 21.** Linear probing macro F1 scores for Llama on the Community Alignment Dataset for balanced classes. A…

**Figure 22.** Linear probing macro F1 scores for Llama on PRISM for unbalanced classes. A blue circle indicates the…

**Figure 23.** Linear probing macro F1 scores for Llama on PRISM for balanced classes. A blue circle indicates the…

**Figure 24.** Top 20 ElasticNet features by coefficient magnitude for Gemma’s government benefits predictions on the Community Alignment Dataset.

**Figure 25.** Top 20 ElasticNet features by coefficient magnitude for Gemma’s legal predictions on the Community Alignment Dataset.

**Figure 26.** Top 20 ElasticNet features by coefficient magnitude for Gemma’s medical predictions on the Community Alignment Dataset.

**Figure 27.** Top 20 ElasticNet features by coefficient magnitude for Gemma’s political predictions on the Community Alignment Dataset.

**Figure 28.** Top 20 ElasticNet features by coefficient magnitude for Gemma’s salary predictions on the Community Alignment Dataset.

**Figure 29.** Top 20 ElasticNet features by coefficient magnitude for Gemma’s government benefits predictions on PRISM.

**Figure 30.** Top 20 ElasticNet features by coefficient magnitude for Gemma’s legal predictions on PRISM.

**Figure 31.** Top 20 ElasticNet features by coefficient magnitude for Gemma’s medical predictions on PRISM.

**Figure 32.** Top 20 ElasticNet features by coefficient magnitude for Gemma’s political predictions on PRISM.

**Figure 33.** Top 20 ElasticNet features by coefficient magnitude for Gemma’s salary predictions on PRISM.

**Figure 34.** Top 20 ElasticNet features by coefficient magnitude for Llama’s government benefits predictions on the Community Alignment Dataset.

**Figure 35.** Top 20 ElasticNet features by coefficient magnitude for Llama’s legal predictions on the Community Alignment Dataset.

**Figure 36.** Top 20 ElasticNet features by coefficient magnitude for Llama’s medical predictions on the Community Alignment Dataset.

**Figure 37.** Top 20 ElasticNet features by coefficient magnitude for Llama’s political predictions on the Community Alignment Dataset.

**Figure 38.** Top 20 ElasticNet features by coefficient magnitude for Llama’s government benefits predictions on PRISM.

**Figure 39.** Top 20 ElasticNet features by coefficient magnitude for Llama’s legal predictions on PRISM.

**Figure 40.** Top 20 ElasticNet features by coefficient magnitude for Llama’s medical predictions on PRISM.

**Figure 41.** Top 20 ElasticNet features by coefficient magnitude for Llama’s political predictions on PRISM.

**Figure 42.** Top 20 ElasticNet features by coefficient magnitude for Llama’s salary predictions on PRISM.

**Figure 43.** Top 20 ElasticNet features by coefficient magnitude for Qwen’s government benefits predictions on the Community Alignment Dataset.

**Figure 44.** Top 20 ElasticNet features by coefficient magnitude for Qwen’s legal predictions on the Community Alignment Dataset.

**Figure 45.** Top 20 ElasticNet features by coefficient magnitude for Qwen’s medical predictions on the Community Alignment Dataset.

**Figure 46.** Top 20 ElasticNet features by coefficient magnitude for Qwen’s political predictions on the Community Alignment Dataset.

**Figure 47.** Top 20 ElasticNet features by coefficient magnitude for Qwen’s salary predictions on the Community Alignment Dataset.

**Figure 48.** Top 20 ElasticNet features by coefficient magnitude for Qwen’s government benefits predictions on PRISM.

**Figure 49.** Top 20 ElasticNet features by coefficient magnitude for Qwen’s legal predictions on PRISM.

**Figure 50.** Top 20 ElasticNet features by coefficient magnitude for Qwen’s medical predictions on PRISM.

**Figure 51.** Top 20 ElasticNet features by coefficient magnitude for Qwen’s political predictions on PRISM.

**Figure 52.** Top 20 ElasticNet features by coefficient magnitude for Qwen’s salary predictions on PRISM.

Table fragment (Max / Mean per feature group), Community Alignment Dataset, Benefits domain, Gemma: Topic 18.62 / 2.12; Demograp. 0.52 / 0.14; Emotion 0.28 / 0.08; Polite. 0.10 / 0.05; Sent. 0.38 / 0.20; Concrete. 0.19 / 0.19; Reading Ease 0.23 / 0.21; LIWC 1.49 / 0.17; Ling. 1.01 / 0.35. Llama 18…

**Figure 53.** Average difference in Llama’s predictions between two users from the same / a different sociodemographic…

**Figure 54.** Average difference in Qwen’s predictions between two users from the same / a different sociodemographic…

**Figure 55.** Model behavior with mitigation prompt for conversations from PRISM and questions about government…

**Figure 56.** Model behavior with mitigation prompt for conversations from PRISM and questions about political…

**Figure 57.** Model behavior with mitigation prompt for conversations from PRISM and questions about salary…

Original abstract

When large language models (LLMs) are used in high-stakes scenarios, such as legal, medical and financial advice, even a single conversation history is enough to drive differences in outcomes between users. Prior work has demonstrated that this results in outcome disparities between sociodemographic groups, with some groups receiving more advantageous outcomes than others. In this work, we demonstrate that LLMs actually struggle to infer user sociodemographics from a single conversation history and that although there are disparities between sociodemographic groups, they are minimal in magnitude. To investigate what the main driver of these disparities is, we compare user sociodemographics to a range of (psycho)linguistic features of conversations, including conversation topic, emotions, and readability. We find that conversation topics are most predictive of LLM-generated advice within a conversational context, which, to some extent, function as proxies for sociodemographic groups and often affect advice in unpredictable ways. This is cause for concern and highlights the need for future research to better understand and, if needed, mitigate the effect of conversational context on LLM outputs in high-stakes scenarios.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

The paper shows LLMs infer sociodemographics poorly from one conversation and that topics drive advice differences more than direct group signals, but the ranking of topics as the top driver rests on a narrow set of compared features.

The main point here is that LLMs do not pick up user sociodemographics reliably from conversation history, group-level disparities in advice turn out small, and topic is the strongest predictor among the features they tested. That last part matters for anyone deploying these models in legal or medical settings.

What the work adds is a direct test of inference failure plus a head-to-head comparison of sociodemographics against topic, emotions, and readability. The result that topics function as proxies and affect outputs in uneven ways is a concrete step beyond earlier disparity reports.

The comparison itself is the soft spot. They conclude topics are most predictive after pitting them against only three other measured things. If lexical patterns, turn length, or model priors explain more variance, the ranking could change. The abstract gives no sample sizes, model versions, or exact prediction metrics, so it is difficult to judge how cleanly the isolation was done. That concern from the stress-test note lands; the feature set is limited by design.

The paper is aimed at people who study or deploy conversational LLMs in high-stakes domains. A reader already working on bias measurement will get a usable data point on proxies, even if they want tighter controls. The question is practical and the empirical framing is honest, so the manuscript deserves a full referee pass rather than a desk rejection. I would send it out.

Referee Report

3 major / 2 minor

Summary. The paper claims that LLMs struggle to infer user sociodemographics from a single conversation history, that outcome disparities across sociodemographic groups are minimal in magnitude, and that conversation topics are the most predictive factor (among topics, emotions, and readability) of LLM-generated advice, functioning as proxies for sociodemographic groups and affecting advice in unpredictable ways.

Significance. If the empirical ranking of predictive power holds under a more exhaustive feature set, the work would usefully shift focus from direct demographic inference to contextual proxies in high-stakes LLM advice, providing a concrete empirical basis for studying topic-driven disparities and motivating targeted mitigation research.

major comments (3)

[§4] §4 (feature comparison): the claim that topics are 'most predictive' rests on a comparison limited to sociodemographics, emotions, and readability. Without an ablation that includes additional variables such as lexical n-grams, conversation length, or model priors, it is unclear whether the observed ranking would survive a broader feature set; this directly affects the central proxy conclusion.
[Results] Results on inference accuracy: the statement that LLMs 'struggle to infer' sociodemographics requires explicit metrics (e.g., F1 or AUC per demographic category) and controls for class imbalance; the abstract alone does not report these values, leaving the 'struggle' claim unquantified relative to chance or trivial baselines.
[Results] Disparity magnitude: the assertion that disparities are 'minimal' needs a concrete effect-size threshold or comparison to prior work; without reported confidence intervals or standardized differences, it is difficult to assess whether the minimal-magnitude claim is robust or sensitive to the chosen advice domains.
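
The inference-accuracy comment can be made concrete with a small sketch: under class imbalance, macro F1 compared against a majority-class baseline shows whether predictions carry any signal beyond guessing the largest class. The age-bracket labels and predictions are invented for illustration, and `macro_f1` is a hand-written helper, not the paper's evaluation code.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # A class with no true or predicted instances counts as 0.0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical, imbalanced ground truth: the middle bracket dominates.
y_true = ["18-24"] * 2 + ["25-34"] * 6 + ["35+"] * 2

# Trivial baseline: always predict the majority class.
majority = max(set(y_true), key=y_true.count)
y_baseline = [majority] * len(y_true)

# Hypothetical model predictions, aligned with y_true.
y_model = ["25-34", "18-24"] + ["25-34"] * 5 + ["35+"] + ["25-34", "35+"]

baseline_f1 = macro_f1(y_true, y_baseline)
model_f1 = macro_f1(y_true, y_model)
accuracy_baseline = sum(t == p for t, p in zip(y_true, y_baseline)) / len(y_true)

print(f"baseline accuracy={accuracy_baseline:.2f}, baseline macro F1={baseline_f1:.2f}")
print(f"model macro F1={model_f1:.2f}")
```

The baseline reaches 60% accuracy while its macro F1 stays low, which is why a "struggle to infer" claim is easier to assess when reported as macro F1 next to such a baseline.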

minor comments (2)

[Methods] Clarify the exact operationalization of 'conversation topic' (e.g., LDA topics, LLM-generated labels, or human annotations) and report inter-annotator agreement if applicable.
[Discussion] The abstract states topics 'often affect advice in unpredictable ways'; provide at least one concrete example of an unpredictable effect with the corresponding prompt and output pair.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive feedback. We address each of the major comments below and indicate where revisions will be made to the manuscript.

Referee: [§4] §4 (feature comparison): the claim that topics are 'most predictive' rests on a comparison limited to sociodemographics, emotions, and readability. Without an ablation that includes additional variables such as lexical n-grams, conversation length, or model priors, it is unclear whether the observed ranking would survive a broader feature set; this directly affects the central proxy conclusion.

Authors: Our analysis focused on a set of features drawn from the psycholinguistic literature that are plausibly linked to sociodemographic differences. We agree that an exhaustive comparison including n-grams and model priors would provide additional robustness; the current results nonetheless show that topics outperform the other features we considered in predictive power. We will add a discussion of this limitation and note that future work could explore broader feature sets. The proxy conclusion is supported within the scope of our comparisons. revision: partial

Referee: [Results] Results on inference accuracy: the statement that LLMs 'struggle to infer' sociodemographics requires explicit metrics (e.g., F1 or AUC per demographic category) and controls for class imbalance; the abstract alone does not report these values, leaving the 'struggle' claim unquantified relative to chance or trivial baselines.

Authors: The full manuscript includes detailed metrics in the results section, including per-category performance and comparisons to baselines. To address the concern, we will revise the abstract to explicitly state the key quantitative findings, such as F1 scores near chance levels after imbalance correction. revision: yes

Referee: [Results] Disparity magnitude: the assertion that disparities are 'minimal' needs a concrete effect-size threshold or comparison to prior work; without reported confidence intervals or standardized differences, it is difficult to assess whether the minimal-magnitude claim is robust or sensitive to the chosen advice domains.

Authors: We will incorporate effect sizes, confidence intervals, and standardized differences in the results. Additionally, we will include comparisons to effect sizes reported in prior studies on LLM-generated disparities to better contextualize the 'minimal' claim. revision: yes
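
As an illustration of what a standardized difference with a confidence interval could look like, the sketch below computes Cohen's d with a pooled standard deviation and a percentile bootstrap interval. The two samples are invented, the function names are hypothetical, and the percentile bootstrap is one common choice of interval, not necessarily what the authors would use.

```python
import random
from statistics import mean, variance


def cohens_d(a, b):
    """Standardized mean difference using the pooled sample variance."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / pooled_var ** 0.5


def bootstrap_ci(a, b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for Cohen's d (resampling each group separately)."""
    rng = random.Random(seed)
    stats = []
    for _ in range(n_boot):
        ra = [rng.choice(a) for _ in a]
        rb = [rng.choice(b) for _ in b]
        try:
            stats.append(cohens_d(ra, rb))
        except ZeroDivisionError:
            continue  # skip degenerate resamples with zero pooled variance
    stats.sort()
    lower = stats[int((alpha / 2) * len(stats))]
    upper = stats[int((1 - alpha / 2) * len(stats)) - 1]
    return lower, upper


# Hypothetical advice values (e.g. recommended salary in thousands) for two groups.
group_a = [71.0, 73.0, 75.0, 77.0, 79.0]
group_b = [70.0, 72.0, 74.0, 76.0, 78.0]

d = cohens_d(group_a, group_b)
lower, upper = bootstrap_ci(group_a, group_b)
print(f"Cohen's d = {d:.3f}, 95% bootstrap CI = [{lower:.3f}, {upper:.3f}]")
```

With samples this small the interval is wide, so an interval that includes zero would support reading the group difference as minimal only in the sense of "not distinguishable from none".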

Circularity Check

0 steps flagged

No significant circularity; empirical feature comparison

The paper conducts a direct empirical comparison of sociodemographic variables against measured (psycho)linguistic features (topic, emotions, readability) to assess predictive power over LLM advice outputs. No equations, parameter fitting followed by renamed predictions, self-definitional constructs, or load-bearing self-citations appear in the derivation. The central claim that topics are most predictive follows from the authors' own measurements on their collected data without reducing to an input by construction or imported uniqueness result.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Empirical study with no mathematical derivations, free parameters, or postulated entities; all claims rest on experimental comparisons whose details are absent from the abstract.


Reference graph

Works this paper leans on

74 extracted references · 31 canonical work pages · 1 internal anchor

[1]

The AI Gap: How Socioeconomic Status Affects Language Technology Interactions

Bassignana, Elisa and Curry, Amanda Cercas and Hovy, Dirk. The AI Gap: How Socioeconomic Status Affects Language Technology Interactions. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2025. doi:10.18653/v1/2025.acl-long.914

work page doi:10.18653/v1/2025.acl-long.914 2025
[2]

Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

Who's Asking? Investigating Bias Through the Lens of Disability-Framed Queries in LLMs , author=. Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=
[3]

2024 , url=

Elinor Poole-Dayan and Deb Roy and Jad Kabbara , booktitle=. 2024 , url=

2024
[4]

Classist Tools: Social Class Correlates with Performance in NLP

Cercas Curry, Amanda and Attanasio, Giuseppe and Talat, Zeerak and Hovy, Dirk. Classist Tools: Social Class Correlates with Performance in NLP. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2024. doi:10.18653/v1/2024.acl-long.682

work page doi:10.18653/v1/2024.acl-long.682 2024
[5]

Native Design Bias: Studying the Impact of E nglish Nativeness on Language Model Performance

Reusens, Manon and Borchert, Philipp and De Weerdt, Jochen and Baesens, Bart. Native Design Bias: Studying the Impact of E nglish Nativeness on Language Model Performance. Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics...

work page doi:10.18653/v1/2025.findings-ijcnlp.73 2025
[6]

and Narayanan, Arvind , year=

Aylin Caliskan and Joanna J. Bryson and Arvind Narayanan , title =. Science , volume =. 2017 , doi =. https://www.science.org/doi/pdf/10.1126/science.aal4230 , abstract =

work page doi:10.1126/science.aal4230 2017
[7]

``You Gotta be a Doctor, Lin'' : An Investigation of Name-Based Bias of Large Language Models in Employment Recommendations

Nghiem, Huy and Prindle, John and Zhao, Jieyu and Daum \'e Iii, Hal. ``You Gotta be a Doctor, Lin'' : An Investigation of Name-Based Bias of Large Language Models in Employment Recommendations. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 2024. doi:10.18653/v1/2024.emnlp-main.413

work page doi:10.18653/v1/2024.emnlp-main.413 2024
[8]

The Impact of Name Age Perception on Job Recommendations in LLM s

Kamruzzaman, Mahammed and Kim, Gene Louis. The Impact of Name Age Perception on Job Recommendations in LLM s. Findings of the Association for Computational Linguistics: ACL 2025. 2025. doi:10.18653/v1/2025.findings-acl.778

work page doi:10.18653/v1/2025.findings-acl.778 2025
[9]

Presumed Cultural Identity: How Names Shape LLM Responses

Pawar, Siddhesh Milind and Arora, Arnav and Kaffee, Lucie-Aim \'e e and Augenstein, Isabelle. Presumed Cultural Identity: How Names Shape LLM Responses. Findings of the Association for Computational Linguistics: EMNLP 2025. 2025. doi:10.18653/v1/2025.findings-emnlp.1207

work page doi:10.18653/v1/2025.findings-emnlp.1207 2025
[10]

One Persona, Many Cues, Different Results: How Sociodemographic Cues Impact LLM Personalization

One Persona, Many Cues, Different Results: How Sociodemographic Cues Impact LLM Personalization. 2026.

2026
[11]

Large language models encode clinical knowledge

Large language models encode clinical knowledge. Nature. 2023.

2023
[12]

Probing Classifiers: Promises, Shortcomings, and Advances

Belinkov, Yonatan. Probing Classifiers: Promises, Shortcomings, and Advances. Computational Linguistics. 2022. doi:10.1162/coli_a_00422

work page internal anchor Pith review doi:10.1162/coli_a_00422 2022
[13]

Deep Dominance - How to Properly Compare Deep Neural Models

Dror, Rotem and Shlomov, Segev and Reichart, Roi. Deep Dominance - How to Properly Compare Deep Neural Models. 2019. doi:10.18653/v1/p19-1266

work page doi:10.18653/v1/p19-1266 2019
[14]

Concreteness ratings for 40 thousand generally known English word lemmas

Concreteness ratings for 40 thousand generally known English word lemmas. Behavior Research Methods. 2014.

2014
[15]

Language Models are Unsupervised Multitask Learners

Language Models are Unsupervised Multitask Learners. 2019.

2019
[16]

TweetEval: Unified Benchmark and Comparative Evaluation for Tweet Classification

Barbieri, Francesco and Camacho-Collados, Jose and Espinosa Anke, Luis and Neves, Leonardo. TweetEval: Unified Benchmark and Comparative Evaluation for Tweet Classification. Findings of the Association for Computational Linguistics: EMNLP 2020. 2020. doi:10.18653/v1/2020.findings-emnlp.148

work page doi:10.18653/v1/2020.findings-emnlp.148 2020
[17]

The development and psychometric properties of LIWC-22

The development and psychometric properties of LIWC-22. Austin, TX: University of Texas at Austin.
[18]

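Demarginalizing the Intersection of Race and Sex: A Black Feminist Critique of Antidiscrimination Doctrine, Feminist Theory and Antiracist Politics

Demarginalizing the Intersection of Race and Sex: A Black Feminist Critique of Antidiscrimination Doctrine, Feminist Theory and Antiracist Politics. University of Chicago Legal Forum. 1989.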

1989
[19]

A new readability yardstick

A new readability yardstick. Journal of Applied Psychology. 1948.

1948
[20]

doi:10.5281/zenodo.10009823

Montani, Ines and Honnibal, Matthew and Honnibal, Matthew and Boyd, Adriane and Van Landeghem, Sofie and Peters, Henning. doi:10.5281/zenodo.10009823

work page doi:10.5281/zenodo.10009823
[21]

Language and Social Class

Bernstein, Basil. Language and Social Class.
[22]

Truthful or Fabricated? Using Causal Attribution to Mitigate Reward Hacking in Explanations

Truthful or Fabricated? Using Causal Attribution to Mitigate Reward Hacking in Explanations. The Fourteenth International Conference on Learning Representations.
[23]

Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting

Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting. Thirty-seventh Conference on Neural Information Processing Systems.
[24]

The gender stereotyping of emotions

The gender stereotyping of emotions. Psychology of Women Quarterly. 2000.

2000
[25]

Angry Men, Sad Women: Large Language Models Reflect Gendered Stereotypes in Emotion Attribution

Plaza-del-Arco, Flor Miriam and Cercas Curry, Amanda and Curry, Alba and Abercrombie, Gavin and Hovy, Dirk. Angry Men, Sad Women: Large Language Models Reflect Gendered Stereotypes in Emotion Attribution. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2024. doi:10.18653/v1/2024.acl-long.415

work page doi:10.18653/v1/2024.acl-long.415 2024
[26]

Divine LLaMAs: Bias, Stereotypes, Stigmatization, and Emotion Representation of Religion in Large Language Models

Plaza-del-Arco, Flor Miriam and Curry, Amanda Cercas and Paoli, Susanna and Cercas Curry, Alba and Hovy, Dirk. Divine LLaMAs: Bias, Stereotypes, Stigmatization, and Emotion Representation of Religion in Large Language Models. Findings of the Association for Computational Linguistics: EMNLP 2024. 2024. doi:10.18653/v1/2024.findings-emnlp.251

work page doi:10.18653/v1/2024.findings-emnlp.251 2024
[27]

User-Level Race and Ethnicity Predictors from Twitter Text

Preoţiuc-Pietro, Daniel and Ungar, Lyle. User-Level Race and Ethnicity Predictors from Twitter Text. Proceedings of the 27th International Conference on Computational Linguistics. 2018

2018
[28]

Gender Differences in Language Use: An Analysis of 14,000 Text Samples

Newman, Matthew L. and Groom, Carla J. and Handelman, Lori D. and Pennebaker, James W. Gender Differences in Language Use: An Analysis of 14,000 Text Samples. Discourse Processes. 2008. doi:10.1080/01638530802073712

work page doi:10.1080/01638530802073712 2008
[29]

Kimi K2: Open Agentic Intelligence

Kimi K2: Open Agentic Intelligence. 2026.

2026
[30]

The Need for a Socially-Grounded Persona Framework for User Simulation

The Need for a Socially-Grounded Persona Framework for User Simulation. 2026.

2026
[31]

Transformers: State-of-the-Art Natural Language Processing

Wolf, Thomas and Debut, Lysandre and Sanh, Victor and Chaumond, Julien and Delangue, Clement and Moi, Anthony and Cistac, Pierric and Rault, Tim and Louf, Remi and Funtowicz, Morgan and Davison, Joe and Shleifer, Sam and von Platen, Patrick and Ma, Clara and Jernite, Yacine and Plu, Julien and Xu, Canwen and Le Scao, Teven and Gugger, Sylvain and Drame, M...

work page doi:10.18653/v1/2020.emnlp-demos.6 2020
[32]

The Pluralistic Moral Gap: Understanding Moral Judgment and Value Differences between Humans and Large Language Models

Russo, Giuseppe and Nozza, Debora and R. The Pluralistic Moral Gap: Understanding Moral Judgment and Value Differences between Humans and Large Language Models. Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers). 2026. doi:10.18653/v1/2026.eacl-long.305

work page doi:10.18653/v1/2026.eacl-long.305 2026
[33]

Can Fairness Be Prompted? Prompt-Based Debiasing Strategies in High-Stakes Recommendations

Can Fairness Be Prompted? Prompt-Based Debiasing Strategies in High-Stakes Recommendations. 2026.

2026
[34]

Measuring Mechanistic Independence: Can Bias Be Removed Without Erasing Demographics?

Shan, Zhengyang and Mueller, Aaron. Measuring Mechanistic Independence: Can Bias Be Removed Without Erasing Demographics? Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers). 2026. doi:10.18653/v1/2026.eacl-long.199

work page doi:10.18653/v1/2026.eacl-long.199 2026
[35]

Old Habits Die Hard: How Conversational History Geometrically Traps LLMs

Old Habits Die Hard: How Conversational History Geometrically Traps LLMs. 2026.

2026
[36]

An optimal transportation approach for assessing almost stochastic order

An optimal transportation approach for assessing almost stochastic order. The Mathematics of the Uncertain. 2018.

2018
[37]

deep-significance - Easy and Meaningful Statistical Significance Testing in the Age of Neural Networks

deep-significance - Easy and Meaningful Statistical Significance Testing in the Age of Neural Networks. arXiv preprint arXiv:2204.06815.

arXiv
[38]

ELLA: Empowering LLMs for Interpretable, Accurate and Informative Legal Advice

Hu, Yutong and Luo, Kangcheng and Feng, Yansong. ELLA: Empowering LLMs for Interpretable, Accurate and Informative Legal Advice. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations). 2024. doi:10.18653/v1/2024.acl-demos.36

work page doi:10.18653/v1/2024.acl-demos.36 2024
[39]

JobFair: A Framework for Benchmarking Gender Hiring Bias in Large Language Models

Wang, Ze and Wu, Zekun and Guan, Xin and Thaler, Michael and Koshiyama, Adriano and Lu, Skylar and Beepath, Sachin and Ertekin, Ediz and Perez-Ortiz, Maria. JobFair: A Framework for Benchmarking Gender Hiring Bias in Large Language Models. Findings of the Association for Computational Linguistics: EMNLP 2024. 2024. doi:10.18653/v1/2024.findings-emnlp.184

work page doi:10.18653/v1/2024.findings-emnlp.184 2024
[40]

Stereotype or Personalization? User Identity Biases Chatbot Recommendations

Kantharuban, Anjali and Milbauer, Jeremiah and Sap, Maarten and Strubell, Emma and Neubig, Graham. Stereotype or Personalization? User Identity Biases Chatbot Recommendations. Findings of the Association for Computational Linguistics: ACL 2025. 2025. doi:10.18653/v1/2025.findings-acl.1254

work page doi:10.18653/v1/2025.findings-acl.1254 2025
[41]

No for Some, Yes for Others: Persona Prompts and Other Sources of False Refusal in Language Models

Plaza-del-Arco, Flor Miriam and R. No for Some, Yes for Others: Persona Prompts and Other Sources of False Refusal in Language Models. Proceedings of the 9th Widening NLP Workshop. 2025. doi:10.18653/v1/2025.winlp-main.39

work page doi:10.18653/v1/2025.winlp-main.39 2025
[42]

AI generates covertly racist decisions about people based on their dialect

AI generates covertly racist decisions about people based on their dialect. Nature. 2024.

2024
[43]

Large Language Models Discriminate Against Speakers of German Dialects

Bui, Minh Duc and Holtermann, Carolin and Hofmann, Valentin and Lauscher, Anne and von der Wense, Katharina. Large Language Models Discriminate Against Speakers of German Dialects. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. 2025. doi:10.18653/v1/2025.emnlp-main.415

work page doi:10.18653/v1/2025.emnlp-main.415 2025
[44]

Linguistic Bias in ChatGPT: Language Models Reinforce Dialect Discrimination

Fleisig, Eve and Smith, Genevieve and Bossi, Madeline and Rustagi, Ishita and Yin, Xavier and Klein, Dan. Linguistic Bias in ChatGPT: Language Models Reinforce Dialect Discrimination. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 2024. doi:10.18653/v1/2024.emnlp-main.750

work page doi:10.18653/v1/2024.emnlp-main.750 2024
[45]

Colombian Waitresses y Jueces canadienses: Gender and Country Biases in Occupation Recommendations from LLMs

Rodríguez, Elisa Forcada and Perez-de-Vinaspre, Olatz and Campos, Jon Ander and Klakow, Dietrich and Gautam, Vagrant. Colombian Waitresses y Jueces canadienses: Gender and Country Biases in Occupation Recommendations from LLMs. Proceedings of the 6th Workshop on Gender Bias in Natural Language Processing (GeBNLP). 2025. doi:10.18653/v1/2025.gebnlp-1.18

work page doi:10.18653/v1/2025.gebnlp-1.18 2025
[46]

Obscured but Not Erased: Evaluating Nationality Bias in LLMs via Name-Based Bias Benchmarks

Obscured but Not Erased: Evaluating Nationality Bias in LLMs via Name-Based Bias Benchmarks. 2025.

2025
[47]

Evaluating and Mitigating Discrimination in Language Model Decisions

Evaluating and Mitigating Discrimination in Language Model Decisions. 2023.

2023
[48]

CrowS-Pairs: A Challenge Dataset for Measuring Social Biases in Masked Language Models

Nangia, Nikita and Vania, Clara and Bhalerao, Rasika and Bowman, Samuel R. CrowS-Pairs: A Challenge Dataset for Measuring Social Biases in Masked Language Models. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2020. doi:10.18653/v1/2020.emnlp-main.154

work page doi:10.18653/v1/2020.emnlp-main.154 2020
[49]

StereoSet: Measuring stereotypical bias in pretrained language models

Nadeem, Moin and Bethke, Anna and Reddy, Siva. StereoSet: Measuring stereotypical bias in pretrained language models. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). 2021. doi:10.18653/v1/2021.acl-long.416

work page doi:10.18653/v1/2021.acl-long.416 2021
[50]

Five sources of bias in natural language processing

Hovy, Dirk and Prabhumoye, Shrimai. Five sources of bias in natural language processing. Language and Linguistics Compass. doi:10.1111/lnc3.12432. https://compass.onlinelibrary.wiley.com/doi/pdf/10.1111/lnc3.12432

work page doi:10.1111/lnc3.12432
[51]

Identifying and Mitigating Gender Cues in Academic Recommendation Letters: An Interpretability Case Study

Identifying and Mitigating Gender Cues in Academic Recommendation Letters: An Interpretability Case Study. 2026.

2026
[52]

Calibrated? Not for Everyone: How Sexual Orientation and Religious Markers Distort LLM Accuracy and Confidence in Medical QA

Calibrated? Not for Everyone: How Sexual Orientation and Religious Markers Distort LLM Accuracy and Confidence in Medical QA. 2026.

2026
[53]

LLMs Can Infer Political Alignment from Online Conversations

LLMs Can Infer Political Alignment from Online Conversations. 2026.

2026
[54]

Large-scale online deanonymization with LLMs

Large-scale online deanonymization with LLMs. 2026.

2026
[55]

NESSiE: The Necessary Safety Benchmark -- Identifying Errors that should not Exist

NESSiE: The Necessary Safety Benchmark -- Identifying Errors that should not Exist. 2026.

2026
[56]

Equal Access, Unequal Interaction: A Counterfactual Audit of LLM Fairness

Equal Access, Unequal Interaction: A Counterfactual Audit of LLM Fairness. 2026.

2026
[57]

Prioritize Economy or Climate Action? Investigating ChatGPT Response Differences Based on Inferred Political Orientation

Prioritize Economy or Climate Action? Investigating ChatGPT Response Differences Based on Inferred Political Orientation. 2025.

2025
[58]

Implicit Personalization in Language Models: A Systematic Study

Jin, Zhijing and Heil, Nils and Liu, Jiarui and Dhuliawala, Shehzaad and Qi, Yahang and Sch. Implicit Personalization in Language Models: A Systematic Study. Findings of the Association for Computational Linguistics: EMNLP 2024. 2024. doi:10.18653/v1/2024.findings-emnlp.717

work page doi:10.18653/v1/2024.findings-emnlp.717 2024
[59]

Does the Prompt-based Large Language Model Recognize Students' Demographics and Introduce Bias in Essay Scoring?

Yang, Kaixun and Rakovi\'. Does the Prompt-based Large Language Model Recognize Students' Demographics and Introduce Bias in Essay Scoring? Artificial Intelligence in Education: 26th International Conference, AIED 2025, Palermo, Italy, July 22–26, 2025, Proceedings, Part II. doi:10.1007/978-3-031-98417-4_6

work page doi:10.1007/978-3-031-98417-4_6 2025
[60]

SocioProbe: What, When, and Where Language Models Learn about Sociodemographics

Lauscher, Anne and Bianchi, Federico and Bowman, Samuel R. and Hovy, Dirk. SocioProbe: What, When, and Where Language Models Learn about Sociodemographics. Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. 2022. doi:10.18653/v1/2022.emnlp-main.539

work page doi:10.18653/v1/2022.emnlp-main.539 2022
[61]

Context Engineering for Trustworthiness: Rescorla Wagner Steering Under Mixed and Inappropriate Contexts

Context Engineering for Trustworthiness: Rescorla Wagner Steering Under Mixed and Inappropriate Contexts. 2025.

2025
[62]

DAIQ: Auditing Demographic Attribute Inference from Question in LLMs

DAIQ: Auditing Demographic Attribute Inference from Question in LLMs. 2026.

2026
[63]

Accumulating Context Changes the Beliefs of Language Models

Accumulating Context Changes the Beliefs of Language Models. 2025.

2025
[64]

Beyond Memorization: Violating Privacy via Inference with Large Language Models

Beyond Memorization: Violating Privacy via Inference with Large Language Models. The Twelfth International Conference on Learning Representations.
[65]

LLMs Get Lost In Multi-Turn Conversation

Laban, Philippe and Hayashi, Hiroaki and Zhou, Yingbo and Neville, Jennifer. LLMs Get Lost In Multi-Turn Conversation. 2026.

2026
[66]

Designing a Dashboard for Transparency and Control of Conversational AI

Designing a Dashboard for Transparency and Control of Conversational AI. 2024.

2024
[67]

Reading Between the Prompts: How Stereotypes Shape LLM's Implicit Personalization

Neplenbroek, Vera and Bisazza, Arianna and Fernández, Raquel. Reading Between the Prompts: How Stereotypes Shape LLM's Implicit Personalization. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. 2025. doi:10.18653/v1/2025.emnlp-main.1029

work page doi:10.18653/v1/2025.emnlp-main.1029 2025
[68]

From Chatbots to Confidants: A Cross-Cultural Study of LLM Adoption for Emotional Support

From Chatbots to Confidants: A Cross-Cultural Study of LLM Adoption for Emotional Support. 2026.

2026
[69]

The Llama 3 Herd of Models

The Llama 3 Herd of Models. 2024.

2024
[70]

PersonaMem-v2: Towards Personalized Intelligence via Learning Implicit User Personas and Agentic Memory

PersonaMem-v2: Towards Personalized Intelligence via Learning Implicit User Personas and Agentic Memory. 2025.

2025
[71]

Language Models Change Facts Based on the Way You Talk

Language Models Change Facts Based on the Way You Talk. 2025.

2025
[72]

Hannah Rose Kirk and Alexander Whitefield and Paul R. The. The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track.
[73]

Different Demographic Cues Yield Inconsistent Conclusions About LLM Personalization and Bias

Different Demographic Cues Yield Inconsistent Conclusions About LLM Personalization and Bias. 2026.

2026
[74]

Cultivating Pluralism In Algorithmic Monoculture: The Community Alignment Dataset

Cultivating Pluralism In Algorithmic Monoculture: The Community Alignment Dataset. The Fourteenth International Conference on Learning Representations.

[1] [1]

The AI Gap: How Socioeconomic Status Affects Language Technology Interactions

Bassignana, Elisa and Curry, Amanda Cercas and Hovy, Dirk. The AI Gap: How Socioeconomic Status Affects Language Technology Interactions. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2025. doi:10.18653/v1/2025.acl-long.914

work page doi:10.18653/v1/2025.acl-long.914 2025

[2] [2]

Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

Who's Asking? Investigating Bias Through the Lens of Disability-Framed Queries in LLMs , author=. Proceedings of the IEEE/CVF International Conference on Computer Vision , pages=

[3] [3]

2024 , url=

Elinor Poole-Dayan and Deb Roy and Jad Kabbara , booktitle=. 2024 , url=

2024

[4] [4]

Classist Tools: Social Class Correlates with Performance in NLP

Cercas Curry, Amanda and Attanasio, Giuseppe and Talat, Zeerak and Hovy, Dirk. Classist Tools: Social Class Correlates with Performance in NLP. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2024. doi:10.18653/v1/2024.acl-long.682

work page doi:10.18653/v1/2024.acl-long.682 2024

[5] [5]

Native Design Bias: Studying the Impact of E nglish Nativeness on Language Model Performance

Reusens, Manon and Borchert, Philipp and De Weerdt, Jochen and Baesens, Bart. Native Design Bias: Studying the Impact of E nglish Nativeness on Language Model Performance. Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics...

work page doi:10.18653/v1/2025.findings-ijcnlp.73 2025

[6] [6]

and Narayanan, Arvind , year=

Aylin Caliskan and Joanna J. Bryson and Arvind Narayanan , title =. Science , volume =. 2017 , doi =. https://www.science.org/doi/pdf/10.1126/science.aal4230 , abstract =

work page doi:10.1126/science.aal4230 2017

[7] [7]

``You Gotta be a Doctor, Lin'' : An Investigation of Name-Based Bias of Large Language Models in Employment Recommendations

Nghiem, Huy and Prindle, John and Zhao, Jieyu and Daum \'e Iii, Hal. ``You Gotta be a Doctor, Lin'' : An Investigation of Name-Based Bias of Large Language Models in Employment Recommendations. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 2024. doi:10.18653/v1/2024.emnlp-main.413

work page doi:10.18653/v1/2024.emnlp-main.413 2024

[8] [8]

The Impact of Name Age Perception on Job Recommendations in LLM s

Kamruzzaman, Mahammed and Kim, Gene Louis. The Impact of Name Age Perception on Job Recommendations in LLM s. Findings of the Association for Computational Linguistics: ACL 2025. 2025. doi:10.18653/v1/2025.findings-acl.778

work page doi:10.18653/v1/2025.findings-acl.778 2025

[9] [9]

Presumed Cultural Identity: How Names Shape LLM Responses

Pawar, Siddhesh Milind and Arora, Arnav and Kaffee, Lucie-Aim \'e e and Augenstein, Isabelle. Presumed Cultural Identity: How Names Shape LLM Responses. Findings of the Association for Computational Linguistics: EMNLP 2025. 2025. doi:10.18653/v1/2025.findings-emnlp.1207

work page doi:10.18653/v1/2025.findings-emnlp.1207 2025

[10] [10]

2026 , eprint=

One Persona, Many Cues, Different Results: How Sociodemographic Cues Impact LLM Personalization , author=. 2026 , eprint=

2026

[11] [11]

Nature , volume=

Large language models encode clinical knowledge , author=. Nature , volume=. 2023 , publisher=

2023

[12] [12]

Belinkov

Belinkov, Yonatan , title =. Computational Linguistics , volume =. 2022 , month =. doi:10.1162/coli_a_00422 , url =

work page internal anchor Pith review doi:10.1162/coli_a_00422 2022

[13] [13]

Deep Dominance - How to Properly Compare Deep Neural Models , booktitle =

Rotem Dror and Segev Shlomov and Roi Reichart , editor =. Deep Dominance - How to Properly Compare Deep Neural Models , booktitle =. 2019 , url =. doi:10.18653/v1/p19-1266 , timestamp =

work page doi:10.18653/v1/p19-1266 2019

[14] [14]

Behavior research methods , volume=

Concreteness ratings for 40 thousand generally known English word lemmas , author=. Behavior research methods , volume=. 2014 , publisher=

2014

[15] [15]

2019 , journal=

Language Models are Unsupervised Multitask Learners , author=. 2019 , journal=

2019

[16] [16]

T weet E val: Unified Benchmark and Comparative Evaluation for Tweet Classification

Barbieri, Francesco and Camacho-Collados, Jose and Espinosa Anke, Luis and Neves, Leonardo. T weet E val: Unified Benchmark and Comparative Evaluation for Tweet Classification. Findings of the Association for Computational Linguistics: EMNLP 2020. 2020. doi:10.18653/v1/2020.findings-emnlp.148

work page doi:10.18653/v1/2020.findings-emnlp.148 2020

[17] [17]

Austin, TX: University of Texas at Austin , volume=

The development and psychometric properties of LIWC-22 , author=. Austin, TX: University of Texas at Austin , volume=

[18] [18]

University of Chicago Legal Forum , author =

Demarginalizing the. University of Chicago Legal Forum , author =. 1989 , pages =

1989

[19] [19]

, author=

A new readability yardstick. , author=. Journal of applied psychology , volume=. 1948 , publisher=

1948

[20] [20]

doi:10.5281/zenodo.10009823 , url =

Ines Montani and Matthew Honnibal and Matthew Honnibal and Adriane Boyd and Sofie Van Landeghem and Henning Peters , title =. doi:10.5281/zenodo.10009823 , url =

work page doi:10.5281/zenodo.10009823

[21] [21]

Language and Social Class , urldate =

Basil Bernstein , journal =. Language and Social Class , urldate =

[22] [22]

The Fourteenth International Conference on Learning Representations , year=

Truthful or Fabricated? Using Causal Attribution to Mitigate Reward Hacking in Explanations , author=. The Fourteenth International Conference on Learning Representations , year=

[23] [23]

Thirty-seventh Conference on Neural Information Processing Systems , year=

Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting , author=. Thirty-seventh Conference on Neural Information Processing Systems , year=

[24] [24]

Psychology of women quarterly , volume=

The gender stereotyping of emotions , author=. Psychology of women quarterly , volume=. 2000 , publisher=

2000

[25] [25]

Angry Men, Sad Women: Large Language Models Reflect Gendered Stereotypes in Emotion Attribution

Plaza-del-Arco, Flor Miriam and Cercas Curry, Amanda and Curry, Alba and Abercrombie, Gavin and Hovy, Dirk. Angry Men, Sad Women: Large Language Models Reflect Gendered Stereotypes in Emotion Attribution. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2024. doi:10.18653/v1/2024.acl-long.415

work page doi:10.18653/v1/2024.acl-long.415 2024

[26] [26]

Divine LL a MA s: Bias, Stereotypes, Stigmatization, and Emotion Representation of Religion in Large Language Models

Plaza-del-Arco, Flor Miriam and Curry, Amanda Cercas and Paoli, Susanna and Cercas Curry, Alba and Hovy, Dirk. Divine LL a MA s: Bias, Stereotypes, Stigmatization, and Emotion Representation of Religion in Large Language Models. Findings of the Association for Computational Linguistics: EMNLP 2024. 2024. doi:10.18653/v1/2024.findings-emnlp.251

work page doi:10.18653/v1/2024.findings-emnlp.251 2024

[27] [27]

User-Level Race and Ethnicity Predictors from T witter Text

Preo t iuc-Pietro, Daniel and Ungar, Lyle. User-Level Race and Ethnicity Predictors from T witter Text. Proceedings of the 27th International Conference on Computational Linguistics. 2018

2018

[28] [28]

Newman and Carla J

Matthew L. Newman and Carla J. Groom and Lori D. Handelman and James W. Pennebaker , title =. Discourse Processes , volume =. 2008 , publisher =. doi:10.1080/01638530802073712 , URL =

work page doi:10.1080/01638530802073712 2008

[29] [29]

2026 , eprint=

Kimi K2: Open Agentic Intelligence , author=. 2026 , eprint=

2026

[30] [30]

2026 , eprint=

The Need for a Socially-Grounded Persona Framework for User Simulation , author=. 2026 , eprint=

2026

[31] [31]

Transformers: State-of-the-Art Natural Language Processing

Wolf, Thomas and Debut, Lysandre and Sanh, Victor and Chaumond, Julien and Delangue, Clement and Moi, Anthony and Cistac, Pierric and Rault, Tim and Louf, Remi and Funtowicz, Morgan and Davison, Joe and Shleifer, Sam and von Platen, Patrick and Ma, Clara and Jernite, Yacine and Plu, Julien and Xu, Canwen and Le Scao, Teven and Gugger, Sylvain and Drame, M...

work page doi:10.18653/v1/2020.emnlp-demos.6 2020

[32] [32]

The Pluralistic Moral Gap: Understanding Moral Judgment and Value Differences between Humans and Large Language Models

Russo, Giuseppe and Nozza, Debora and R. The Pluralistic Moral Gap: Understanding Moral Judgment and Value Differences between Humans and Large Language Models. Proceedings of the 19th Conference of the E uropean Chapter of the A ssociation for C omputational L inguistics (Volume 1: Long Papers). 2026. doi:10.18653/v1/2026.eacl-long.305

work page doi:10.18653/v1/2026.eacl-long.305 2026

[33] [33]

2026 , eprint=

Can Fairness Be Prompted? Prompt-Based Debiasing Strategies in High-Stakes Recommendations , author=. 2026 , eprint=

2026

[34] [34]

Measuring Mechanistic Independence: Can Bias Be Removed Without Erasing Demographics?

Shan, Zhengyang and Mueller, Aaron. Measuring Mechanistic Independence: Can Bias Be Removed Without Erasing Demographics?. Proceedings of the 19th Conference of the E uropean Chapter of the A ssociation for C omputational L inguistics (Volume 1: Long Papers). 2026. doi:10.18653/v1/2026.eacl-long.199

work page doi:10.18653/v1/2026.eacl-long.199 2026

[35] [35]

2026 , eprint=

Old Habits Die Hard: How Conversational History Geometrically Traps LLMs , author=. 2026 , eprint=

2026

[36] [36]

The Mathematics of the Uncertain , pages=

An optimal transportation approach for assessing almost stochastic order , author=. The Mathematics of the Uncertain , pages=. 2018 , publisher=

2018

[37] [37]

arXiv preprint arXiv:2204.06815 , year=

deep-significance-Easy and Meaningful Statistical Significance Testing in the Age of Neural Networks , author=. arXiv preprint arXiv:2204.06815 , year=

arXiv

[38] [38]

ELLA : Empowering LLM s for Interpretable, Accurate and Informative Legal Advice

Hu, Yutong and Luo, Kangcheng and Feng, Yansong. ELLA : Empowering LLM s for Interpretable, Accurate and Informative Legal Advice. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations). 2024. doi:10.18653/v1/2024.acl-demos.36

work page doi:10.18653/v1/2024.acl-demos.36 2024

[39] [39]

J ob F air: A Framework for Benchmarking Gender Hiring Bias in Large Language Models

Wang, Ze and Wu, Zekun and Guan, Xin and Thaler, Michael and Koshiyama, Adriano and Lu, Skylar and Beepath, Sachin and Ertekin, Ediz and Perez-Ortiz, Maria. J ob F air: A Framework for Benchmarking Gender Hiring Bias in Large Language Models. Findings of the Association for Computational Linguistics: EMNLP 2024. 2024. doi:10.18653/v1/2024.findings-emnlp.184

work page doi:10.18653/v1/2024.findings-emnlp.184 2024

[40] [40]

Stereotype or Personalization? User Identity Biases Chatbot Recommendations

Kantharuban, Anjali and Milbauer, Jeremiah and Sap, Maarten and Strubell, Emma and Neubig, Graham. Stereotype or Personalization? User Identity Biases Chatbot Recommendations. Findings of the Association for Computational Linguistics: ACL 2025. 2025. doi:10.18653/v1/2025.findings-acl.1254

work page doi:10.18653/v1/2025.findings-acl.1254 2025

[41] [41]

No for Some, Yes for Others: Persona Prompts and Other Sources of False Refusal in Language Models

Plaza-del-Arco, Flor Miriam and R. No for Some, Yes for Others: Persona Prompts and Other Sources of False Refusal in Language Models. Proceedings of the 9th Widening NLP Workshop. 2025. doi:10.18653/v1/2025.winlp-main.39

work page doi:10.18653/v1/2025.winlp-main.39 2025

[42] [42]

Nature , volume=

AI generates covertly racist decisions about people based on their dialect , author=. Nature , volume=. 2024 , publisher=

2024

[43] [43]

Large Language Models Discriminate Against Speakers of G erman Dialects

Bui, Minh Duc and Holtermann, Carolin and Hofmann, Valentin and Lauscher, Anne and von der Wense, Katharina. Large Language Models Discriminate Against Speakers of G erman Dialects. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. 2025. doi:10.18653/v1/2025.emnlp-main.415

work page doi:10.18653/v1/2025.emnlp-main.415 2025

[44] [44]

Linguistic Bias in C hat GPT : Language Models Reinforce Dialect Discrimination

Fleisig, Eve and Smith, Genevieve and Bossi, Madeline and Rustagi, Ishita and Yin, Xavier and Klein, Dan. Linguistic Bias in C hat GPT : Language Models Reinforce Dialect Discrimination. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 2024. doi:10.18653/v1/2024.emnlp-main.750

work page doi:10.18653/v1/2024.emnlp-main.750 2024

[45] [45]

Colombian Waitresses y Jueces canadienses: Gender and Country Biases in Occupation Recommendations from LLM s

Rodr \'i guez, Elisa Forcada and Perez-de-Vinaspre, Olatz and Campos, Jon Ander and Klakow, Dietrich and Gautam, Vagrant. Colombian Waitresses y Jueces canadienses: Gender and Country Biases in Occupation Recommendations from LLM s. Proceedings of the 6th Workshop on Gender Bias in Natural Language Processing (GeBNLP). 2025. doi:10.18653/v1/2025.gebnlp-1.18

work page doi:10.18653/v1/2025.gebnlp-1.18 2025

[46] [46]

2025 , eprint=

Obscured but Not Erased: Evaluating Nationality Bias in LLMs via Name-Based Bias Benchmarks , author=. 2025 , eprint=

2025

[47] [47]

2023 , eprint=

Evaluating and Mitigating Discrimination in Language Model Decisions , author=. 2023 , eprint=

2023

[48] [48]

C row S -Pairs: A Challenge Dataset for Measuring Social Biases in Masked Language Models

Nangia, Nikita and Vania, Clara and Bhalerao, Rasika and Bowman, Samuel R. C row S -Pairs: A Challenge Dataset for Measuring Social Biases in Masked Language Models. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2020. doi:10.18653/v1/2020.emnlp-main.154

work page doi:10.18653/v1/2020.emnlp-main.154 2020

[49] [49]

S tereo S et: Measuring stereotypical bias in pretrained language models

Nadeem, Moin and Bethke, Anna and Reddy, Siva. S tereo S et: Measuring stereotypical bias in pretrained language models. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). 2021. doi:10.18653/v1/2021.acl-long.416

work page doi:10.18653/v1/2021.acl-long.416 2021

[50] [50]

Language and Linguistics Compass , volume =

Hovy, Dirk and Prabhumoye, Shrimai , title =. Language and Linguistics Compass , volume =. doi:https://doi.org/10.1111/lnc3.12432 , url =. https://compass.onlinelibrary.wiley.com/doi/pdf/10.1111/lnc3.12432 , abstract =

work page doi:10.1111/lnc3.12432

[51] [51]

2026 , eprint=

Identifying and Mitigating Gender Cues in Academic Recommendation Letters: An Interpretability Case Study , author=. 2026 , eprint=

2026

[52] Calibrated? Not for Everyone: How Sexual Orientation and Religious Markers Distort LLM Accuracy and Confidence in Medical QA. 2026.

[53] LLMs Can Infer Political Alignment from Online Conversations. 2026.

[54] Large-scale online deanonymization with LLMs. 2026.

[55] NESSiE: The Necessary Safety Benchmark -- Identifying Errors that should not Exist. 2026.

[56] Equal Access, Unequal Interaction: A Counterfactual Audit of LLM Fairness. 2026.

[57] Prioritize Economy or Climate Action? Investigating ChatGPT Response Differences Based on Inferred Political Orientation. 2025.

[58] Jin, Zhijing and Heil, Nils and Liu, Jiarui and Dhuliawala, Shehzaad and Qi, Yahang and Sch. Implicit Personalization in Language Models: A Systematic Study. Findings of the Association for Computational Linguistics: EMNLP 2024. 2024. doi:10.18653/v1/2024.findings-emnlp.717

[59] Yang, Kaixun and Rakovi\'. Does the Prompt-based Large Language Model Recognize Students' Demographics and Introduce Bias in Essay Scoring? Artificial Intelligence in Education: 26th International Conference, AIED 2025, Palermo, Italy, July 22–26, 2025, Proceedings, Part II. 2025. doi:10.1007/978-3-031-98417-4_6

[60] Lauscher, Anne and Bianchi, Federico and Bowman, Samuel R. and Hovy, Dirk. SocioProbe: What, When, and Where Language Models Learn about Sociodemographics. Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. 2022. doi:10.18653/v1/2022.emnlp-main.539

[61] Context Engineering for Trustworthiness: Rescorla Wagner Steering Under Mixed and Inappropriate Contexts. 2025.

[62] DAIQ: Auditing Demographic Attribute Inference from Question in LLMs. 2026.

[63] Accumulating Context Changes the Beliefs of Language Models. 2025.

[64] Beyond Memorization: Violating Privacy via Inference with Large Language Models. The Twelfth International Conference on Learning Representations.

[65] Philippe Laban and Hiroaki Hayashi and Yingbo Zhou and Jennifer Neville. 2026.

[66] Designing a Dashboard for Transparency and Control of Conversational AI. 2024.

[67] Neplenbroek, Vera and Bisazza, Arianna and Fernández, Raquel. Reading Between the Prompts: How Stereotypes Shape LLM's Implicit Personalization. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. 2025. doi:10.18653/v1/2025.emnlp-main.1029

[68] From Chatbots to Confidants: A Cross-Cultural Study of LLM Adoption for Emotional Support. 2026.

[69] The Llama 3 Herd of Models. 2024.

[70] PersonaMem-v2: Towards Personalized Intelligence via Learning Implicit User Personas and Agentic Memory. 2025.

[71] Language Models Change Facts Based on the Way You Talk. 2025.

[72] Hannah Rose Kirk and Alexander Whitefield and Paul R. The. The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track.

[73] Different Demographic Cues Yield Inconsistent Conclusions About LLM Personalization and Bias. 2026.

[74] Cultivating Pluralism In Algorithmic Monoculture: The Community Alignment Dataset. The Fourteenth International Conference on Learning Representations.