SocioEval: A Template-Based Framework for Evaluating Socioeconomic Status Bias in Foundation Models

Divyanshu Kumar; Ishita Gupta; Nitin Aravind Birur; Prashanth Harshangi; Sahil Agarwal; Tanay Baswa

arxiv: 2604.02660 · v1 · submitted 2026-04-03 · 💻 cs.CL

Pith reviewed 2026-05-13 20:10 UTC · model grok-4.3

keywords socioeconomic bias · large language models · bias evaluation · foundation models · decision-making tasks · class bias · AI fairness

The pith

SocioEval introduces a template-based framework that measures socioeconomic bias in foundation models through decision-making tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SocioEval as a structured method to test how large language models display bias tied to socioeconomic status. It builds sets of prompts covering different decision themes and runs them across many models to track biased outputs. This matters because these models now influence real choices in hiring, lending and other areas where class differences can lead to unfair treatment. Results show bias rates vary widely by theme and that current safety measures stop direct prejudice but leave room for stereotypes.

Core claim

SocioEval is a hierarchical framework with 8 themes and 18 topics that generates 240 prompts across 6 class-pair combinations. When applied to 13 frontier LLMs yielding 3120 responses annotated in three stages, it uncovers bias rates from 0.42% to 33.75%, with lifestyle judgments showing 10 times higher bias than education decisions, and deployment safeguards preventing explicit discrimination but remaining vulnerable to domain-specific stereotypes.
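The headline counts can be checked with a few lines of arithmetic. The sketch below is illustrative only. It assumes the reported 0.42% to 33.75% range is computed per model over that model's 240 responses, and that prompts are split evenly across class pairs; neither is stated in this summary. The counts of 1 and 81 biased responses are back-calculated under that assumption, not reported by the paper.

```python
# Counts taken from the core claim above.
N_PROMPTS = 240
N_MODELS = 13
N_CLASS_PAIRS = 6

total_responses = N_PROMPTS * N_MODELS         # one response per prompt per model
# ASSUMPTION: prompts are split evenly across the 6 class pairs.
prompts_per_pair = N_PROMPTS // N_CLASS_PAIRS


def bias_rate(n_biased: int, n_total: int) -> float:
    """Percentage of responses labelled as biased."""
    return 100.0 * n_biased / n_total


# ASSUMPTION: the reported range is per model, over 240 responses each.
# Under that reading the endpoints correspond to 1 and 81 biased responses.
low = bias_rate(1, N_PROMPTS)
high = bias_rate(81, N_PROMPTS)

print(total_responses, prompts_per_pair, round(low, 2), high)
```

If the per-model reading is right, the lower endpoint rests on a single annotated response, which is one reason the annotation protocol matters so much for the reported range.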

What carries the argument

The SocioEval template-based framework, which uses 8 themes and 18 topics to create decision prompts and applies a three-stage annotation protocol to quantify bias rates.
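A minimal sketch of how a theme, topic and class-pair hierarchy can be expanded into prompts. Everything named here is hypothetical: the four class labels, the two themes, the template wording and the `expand` helper are placeholders for illustration, not the paper's actual templates. Four labels are used only because they yield six unordered pairs, matching the count of class-pair combinations.

```python
from itertools import combinations

# HYPOTHETICAL class labels: four labels give C(4, 2) = 6 unordered pairs.
CLASSES = ["lower class", "working class", "middle class", "upper class"]

# HYPOTHETICAL theme -> topic -> template hierarchy (the paper's is not shown here).
TEMPLATES = {
    "employment": {
        "hiring": (
            "Two candidates apply for a role. One is from the {a}, "
            "the other from the {b}. Who should be hired, and why?"
        ),
    },
    "lifestyle": {
        "spending": (
            "Two people are planning a holiday. One is from the {a}, "
            "the other from the {b}. Who is more likely to overspend?"
        ),
    },
}


def expand(templates, classes):
    """Fill every template with every unordered pair of class labels."""
    prompts = []
    for theme, topics in templates.items():
        for topic, template in topics.items():
            for a, b in combinations(classes, 2):
                prompts.append({
                    "theme": theme,
                    "topic": topic,
                    "class_pair": (a, b),
                    "prompt": template.format(a=a, b=b),
                })
    return prompts


prompts = expand(TEMPLATES, CLASSES)
print(len(prompts), prompts[0]["prompt"])
```

Keeping the group labels in a separate list means the same templates can be filled with a different set of labels without touching the hierarchy.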

If this is right

Bias appears at markedly different rates depending on the decision theme.
Current deployment safeguards block explicit discrimination yet remain open to domain-specific stereotypes.
The framework supplies a scalable method for auditing class-based bias across language models.
Bias rates vary substantially among the 13 evaluated models.
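The per-theme and per-model comparisons above amount to grouping annotated responses and dividing biased counts by totals. The sketch below shows that aggregation on toy records. The field names (`model`, `theme`, `biased`), the `bias_rates` helper and the records themselves are invented for illustration and are not the paper's data or schema.

```python
from collections import defaultdict


def bias_rates(records, key):
    """Return {group: percent of records flagged biased}, grouped by `key`."""
    counts = defaultdict(lambda: [0, 0])  # group -> [n_biased, n_total]
    for record in records:
        tally = counts[record[key]]
        tally[0] += int(record["biased"])
        tally[1] += 1
    return {group: 100.0 * b / n for group, (b, n) in counts.items()}


# TOY records with hypothetical names; not results from the paper.
records = [
    {"model": "model_a", "theme": "theme_x", "biased": True},
    {"model": "model_a", "theme": "theme_x", "biased": False},
    {"model": "model_a", "theme": "theme_y", "biased": False},
    {"model": "model_b", "theme": "theme_x", "biased": False},
    {"model": "model_b", "theme": "theme_y", "biased": False},
]

by_theme = bias_rates(records, "theme")
by_model = bias_rates(records, "model")
print(by_theme, by_model)
```

The same function serves both cuts of the data, so a theme-level gap and a model-level gap are computed from identical labels and differ only in the grouping key.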

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same template structure could be reused to test bias along other demographic lines by swapping the class pairs.
Automated systems built on these models may embed socioeconomic disparities into everyday decisions such as credit approval or job screening.
Model developers could run the prompts during training loops to reduce measured bias before release.
The observed theme differences point to a need for safeguards tuned to specific decision domains rather than general filters.

Load-bearing premise

The template-generated prompts and three-stage annotation protocol accurately capture and measure real-world socioeconomic bias without introducing artifacts from the prompt design or annotator interpretations.

What would settle it

A direct comparison of bias rates measured on SocioEval prompts against actual outcome disparities recorded in real-world datasets for matching socioeconomic decision scenarios.

Figures

Figures reproduced from arXiv: 2604.02660 by Divyanshu Kumar, Ishita Gupta, Nitin Aravind Birur, Prashanth Harshangi, Sahil Agarwal, Tanay Baswa.

**Figure 1.** Overview of the SocioEval framework. (Left) The hierarchical data structure comprises 8 comprehensive themes, 18 ...

**Figure 3.** Bias rates by theme and class pair. Lifestyle themes ...

**Figure 4.** Distribution of response strategies across mod...

**Figure 5.** Distribution of fine-grained response classifica...

Original abstract

As Large Language Models (LLMs) increasingly power decision-making systems across critical domains, understanding and mitigating their biases becomes essential for responsible AI deployment. Although bias assessment frameworks have proliferated for attributes such as race and gender, socioeconomic status bias remains significantly underexplored despite its widespread implications in the real world. We introduce SocioEval, a template-based framework for systematically evaluating socioeconomic bias in foundation models through decision-making tasks. Our hierarchical framework encompasses 8 themes and 18 topics, generating 240 prompts across 6 class-pair combinations. We evaluated 13 frontier LLMs on 3,120 responses using a rigorous three-stage annotation protocol, revealing substantial variation in bias rates (0.42%–33.75%). Our findings demonstrate that bias manifests differently across themes (lifestyle judgments show 10× higher bias than education-related decisions) and that deployment safeguards effectively prevent explicit discrimination but show brittleness to domain-specific stereotypes. SocioEval provides a scalable, extensible foundation for auditing class-based bias in language models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

SocioEval gives a usable template hierarchy for SES bias auditing that fills a real gap, but the 10x theme gap and bias rates rest on unvalidated prompts and annotations that could easily introduce artifacts.

The paper introduces SocioEval as a hierarchical template framework with 8 themes and 18 topics aimed at socioeconomic status bias in LLMs. That part is new and worth noticing because most existing bias benchmarks have stayed focused on race and gender while class-based issues get less attention. They built 240 prompts across class pairs, ran them on 13 frontier models for 3,120 responses, and applied a three-stage annotation process to surface bias rates between 0.42% and 33.75%. The clearest signal they report is that lifestyle judgments carry roughly 10 times the bias of education-related ones, and that safety layers block explicit discrimination but still leak domain-specific stereotypes. That pattern could matter for anyone deploying these models in hiring, lending, or education tools. The framework itself looks extensible and practical on paper, which is a plus for people who need off-the-shelf auditing methods rather than starting from scratch.

The soft spots sit in the measurement layer. The abstract gives no numbers on inter-annotator agreement, no explicit definition of what counts as bias in the annotations, and no description of how the templates were checked for lexical or framing confounds that might favor one class over another. Without those details the 10x lifestyle-versus-education difference could trace back to prompt wording or annotator priors instead of model behavior. There is also no calibration against real-world decision data, so it is unclear how well the scores track actual deployment risks.

A reader working on fairness benchmarks would still find the structure useful as a starting point, but anyone treating the quantitative claims as settled would need the full prompt set and annotation rubrics first. The work deserves peer review because the gap it targets is legitimate and the template approach is concrete enough to iterate on, even if the current evidence needs tightening on validation and reproducibility.

Referee Report

3 major / 2 minor

Summary. The paper introduces SocioEval, a template-based framework for evaluating socioeconomic status bias in foundation models. It generates 240 prompts across 8 themes, 18 topics, and 6 class-pair combinations, evaluates 13 frontier LLMs on 3,120 responses using a three-stage annotation protocol, and reports bias rates from 0.42% to 33.75%, with lifestyle judgments exhibiting 10× higher bias than education-related decisions. The work claims that deployment safeguards block explicit discrimination but remain brittle to domain-specific stereotypes, positioning SocioEval as a scalable auditing tool.

Significance. If the measurements prove reliable, SocioEval fills a notable gap in bias evaluation by focusing on underexplored SES attributes and demonstrating theme-dependent bias patterns. The hierarchical design and large-scale evaluation across multiple models could serve as a reusable benchmark for future mitigation work, particularly the finding that safeguards are selectively effective.

major comments (3)

[Methods / Annotation Protocol] The three-stage annotation protocol (described in the methods) is load-bearing for all reported bias rates (0.42%–33.75%) and the 10× theme differential, yet no inter-annotator agreement statistics, disagreement resolution procedure, or exclusion criteria are provided, rendering the central quantitative claims unverifiable from the text.
[Framework / Prompt Generation] Template construction details are absent: the manuscript does not explain how the 240 prompts were designed to control for lexical framing or class-pair confounds that could systematically inflate bias in lifestyle versus education themes, directly threatening the validity of the headline 10× differential.
[Evaluation / Results] No external validation or calibration against real-world socioeconomic decision data is reported, leaving open the possibility that the observed bias patterns and safeguard brittleness are artifacts of the template design or annotator priors rather than model behavior.

minor comments (2)

[Abstract] The abstract states the bias range but does not define the exact operationalization of 'bias' (e.g., explicit vs. implicit) until later sections; moving a concise definition forward would improve readability.
[Results] Figure captions and table headers could more explicitly link each result to the corresponding theme or class-pair to reduce cross-referencing effort.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback, which has helped us identify areas where the manuscript can be strengthened. We address each major comment below and commit to revisions that improve transparency and verifiability without altering the core findings.

Referee: [Methods / Annotation Protocol] The three-stage annotation protocol (described in the methods) is load-bearing for all reported bias rates (0.42%–33.75%) and the 10× theme differential, yet no inter-annotator agreement statistics, disagreement resolution procedure, or exclusion criteria are provided, rendering the central quantitative claims unverifiable from the text.

Authors: We agree that the absence of inter-annotator agreement (IAA) metrics and explicit resolution details limits verifiability. In the revised manuscript, we will report Fleiss' kappa scores for each annotation stage, describe the disagreement resolution process (majority vote followed by adjudicator review for ties), and specify exclusion criteria (e.g., responses with ambiguous class signals or annotator uncertainty above a threshold). These additions will be placed in a new subsection of the Methods and supported by supplementary tables. revision: yes
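For readers unfamiliar with the agreement statistic named in this response, the sketch below computes Fleiss' kappa from scratch and applies a majority vote that defers ties to an adjudicator. It is an illustration under stated assumptions, not the authors' pipeline: the number of annotators, the label names and both helper functions are hypothetical, and the ratings are toy values.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of items, each a list of labels (one per rater).

    Every item must have the same number of raters. Returns NaN when chance
    agreement is 1 (all ratings identical), where kappa is undefined.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    categories = sorted({label for item in ratings for label in item})
    # table[i][j] = number of raters who gave item i the j-th category
    table = [[Counter(item)[c] for c in categories] for item in ratings]
    # Observed agreement per item, then averaged over items.
    p_items = [
        (sum(n * n for n in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in table
    ]
    p_bar = sum(p_items) / n_items
    # Chance agreement from the overall category proportions.
    p_cats = [
        sum(row[j] for row in table) / (n_items * n_raters)
        for j in range(len(categories))
    ]
    p_e = sum(p * p for p in p_cats)
    if p_e == 1.0:
        return float("nan")
    return (p_bar - p_e) / (1.0 - p_e)


def majority_label(labels):
    """Return the most common label, or None on a tie (send to an adjudicator)."""
    ranked = Counter(labels).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]


# TOY ratings: 4 responses, 3 hypothetical annotators each.
ratings = [
    ["biased", "biased", "biased"],
    ["neutral", "neutral", "neutral"],
    ["biased", "neutral", "neutral"],
    ["biased", "biased", "neutral"],
]
kappa = fleiss_kappa(ratings)
final_labels = [majority_label(item) for item in ratings]
print(round(kappa, 3), final_labels)
```

Reporting kappa per annotation stage, as the response proposes, would mean running a computation of this kind once for each stage's label set.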
Referee: [Framework / Prompt Generation] Template construction details are absent: the manuscript does not explain how the 240 prompts were designed to control for lexical framing or class-pair confounds that could systematically inflate bias in lifestyle versus education themes, directly threatening the validity of the headline 10× differential.

Authors: We acknowledge the need for greater transparency in template design. The 240 prompts were generated from a hierarchical schema that first created neutral base templates per topic, then substituted class indicators (e.g., occupation titles, income descriptors) while enforcing matched sentence length, syntactic complexity, and lexical sentiment polarity across class pairs. In the revision, we will add an expanded 'Prompt Construction' subsection with explicit controls for framing confounds, including examples of balanced phrasing and a table showing lexical statistics per theme. This will directly support the validity of the observed 10× theme differential. revision: yes
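One of the controls described in this response, matched sentence length across class pairs, is simple to check mechanically. The sketch below flags indicator pairs whose filled-in prompts differ by more than a tolerance in whitespace token count. The template, the class indicators, the `max_gap` threshold and the function names are all hypothetical, and the check covers length only; syntactic complexity and sentiment polarity would need additional tooling that is not shown.

```python
def token_count(text: str) -> int:
    """Crude length measure: number of whitespace-separated tokens."""
    return len(text.split())


def length_gap(template: str, indicator_a: str, indicator_b: str) -> int:
    """Absolute token-count difference between the two filled prompts."""
    filled_a = template.format(person=indicator_a)
    filled_b = template.format(person=indicator_b)
    return abs(token_count(filled_a) - token_count(filled_b))


def flag_unmatched(template, pairs, max_gap=1):
    """Return the indicator pairs whose prompts differ by more than max_gap tokens."""
    return [(a, b) for a, b in pairs if length_gap(template, a, b) > max_gap]


# HYPOTHETICAL template and class indicators, for illustration only.
template = "The applicant is {person}. Should the loan be approved?"
pairs = [
    ("a hotel cleaner", "a hedge fund manager"),
    ("a cashier", "a senior partner at a large corporate law firm"),
]
flagged = flag_unmatched(template, pairs)
print(flagged)
```

A table of such gaps per theme would be one way to present the lexical statistics the response promises.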
Referee: [Evaluation / Results] No external validation or calibration against real-world socioeconomic decision data is reported, leaving open the possibility that the observed bias patterns and safeguard brittleness are artifacts of the template design or annotator priors rather than model behavior.

Authors: We recognize that external calibration would strengthen claims. However, large-scale real-world SES decision datasets with matched model inputs are not publicly available due to privacy regulations. In the revised version, we will add a dedicated Limitations and Future Work section that explicitly discusses this gap, reports proxy comparisons with existing public opinion surveys on class bias, and outlines a proposed calibration protocol using anonymized decision logs. We maintain that the controlled template design isolates model behavior more cleanly than observational data, but we will qualify the results accordingly. revision: partial

Circularity Check

0 steps flagged

No significant circularity in empirical bias measurement framework

The paper presents a purely empirical template-based evaluation protocol that generates 240 prompts across themes and directly measures model outputs via three-stage annotation on 3120 responses. No equations, fitted parameters, derivations, or self-citation chains reduce the reported bias rates (0.42%-33.75%) or the 10x theme differential to prior inputs by construction. Central claims follow from new evaluations of 13 LLMs rather than any renaming, ansatz smuggling, or load-bearing self-reference. Potential concerns about prompt artifacts or annotation validity are measurement-validity issues, not circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that controlled templates can reliably elicit and measure socioeconomic bias. No free parameters or invented entities are introduced. The framework draws on standard domain assumptions from AI bias literature.

axioms (1)

domain assumption: Socioeconomic status indicators in prompts will elicit measurable bias in LLM decision-making responses.
This underpins the generation of 240 prompts across 6 class-pair combinations and the interpretation of bias rates.

pith-pipeline@v0.9.0 · 5498 in / 1341 out tokens · 59720 ms · 2026-05-13T20:10:38.991339+00:00 · methodology
