Bias in the Tails: How Name-conditioned Evaluative Framing in Resume Summaries Destabilizes LLM-based Hiring
Pith reviewed 2026-05-10 00:47 UTC · model grok-4.3
The pith
LLM resume summaries show name-based bias mainly in the tails of evaluative language rather than in facts.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In controlled generation of nearly one million resume summaries across four LLMs using synthetic resumes and real job postings, factual content drawn directly from the resume remains largely stable under race-gender name perturbations. Evaluative framing, however, exhibits subtle name-conditioned variation that concentrates in the tails of the distribution and appears more strongly in open-source models. When these summaries feed into a hiring simulation, the evaluative component converts potential directional harm into symmetric instability that conventional fairness audits might overlook, revealing a pathway for bias in chained LLM automation.
What carries the argument
Decomposition of each LLM resume summary into resume-grounded factual content versus name-conditioned evaluative framing, which isolates bias to distributional extremes.
If this is right
- Factual stability implies that bias arises from added judgments rather than altered resume details.
- Concentration of variation in the tails means average-case bias metrics may understate the effect.
- Symmetric instability in hiring outcomes can evade audits designed to detect consistent directional disparities.
- Open-source models display stronger tail variation, suggesting mitigation strategies may need to differ by model type.
- Chained LLM pipelines for hiring risk propagating this instability even when individual steps appear neutral.
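The point about average-case metrics can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the scores are made-up numbers standing in for a per-summary evaluative-framing score, and `quantile` and `gap_report` are hypothetical helpers, not the paper's metrics.

```python
from statistics import mean


def quantile(values, q):
    """Nearest-rank empirical quantile for q in [0, 1] (hypothetical helper)."""
    ordered = sorted(values)
    idx = min(len(ordered) - 1, max(0, round(q * (len(ordered) - 1))))
    return ordered[idx]


def gap_report(group_a, group_b, q=0.95):
    """Compare two score samples at the mean and at both tails."""
    return {
        "mean_gap": mean(group_a) - mean(group_b),
        "upper_tail_gap": quantile(group_a, q) - quantile(group_b, q),
        "lower_tail_gap": quantile(group_a, 1 - q) - quantile(group_b, 1 - q),
    }


# Made-up scores: the perturbed group has the same mean as the baseline,
# but 10% of its summaries are pushed to each extreme.
baseline = [0.0] * 100
perturbed = [-2.0] * 10 + [0.0] * 80 + [2.0] * 10

report = gap_report(perturbed, baseline)
print(report)
```

With these toy numbers the mean gap is zero while both tail gaps are nonzero, which is the situation in which a mean-only audit would report no difference.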
Where Pith is reading between the lines
- Constraining evaluative language during summary generation could reduce tail variability more effectively than post-hoc fairness checks.
- The same tail-focused pattern may appear in other LLM summary tasks such as performance reviews or application assessments.
- Human recruiters given only the summaries might produce more variable shortlists than when given full resumes.
- Applying the factual-evaluative decomposition to audit other automated decision pipelines could surface similar instabilities.
Load-bearing premise
That summaries can be reliably split into factual content and evaluative framing without the split itself creating or hiding name-conditioned differences.
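One way to probe this premise is to label the same summary sentences twice, once under each name condition, and measure how often the labels agree. The sketch below is a hedged illustration in Python: the label lists are invented, and `cohen_kappa` is a hypothetical helper implementing the standard Cohen's kappa statistic, not a check the paper is known to have run.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa between two label sequences of equal, nonzero length."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equally long")
    n = len(labels_a)
    # Observed agreement: share of positions with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each sequence's marginal label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(
        (counts_a[c] / n) * (counts_b[c] / n)
        for c in set(counts_a) | set(counts_b)
    )
    if expected == 1.0:
        # Both sequences use one identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Invented labels for four sentences under two name conditions.
under_name_1 = ["factual", "factual", "factual", "evaluative"]
under_name_2 = ["factual", "factual", "evaluative", "evaluative"]

kappa = cohen_kappa(under_name_1, under_name_2)
print(round(kappa, 3))
```

A kappa close to 1 across many name pairs would support the premise; values well below 1 would suggest the split itself reacts to the name.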
What would settle it
A replication that finds no statistically significant difference in the tails of evaluative language distributions across name perturbations, or a hiring simulation in which outcomes remain directionally biased instead of symmetrically unstable.
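A replication of the first kind needs a test that targets the tails rather than the mean. The following is a minimal sketch of one such test, a two-sided permutation test on the gap between upper quantiles, written in Python. It is an assumption about how a replication could be run, not the paper's procedure; `upper_quantile`, `tail_permutation_test`, and the sample data are hypothetical.

```python
import random


def upper_quantile(values, q=0.95):
    """Nearest-rank upper quantile (hypothetical helper)."""
    ordered = sorted(values)
    return ordered[round(q * (len(ordered) - 1))]


def tail_permutation_test(group_a, group_b, q=0.95, n_perm=500, seed=0):
    """Permutation p-value for the absolute gap in upper quantiles."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    observed = abs(upper_quantile(group_a, q) - upper_quantile(group_b, q))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_perm):
        # Reassign group membership at random, as under the null hypothesis.
        rng.shuffle(pooled)
        gap = abs(
            upper_quantile(pooled[:n_a], q) - upper_quantile(pooled[n_a:], q)
        )
        if gap >= observed:
            hits += 1
    # Add-one correction keeps the p-value strictly positive.
    return (hits + 1) / (n_perm + 1)


# Made-up scores: identical groups versus a group with a heavy upper tail.
same = tail_permutation_test([0.0] * 100, [0.0] * 100)
shifted = tail_permutation_test([0.0] * 50 + [5.0] * 50, [0.0] * 100)
print(same, shifted)
```

Under this design, a replication that found large p-values for every name pair, after correction for multiple comparisons, would count against the tail claim.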
Original abstract
Research has documented LLMs' name-based bias in hiring and salary recommendations. In this paper, we instead consider a setting where LLMs generate candidate summaries for downstream assessment. In a large-scale controlled study, we analyze nearly one million resume summaries produced by 4 models under systematic race-gender name perturbations, using synthetic resumes and real-world job postings. By decomposing each summary into resume-grounded factual content and evaluative framing, we find that factual content remains largely stable, while evaluative language exhibits subtle name-conditioned variation concentrated in the extremes of the distribution, especially in open-source models. Our hiring simulation demonstrates how evaluative summary transforms directional harm into symmetric instability that might evade conventional fairness audit, highlighting a potential pathway for LLM-to-LLM automation bias.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that LLM-generated resume summaries exhibit stable factual content across systematic race-gender name perturbations but show subtle name-conditioned variation in evaluative framing, concentrated in the distribution tails (especially for open-source models). This is demonstrated via a large controlled study of nearly one million summaries from four models on synthetic resumes and real job postings. A hiring simulation further shows that evaluative summaries convert directional harm into symmetric instability that may evade conventional fairness audits.
Significance. If the results hold, the work is significant for identifying a subtle pathway of bias propagation in LLM-to-LLM hiring pipelines, where summary generation rather than direct scoring introduces hard-to-detect instability. The scale of the experiment (nearly 1M summaries) with controlled perturbations on both synthetic and real-world elements is a clear strength, as is the focus on tail effects and the simulation of downstream consequences. These elements provide a solid empirical basis for the claims about bias evasion.
Major comments (2)
- [Methods (decomposition)] Decomposition procedure (Methods section): The central distinction that factual content remains largely stable while evaluative language varies relies on an unspecified decomposition of summaries into resume-grounded facts versus framing. No details are provided on the procedure (LLM-prompted, rule-based, or hybrid), validation against human judgments, or controls to ensure the decomposition itself is insensitive to name signals. If the decomposition step carries name bias, the reported stability of facts and tail variation in evaluations could be artifactual rather than reflective of the original summaries. This is load-bearing for the core empirical claims.
- [Simulation] Hiring simulation (Simulation section): The claim that evaluative summaries transform directional harm into symmetric instability evading audits depends on the simulation design. Without specifics on the downstream hiring model, how summaries are fed into assessment, the exact instability metric, or controls for resume realism, it is unclear whether the symmetric outcome is generalizable or an artifact of the simulation setup. This needs elaboration to support the audit-evasion implication.
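To show what an explicit instability metric could look like, here is a minimal Python sketch that separates a directional shift from symmetric instability. The metric names, the scores, and the threshold are all assumptions for illustration; the reviewed text does not specify the paper's actual metric.

```python
from statistics import mean


def instability_report(baseline_scores, perturbed_scores, threshold):
    """Summarize paired hiring scores for the same resume under two names."""
    pairs = list(zip(baseline_scores, perturbed_scores))
    deltas = [perturbed - base for base, perturbed in pairs]
    # A decision flips when only one of the two scores clears the threshold.
    flips = [(base >= threshold) != (perturbed >= threshold)
             for base, perturbed in pairs]
    return {
        # Signed average: what a directional disparity audit would look at.
        "directional_shift": mean(deltas),
        # Unsigned average: movement regardless of direction.
        "instability": mean([abs(d) for d in deltas]),
        # Share of resumes whose hire/no-hire decision changes.
        "flip_rate": sum(flips) / len(flips),
    }


# Made-up scores on a 1-10 scale with a hypothetical hire threshold of 7.
baseline = [7, 7, 7, 7]
perturbed = [8, 6, 8, 6]

result = instability_report(baseline, perturbed, threshold=7)
print(result)
```

In this toy case the signed shift is zero, yet every score moves and half of the decisions flip, which is the pattern a directional audit would not flag.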
Minor comments (2)
- The abstract states '4 models' without naming them or distinguishing open- vs. closed-source; early clarification would help contextualize the tail-effect findings.
- Statistical methods for detecting tail effects and any multiple-comparison corrections should be described more explicitly to aid reproducibility.
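As an illustration of the kind of correction the second minor comment asks about, the sketch below implements the Benjamini-Hochberg step-up procedure in Python. It is one common choice for controlling the false discovery rate; the reviewed text does not confirm which correction the authors applied, and the p-values here are invented.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans: True where the hypothesis is rejected."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k with p_(k) <= (k / m) * alpha.
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_rank = rank
    # Reject every hypothesis ranked at or below that k (step-up rule).
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_rank:
            rejected[idx] = True
    return rejected


# Invented p-values, for example one per name-pair tail comparison.
decisions = benjamini_hochberg([0.001, 0.04, 0.03, 0.5], alpha=0.05)
print(decisions)
```

Because the rule is step-up, a small p-value can be rejected only because a larger one further down the ranking cleared its threshold, so the thresholds should be reported alongside the raw p-values.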
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback. We believe the major comments can be fully addressed through revisions that provide the requested methodological details and elaborations. Our point-by-point responses follow.
- Referee: [Methods (decomposition)] Decomposition procedure (Methods section): The central distinction that factual content remains largely stable while evaluative language varies relies on an unspecified decomposition of summaries into resume-grounded facts versus framing. No details are provided on the procedure (LLM-prompted, rule-based, or hybrid), validation against human judgments, or controls to ensure the decomposition itself is insensitive to name signals. If the decomposition step carries name bias, the reported stability of facts and tail variation in evaluations could be artifactual rather than reflective of the original summaries. This is load-bearing for the core empirical claims.
  Authors: We agree that the decomposition is load-bearing and that the manuscript would be strengthened by explicit details. In the revision, we will expand the Methods section to describe the decomposition procedure in full, including the specific approach taken (LLM-prompted classification into factual and evaluative elements), the complete prompts provided to the model, the results of validation against human judgments on a sample of summaries, and additional experiments confirming that the decomposition labels are not sensitive to the name perturbations. These additions will demonstrate that the observed patterns reflect the original summaries rather than artifacts of the analysis pipeline.
  Revision: yes
- Referee: [Simulation] Hiring simulation (Simulation section): The claim that evaluative summaries transform directional harm into symmetric instability evading audits depends on the simulation design. Without specifics on the downstream hiring model, how summaries are fed into assessment, the exact instability metric, or controls for resume realism, it is unclear whether the symmetric outcome is generalizable or an artifact of the simulation setup. This needs elaboration to support the audit-evasion implication.
  Authors: We acknowledge that more specifics on the simulation are needed to support the generalizability of the audit-evasion claim. We will revise the Simulation section to include comprehensive details on the downstream hiring model architecture and training, the precise manner in which summaries are incorporated into the assessment pipeline, the definition and computation of the instability metric, and further controls leveraging the real-world job postings to ensure resume realism. This will clarify how the symmetric instability emerges from the tail-concentrated evaluative bias and why it may evade standard audits.
  Revision: yes
Circularity Check
No circularity: controlled empirical measurement of LLM outputs
The paper performs a large-scale controlled experiment generating nearly one million resume summaries from four LLMs under systematic name perturbations on synthetic resumes and real job postings. It then decomposes the outputs into factual content versus evaluative framing and reports distributional statistics on name-conditioned variation, followed by a downstream hiring simulation. No equations, fitted parameters, or derivations appear; the central measurements are direct empirical observations of generated text rather than quantities defined in terms of the reported bias effects. The decomposition step is presented as a methodological choice for analysis, not as a self-referential definition or prediction that reduces to the input perturbations by construction. No load-bearing self-citations or uniqueness theorems are invoked in the provided text.
Axiom & Free-Parameter Ledger
axioms (1)
- [standard math] Standard statistical assumptions for identifying variation concentrated in distribution tails