arxiv: 2604.19204 · v1 · submitted 2026-04-21 · 💻 cs.CY · cs.LG

Auditing LLMs for Algorithmic Fairness in Casenote-Augmented Tabular Prediction

Pith reviewed 2026-05-10 01:42 UTC · model grok-4.3

keywords algorithmic fairness · LLM auditing · housing placement prediction · casenote augmentation · tabular classification · fine-tuning · error disparities · social services AI

The pith

A fine-tuned LLM with casenote summaries improves accuracy and reduces fairness disparities in housing placement predictions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines the fairness of using large language models to predict housing placements for individuals, combining standard tabular data with summaries of outreach casenotes. It shows that fine-tuning the model on this augmented input increases overall prediction accuracy compared to using tabular data alone. Importantly, this approach also lowers the differences in error rates across demographic groups. Experiments with zero-shot methods using variable importance yield mixed fairness outcomes. The authors stress that auditing is essential because of historical biases in housing services, and conclude that this method can add text information without introducing new biases.
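As an illustration of what "augmenting" a tabular record with a casenote summary might look like at the prompt level, here is a minimal sketch. It is not the authors' prompt: the field names, label strings, and wording are all hypothetical, and the paper's actual serialization is not given in this summary.

```python
def serialize_record(row, summary=None):
    """Render one tabular record as text, optionally followed by a casenote summary."""
    lines = [f"- {name}: {value}" for name, value in row.items()]
    text = "Client record:\n" + "\n".join(lines)
    if summary:
        text += "\nCasenote summary: " + summary.strip()
    return text


def build_prompt(row, labels, summary=None):
    """Build a classification prompt asking for exactly one of the given labels."""
    options = ", ".join(labels)
    return (
        serialize_record(row, summary)
        + "\n\nPredict the housing placement outcome. "
        + f"Answer with exactly one of: {options}."
    )


# Hypothetical record and label set, for illustration only.
example_row = {"age_bracket": "25-34", "months_in_program": 6}
example_labels = ["level_0", "level_1", "level_2"]

prompt_tabular = build_prompt(example_row, example_labels)
prompt_augmented = build_prompt(
    example_row, example_labels, summary="Client attended two outreach visits."
)
```

Keeping the tabular-only and augmented prompts identical except for the summary line makes it easier to attribute any change in accuracy or disparity to the added text.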

Core claim

A fine-tuned model augmented with casenote summaries can improve accuracy while reducing algorithmic fairness disparities in multi-class classification for housing placement, and zero-shot LLM classification does not introduce additional textual biases beyond those in tabular classification.

What carries the argument

The casenote-augmented fine-tuned LLM for tabular classification, audited via multi-class classification error disparities.
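The summary does not give the exact definition of the disparity metric. Since the paper's figures report per-group true positive rates (TPR), one plausible reading is a per-class TPR gap across demographic groups. The sketch below computes that quantity under this assumption; the function names and toy data are hypothetical.

```python
from collections import defaultdict


def per_group_tpr(y_true, y_pred, groups):
    """Return {group: {cls: TPR}}, where TPR is the share of true `cls` cases predicted as `cls`."""
    hits = defaultdict(lambda: defaultdict(int))
    totals = defaultdict(lambda: defaultdict(int))
    for t, p, g in zip(y_true, y_pred, groups):
        totals[g][t] += 1
        if t == p:
            hits[g][t] += 1
    return {
        g: {c: hits[g][c] / n for c, n in cls_totals.items()}
        for g, cls_totals in totals.items()
    }


def tpr_gap(tpr_by_group):
    """For each class, the largest minus the smallest TPR across groups that contain that class."""
    classes = {c for rates in tpr_by_group.values() for c in rates}
    gaps = {}
    for c in classes:
        vals = [rates[c] for rates in tpr_by_group.values() if c in rates]
        gaps[c] = max(vals) - min(vals)
    return gaps


# Toy data (hypothetical): two groups, two outcome classes labelled 0 and 2.
y_true = [0, 0, 2, 2, 0, 0, 2, 2]
y_pred = [0, 0, 2, 0, 0, 2, 2, 2]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

tpr = per_group_tpr(y_true, y_pred, groups)
gaps = tpr_gap(tpr)
```

In the toy data, group "a" has TPR 1.0 on class 0 and 0.5 on class 2, while group "b" has the reverse, so both per-class gaps are 0.5. A model that "reduces disparities" in this sense would shrink these gaps.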

Load-bearing premise

The specific casenote data and the multi-class error disparity metric adequately represent real-world fairness concerns in housing placement without unmeasured selection effects or redaction artifacts.

What would settle it

A new dataset of housing placements where applying the fine-tuned casenote-augmented model increases rather than decreases error disparities across groups, or fails to improve accuracy.

Figures

Figures reproduced from arXiv: 2604.19204 by Angela Zhou, Ezinne Nwankwo, Xiao Qi Lee.

Figure 6.1: True Positive Rate (TPR) Bar Plot Level 0 (Gender)
Figure 6.2: True Positive Rate (TPR) Bar Plot Level 2 (Gender)
Figure 6.3: True Positive Rate (TPR) Bar Plot Level 0 (Race)
Figure 6.4: True Positive Rate (TPR) Bar Plot Level 2 (Race)
Figure 1: Confusion Matrix for Zero-Shot 17B
Figure 2: Confusion Matrix for Zero-Shot 70B
Figure 3: Confusion Matrix for Fine-Tuned 70B
Figure 4: Confusion Matrix for Zero-Shot 17B + Summaries
Figure 5: Confusion Matrix for Zero-Shot 70B + Summaries
Figure 6: Confusion Matrix for Fine-Tuned 70B + Summaries
Original abstract

LLMs are increasingly being considered for prediction tasks in high-stakes social service settings, but their algorithmic fairness properties in this context are poorly understood. In this short technical report, we audit the algorithmic fairness of LLM-based tabular classification on a real housing placement prediction task, augmented with street outreach casenotes from a nonprofit partner. We audit multi-class classification error disparities. We find that a fine-tuned model augmented with casenote summaries can improve accuracy while reducing algorithmic fairness disparities. We experiment with variable importance improvements to zero-shot tabular classification and find mixed results on resulting algorithmic fairness. Overall, given historical inequities in housing placement, it is crucial to audit LLM use. We find that leveraging LLMs to augment tabular classification with casenote summaries can safely leverage additional text information at low implementation burden. The outreach casenotes are fairly short and heavily redacted. Our assessment is that LLM zero-shot classification does not introduce additional textual biases beyond algorithmic biases in tabular classification. Combining fine-tuning and leveraging casenote summaries can improve accuracy and algorithmic fairness.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper audits algorithmic fairness in LLM-augmented tabular classification for a real-world housing placement prediction task using short, heavily redacted street outreach casenotes. It makes three central empirical claims: fine-tuning an LLM with casenote summaries improves predictive accuracy while reducing multi-class classification error disparities; zero-shot LLM use with variable importance yields mixed fairness results; and LLM augmentation introduces no additional textual biases beyond those already present in the tabular data.

Significance. If the reported experimental comparisons hold, this short technical report offers timely evidence on the fairness properties of LLMs in high-stakes social-service prediction. It demonstrates a low-implementation-burden approach to incorporating textual casenote information that can simultaneously boost accuracy and mitigate existing group disparities, while explicitly noting data limitations (redaction, brevity) that support the no-new-bias conclusion. Such grounded audits are valuable for informing responsible LLM deployment amid historical housing inequities.

minor comments (3)
  1. The abstract and results would benefit from explicit reporting of dataset size, number of classes/groups, exact definition of the multi-class error disparity metric, and any statistical tests or error bars accompanying the accuracy and disparity improvements.
  2. Baseline comparisons with purely tabular (non-LLM) models and with standard fairness-aware tabular methods should be added or clarified to better contextualize the magnitude of the reported gains from casenote augmentation.
  3. The discussion of casenote redaction and potential selection effects could be expanded with a brief sensitivity analysis or qualitative assessment to strengthen the claim that no additional textual biases are introduced.
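The error bars requested in comment 1 could be obtained in several ways. A minimal sketch of one option, a percentile bootstrap over test examples for the accuracy gap between two groups, follows. It is an illustration rather than the authors' procedure; the function names, defaults, and toy data are assumptions.

```python
import random


def accuracy_gap(y_true, y_pred, groups, g1, g2):
    """Accuracy on group g1 minus accuracy on group g2."""
    def acc(g):
        pairs = [(t, p) for t, p, gg in zip(y_true, y_pred, groups) if gg == g]
        return sum(t == p for t, p in pairs) / len(pairs)
    return acc(g1) - acc(g2)


def bootstrap_gap_ci(y_true, y_pred, groups, g1, g2, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the accuracy gap, resampling examples with replacement."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        gs = [groups[i] for i in idx]
        if g1 not in gs or g2 not in gs:
            continue  # skip resamples in which one group is absent
        yt = [y_true[i] for i in idx]
        yp = [y_pred[i] for i in idx]
        stats.append(accuracy_gap(yt, yp, gs, g1, g2))
    if not stats:
        raise ValueError("no usable bootstrap resamples")
    stats.sort()
    lo = stats[int((alpha / 2) * len(stats))]
    hi = stats[min(len(stats) - 1, int((1 - alpha / 2) * len(stats)))]
    return lo, hi


# Toy data (hypothetical): alternating group membership, a few prediction errors.
y_true = [0, 1, 2, 0, 1, 2, 0, 1]
y_pred = [0, 1, 0, 0, 2, 2, 0, 1]
groups = ["a", "b", "a", "b", "a", "b", "a", "b"]
ci_low, ci_high = bootstrap_gap_ci(y_true, y_pred, groups, "a", "b")
```

An interval that excludes zero would suggest the gap is unlikely to be a sampling artifact. With small or heavily imbalanced groups, a stratified resampling scheme may be more appropriate.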

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive assessment of our manuscript and for recommending minor revision. We appreciate the recognition that the work provides timely evidence on the fairness properties of LLMs in high-stakes social-service prediction and that our low-implementation-burden approach can boost accuracy while mitigating group disparities.

Circularity Check

0 steps flagged

No significant circularity

The paper is an empirical audit reporting experimental results on LLM-augmented tabular classification for housing placement. The central claim—that fine-tuning with casenote summaries improves accuracy and reduces multi-class error disparities—is presented as a direct observation from model evaluations on the given dataset, not as a derivation, prediction, or quantity obtained by fitting parameters to a subset and then re-using them. No equations, self-definitional loops, uniqueness theorems, or load-bearing self-citations appear in the provided text; the fairness metrics and augmentation procedure are applied as standard evaluation steps without reducing the reported improvements to the inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Empirical auditing study with no mathematical derivations; relies on standard assumptions about data representativeness and metric validity rather than new postulates.

axioms (2)
  • Domain assumption: casenotes and tabular features are representative of the underlying population and decision process.
    Invoked implicitly when generalizing audit results to real deployment.
  • Domain assumption: multi-class error disparity is an appropriate proxy for algorithmic fairness in housing placement.
    Central to the auditing methodology described.


