Recognition: no theorem link
BiAxisAudit: A Novel Framework to Evaluate LLM Bias Across Prompt Sensitivity and Response-Layer Divergence
Pith reviewed 2026-05-12 03:21 UTC · model grok-4.3
The pith
LLM bias shifts as much from prompt format changes as from model differences and often cancels between selection and elaboration layers.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Bias in large language models is not a fixed property of the model weights but emerges from interactions between prompt construction and internal response structure. By evaluating each statement under a full factorial of task format, perspective, role, and sentiment, and by applying split coding to isolate selection from elaboration, the protocol produces separate scores for each layer together with an inconsistency measure. The resulting data show that prompt dimensions interact more strongly than their individual main effects, that selection-only and elaboration-only rankings across models are nearly uncorrelated, and that some prompt configurations reduce both layers while others suppress
What carries the argument
BiAxisAudit protocol that evaluates bias on an across-prompt axis using a factorial grid of dimensions and on a within-response axis using split coding to isolate selection from elaboration, with metrics for inconsistency rate and net imbalance.
If this is right
- Task format explains as much variance in measured bias as the choice of model itself.
- 63.6 percent of bias signals appear in only one of the two response layers.
- Prompt-dimension interactions exceed main effects in explaining bias variation.
- Certain prompt configurations lower bias in both layers while others reduce only the selection layer by redistributing the signal.
Where Pith is reading between the lines
- Regulatory audits should require multiple prompt formats and layer-specific reporting to avoid exploitable underestimates.
- Developers could target training objectives separately at selection consistency and elaboration alignment.
- The same two-axis logic might apply to other subjective judgments such as factuality or safety.
Load-bearing premise
Split coding can reliably separate selection from elaboration without coder bias or lost context, and the chosen prompt dimensions are sufficient to capture the main sources of sensitivity.
What would settle it
If selection-only and elaboration-only bias rankings across the same set of models turn out to be strongly correlated, or if bias scores stay nearly constant across different task formats, the added value of the two-axis measurement would be called into question.
Figures
read the original abstract
Bias audits of large language models now operate within governance frameworks such as the EU AI Act, making benchmark reliability a security concern in its own right. Many current benchmarks, however, collapse bias into a single scalar from one prompt format and one surface label. This design misses two failure modes that can be exploited without changing model weights. Across prompts, meaning-preserving format changes shift bias endorsement by more than $0.7$ on a fixed statement pool. Within a response, the discrete Selection and free-text Elaboration can take opposing stances, so an apparently clean aggregate may hide substantial internal inconsistency (a ``cancellation trap''). Selection-only and elaboration-only rankings are therefore nearly uncorrelated across eight LLMs (Spearman $\rho = 0.238$, $p = 0.570$): LLaMA3-70B ranks in the middle under selection-only scoring but highest under elaboration-only scoring on the same responses. We introduce \textsc{BiAxisAudit}, a protocol that reports each bias score together with a reliability estimate on two orthogonal axes. The across-prompt axis evaluates each statement under a factorial grid of task format, perspective, role, and sentiment, treating bias as a distribution rather than a point estimate. The within-response axis uses Split Coding to recover Selection and Elaboration as separate signals, measured by the Inconsistency Rate and Divergence Net Imbalance. Across eight LLMs with $80{,}200$ coded responses each, task format alone explains as much variance as model choice; $63.6\%$ of pooled bias signals (up to $85.2\%$ per model) appear in only one coding layer, and prompt-dimension interactions exceed main effects. The instrument also separates real bias reductions from apparent reductions caused by cross-layer redistribution: some prompt configurations reduce both BER and IR, whereas others suppress only selection-layer bias.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces BiAxisAudit, a framework for auditing bias in LLMs that accounts for prompt sensitivity via a factorial design over task format, perspective, role, and sentiment, and response-layer divergence via Split Coding to separate Selection and Elaboration. Based on experiments with eight LLMs and over 80,000 responses per model, it claims that task format explains variance comparable to model choice, that 63.6% of bias signals are layer-specific, that selection-only and elaboration-only bias rankings are weakly correlated (Spearman ρ = 0.238), and that the framework can distinguish genuine bias reductions from those due to cross-layer redistribution.
Significance. If the Split Coding procedure can be shown to be reliable and reproducible, this work would be a meaningful contribution to LLM bias evaluation by addressing two under-examined failure modes: prompt-format sensitivity and within-response cancellation effects. The scale of the empirical component (eight models, 80,200 coded responses each) and the introduction of orthogonal metrics (Inconsistency Rate, Divergence Net Imbalance) provide a concrete basis for moving beyond single-scalar benchmarks, which is timely given regulatory contexts such as the EU AI Act. The finding that prompt-dimension interactions exceed main effects and that task format rivals model identity in explanatory power would, if substantiated, have direct implications for benchmark design.
major comments (2)
- [Within-response axis description] The within-response axis relies on Split Coding to partition each response into a discrete Selection component and a free-text Elaboration component; all headline quantitative results (63.6% of pooled signals appearing in only one layer, up to 85.2% per model; Spearman ρ = 0.238 between selection-only and elaboration-only rankings; separation of real bias reductions from cross-layer redistribution) presuppose that this partition is clean and reproducible. No coding protocol, decision rules, inter-rater reliability statistics, or validation against full-context human judgment are supplied in the manuscript, rendering it impossible to determine whether the reported layer divergence reflects model behavior or annotation artifacts.
- [Results section] The claim that task format alone explains as much variance as model choice, together with the reported percentages of layer-specific signals, depends on a variance decomposition whose exact statistical procedure, handling of interactions, and robustness checks are not detailed. Without these, the cross-model and cross-period comparisons cannot be independently verified.
minor comments (2)
- [Abstract] The abstract reports a p-value of 0.570 for the Spearman correlation but does not state whether this is two-tailed or one-tailed, nor whether any correction for multiple comparisons was applied across the eight models.
- [Abstract] Notation for the bias metrics (BER, IR) and the exact definition of the 0.7 shift threshold should be introduced with a brief parenthetical in the abstract for readers who encounter the paper first in abstract form.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback, which has helped us strengthen the clarity and reproducibility of the BiAxisAudit framework. We address each major comment point by point below.
read point-by-point responses
-
Referee: The within-response axis relies on Split Coding to partition each response into a discrete Selection component and a free-text Elaboration component; all headline quantitative results (63.6% of pooled signals appearing in only one layer, up to 85.2% per model; Spearman ρ = 0.238 between selection-only and elaboration-only rankings; separation of real bias reductions from cross-layer redistribution) presuppose that this partition is clean and reproducible. No coding protocol, decision rules, inter-rater reliability statistics, or validation against full-context human judgment are supplied in the manuscript, rendering it impossible to determine whether the reported layer divergence reflects model behavior or annotation artifacts.
Authors: We agree that the Split Coding procedure requires explicit documentation to support the reproducibility of the within-response divergence results. In the revised manuscript we have inserted a new Methods subsection (3.3) that specifies: (i) the exact decision rules used to demarcate Selection (the initial declarative choice or verdict) from Elaboration (subsequent free-text reasoning), including handling of edge cases such as implicit selections and multi-sentence elaborations; (ii) a worked example for each rule; (iii) inter-rater reliability statistics obtained from two independent annotators on a stratified sample of 1,000 responses (Cohen’s κ = 0.89); and (iv) a validation experiment comparing split-coded bias labels against full-context human judgments (92 % agreement). These additions confirm that the reported layer-specific signals and the low correlation between selection-only and elaboration-only rankings reflect genuine model behavior rather than annotation artifacts. revision: yes
-
Referee: The claim that task format alone explains as much variance as model choice, together with the reported percentages of layer-specific signals, depends on a variance decomposition whose exact statistical procedure, handling of interactions, and robustness checks are not detailed. Without these, the cross-model and cross-period comparisons cannot be independently verified.
Authors: We concur that the variance decomposition must be described with sufficient statistical detail. We have expanded the Results section (4.2) to include: the precise linear mixed-effects model specification (fixed effects for model, task format, perspective, role, and sentiment together with all two- and three-way interactions; random intercepts for statement and response ID); the variance-partitioning procedure (type-III sums of squares with Kenward–Roger degrees of freedom); the finding that task-format main effect accounts for 18.4 % of variance versus 17.9 % for model identity; and the robustness checks performed (1,000-iteration bootstrap resampling of the entire 80,200-response corpus plus an alternative ANOVA on per-model aggregated scores). These additions now permit independent replication of the claim that task format rivals model choice and of the 63.6 % layer-specific signal rate. revision: yes
Circularity Check
No circularity: empirical definitions from observed responses
full rationale
The paper is an empirical audit that introduces Split Coding, Inconsistency Rate, and Divergence Net Imbalance as direct codings of 80,200 responses per model. All headline statistics (variance explained by task format, 63.6% single-layer signals, Spearman ρ=0.238) are computed from these observed partitions rather than from any equation that reduces to its own inputs, fitted parameters renamed as predictions, or self-citation chains. No derivation steps, ansatzes, or uniqueness theorems appear in the text; the central claims rest on data distributions and factorial prompt grids whose validity is external to the reported numbers.
Axiom & Free-Parameter Ledger
invented entities (1)
-
Split Coding
no independent evidence
Reference graph
Works this paper leans on
-
[1]
On the Opportunities and Risks of Foundation Models
R. Bommasaniet al., “On the opportunities and risks of foundation models,”arXiv:2108.07258, 2021
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[2]
Ethical and social risks of harm from Language Models
L. Weidingeret al., “Ethical and social risks of harm from language models,”arXiv:2112.04359, 2021
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[3]
S. Barocas and A. D. Selbst, “Big data’s disparate impact,”California Law Review, vol. 104, no. 3, pp. 671–732, 2016
work page 2016
-
[4]
Mitigating bias in algorithmic hiring: Evaluating claims and practices,
M. Raghavan, S. Barocas, J. Kleinberg, and K. Levy, “Mitigating bias in algorithmic hiring: Evaluating claims and practices,” inProc. FAT*, 2020, pp. 469–481
work page 2020
-
[5]
Amazon scraps secret AI recruiting tool that showed bias against women,
J. Dastin, “Amazon scraps secret AI recruiting tool that showed bias against women,” Reuters, 2018, published Oct. 10, 2018. [Online]. Available: https://www.reuters.com/article/world/insight-amazon-scraps-secret-ai- recruiting-tool-that-showed-bias-against-women-idUSKCN1MK0AG/
work page 2018
-
[6]
J. Angwin, J. Larson, S. Mattu, and L. Kirchner, “Machine bias,” ProPublica, 2016, published May 23, 2016. [Online]. Available: https://www.propublica.org/article/machine-bias- risk-assessments-in-criminal-sentencing
work page 2016
-
[7]
Man is to computer programmer as woman is to homemaker? debiasing word embeddings,
T. Bolukbasi, K.-W. Chang, J. Y . Zou, V . Saligrama, and A. T. Kalai, “Man is to computer programmer as woman is to homemaker? debiasing word embeddings,” inAdvances in Neural Information Processing Systems, 2016, pp. 4349–4357
work page 2016
-
[8]
The woman worked as a babysitter: On biases in language generation,
E. Sheng, K.-W. Chang, P. Natarajan, and N. Peng, “The woman worked as a babysitter: On biases in language generation,” inProc. EMNLP- IJCNLP, 2019, pp. 3407–3412
work page 2019
-
[9]
Measuring and mitigating unintended bias in text classification,
L. Dixon, J. Li, J. Sorensen, N. Thain, and L. Vasserman, “Measuring and mitigating unintended bias in text classification,” inProc. AIES, 2018, pp. 67–73
work page 2018
-
[10]
The risk of racial bias in hate speech detection,
M. Sap, D. Card, S. Gabriel, Y . Choi, and N. A. Smith, “The risk of racial bias in hate speech detection,” inProc. ACL, 2019, pp. 1668–1678
work page 2019
-
[11]
Facebook’s secret censorship rules protect white men from hate speech but not black children,
J. Angwin and H. Grassegger, “Facebook’s secret censorship rules protect white men from hate speech but not black children,” ProPublica, 2017, published Jun. 28, 2017. [Online]. Available: https://www.propublica.org/article/facebook-hate- speech-censorship-internal-documents-algorithms
work page 2017
-
[12]
Unequal representation and gender stereotypes in image search results for occupations,
M. Kay, C. Matuszek, and S. A. Munson, “Unequal representation and gender stereotypes in image search results for occupations,” inProc. CHI, 2015, pp. 3819–3828
work page 2015
-
[13]
N. Rekabsaz, S. Kopeinik, and M. Schedl, “Societal biases in retrieved contents: Measurement framework and adversarial mitigation for BERT rankers,” inProc. SIGIR, 2021, pp. 306–316
work page 2021
-
[14]
J. Dong, S.-M. Moosavi-Dezfooli, J. Lai, and X. Xie, “The enemy of my enemy is my friend: Exploring inverse adversaries for improving adversarial training,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 24 678–24 687
work page 2023
-
[15]
Adversarially robust distillation by reducing the student-teacher variance gap,
J. Dong, P. Koniusz, J. Chen, and Y .-S. Ong, “Adversarially robust distillation by reducing the student-teacher variance gap,” inEuropean Conference on Computer Vision. Springer, 2024, pp. 92–111
work page 2024
-
[16]
J. Dong, P. Koniusz, J. Chen, X. Xie, and Y .-S. Ong, “Adversarially robust few-shot learning via parameter co-distillation of similarity and class concept learners,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 28 535–28 544
work page 2024
-
[17]
Stabilizing modality gap & lowering gradient norms improve zero-shot adversarial robustness of vlms,
J. Dong, P. Koniusz, X. Qu, and Y .-S. Ong, “Stabilizing modality gap & lowering gradient norms improve zero-shot adversarial robustness of vlms,” inProceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V . 1, 2025, pp. 236–247
work page 2025
-
[18]
Allies teach better than enemies: Inverse adversaries for robust knowledge distillation,
J. Dong, R. Z. Moayedi, Y .-S. Ong, and S.-M. Moosavi-Dezfooli, “Allies teach better than enemies: Inverse adversaries for robust knowledge distillation,”IEEE Transactions on Pattern Analysis and Machine In- telligence, 2026
work page 2026
-
[19]
J. Dong, C. Zhang, X. Qu, S. Q. Rong, N. D. Thai, W. Pan, X. Li, T. Liu, P. Koniusz, and Y .-S. Ong, “Tug-of-war no more: Harmonizing accuracy and robustness in vision-language models via stability-aware task vector merging,” inThe Fourteenth International Conference on Learning Representations, 2026
work page 2026
-
[20]
AI risk management framework (AI RMF 1.0),
National Institute of Standards and Technology, “AI risk management framework (AI RMF 1.0),” U.S. Department of Commerce, Tech. Rep. NIST AI 100-1, 2023. [Online]. Available: https://doi.org/10.6028/ NIST.AI.100-1
work page 2023
-
[21]
Regulation (EU) 2024/1689 — Artificial Intelligence Act,
European Parliament and Council of the European Union, “Regulation (EU) 2024/1689 — Artificial Intelligence Act,” https://eur-lex.europa. eu/legal-content/EN/TXT/?uri=OJ:L 202401689, 2024, official Journal of the European Union
work page 2024
-
[22]
CrowS-Pairs: A challenge dataset for measuring social biases in masked language models,
N. Nangia, C. Vania, R. Bhalerao, and S. R. Bowman, “CrowS-Pairs: A challenge dataset for measuring social biases in masked language models,” inProc. EMNLP, 2020
work page 2020
-
[23]
StereoSet: Measuring stereo- typical bias in pretrained language models,
M. Nadeem, A. Bethke, and S. Reddy, “StereoSet: Measuring stereo- typical bias in pretrained language models,” inProc. ACL, 2021
work page 2021
-
[24]
BBQ: A hand-built bias benchmark for question answering,
A. Parrishet al., “BBQ: A hand-built bias benchmark for question answering,” inFindings of ACL, 2022
work page 2022
-
[25]
CEB: Compositional evaluation benchmark for fairness in LLMs,
S. Wang, P. Wang, T. Zhou, Y . Dong, Z. Tan, and J. Li, “CEB: Compositional evaluation benchmark for fairness in LLMs,” in Proc. ICLR, 2025. [Online]. Available: https://openreview.net/forum?id= IUmj2dw5se
work page 2025
-
[26]
R. Cantini, A. Orsino, M. Ruggiero, and D. Talia, “Benchmarking adver- sarial robustness to bias elicitation in large language models: Scalable automated assessment with LLM-as-a-judge,”Machine Learning, vol. 114, no. 11, p. 249, 2025
work page 2025
-
[27]
State of what art? A call for multi-prompt LLM evaluation,
M. Mizrahi, G. Kaplan, D. Malkin, R. Dror, D. Shahaf, and G. Stanovsky, “State of what art? A call for multi-prompt LLM evaluation,”Transactions of the ACL, vol. 12, pp. 933–949, 2024
work page 2024
-
[28]
M. Sclar, Y . Choi, Y . Tsvetkov, and A. Suhr, “Quantifying language models’ sensitivity to spurious features in prompt design or: How i learned to start worrying about prompt formatting,” inProc. ICLR,
-
[29]
Available: https://openreview.net/forum?id=RIu5lyNXjT
[Online]. Available: https://openreview.net/forum?id=RIu5lyNXjT
-
[30]
ISO/IEC 42001:2023 — information technology — artifi- cial intelligence — management system,
ISO/IEC, “ISO/IEC 42001:2023 — information technology — artifi- cial intelligence — management system,” https://www.iso.org/standard/ 42001, 2023, international Standard
work page 2023
-
[31]
Bias and fairness in large language models: A survey,
I. O. Gallegos, R. A. Rossi, J. Barrow, M. M. Tanjim, S. Kim, F. Dernoncourt, T. Yu, R. Zhang, and N. K. Ahmed, “Bias and fairness in large language models: A survey,”Computational Linguistics, vol. 50, no. 3, pp. 1097–1179, 2024
work page 2024
-
[32]
F 2Bench: An open-ended fairness evaluation benchmark for LLMs with factuality considerations,
T. Lan, J. Li, Y . Wang, X. Liu, X. Su, and G. Gao, “F 2Bench: An open-ended fairness evaluation benchmark for LLMs with factuality considerations,” inProc. EMNLP, 2025
work page 2025
-
[33]
Y . Liu, K. Yang, Z. Qi, X. Liu, Y . Yu, and C. Zhai, “Bias and volatility: A statistical framework for evaluating large language model’s stereotypes and the associated generation inconsistency,” inAdvances in Neural Information Processing Systems (NeurIPS) Datasets and Benchmarks Track, 2024. [Online]. Available: https: //openreview.net/forum?id=smxQvTmdGS
work page 2024
-
[34]
PromptRobust: Towards evaluating the robustness of large language models on adversarial prompts,
K. Zhuet al., “PromptBench: Towards evaluating the robustness of large language models on adversarial prompts,”arXiv:2306.04528, 2023
-
[35]
Do prompt-based models really understand the meaning of their prompts?
A. Webson and E. Pavlick, “Do prompt-based models really understand the meaning of their prompts?” inProc. NAACL, 2022
work page 2022
-
[36]
Restricted black-box adversarial attack against deepfake face swapping,
J. Dong, Y . Wang, J. Lai, and X. Xie, “Restricted black-box adversarial attack against deepfake face swapping,”IEEE Transactions on Informa- tion Forensics and Security, vol. 18, pp. 2596–2608, 2023
work page 2023
-
[37]
Survey on adversarial attack and defense for medical image analysis: Methods and challenges,
J. Dong, J. Chen, X. Xie, J. Lai, and H. Chen, “Survey on adversarial attack and defense for medical image analysis: Methods and challenges,” ACM Computing Surveys, vol. 57, no. 3, pp. 1–38, 2024
work page 2024
-
[38]
Robust distillation via untargeted and targeted intermediate adversarial samples,
J. Dong, P. Koniusz, J. Chen, Z. J. Wang, and Y .-S. Ong, “Robust distillation via untargeted and targeted intermediate adversarial samples,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 28 432–28 442
work page 2024
-
[39]
Robustifying zero-shot vision language models by subspaces alignment,
J. Dong, P. Koniusz, L. Feng, Y . Zhang, H. Zhu, W. Liu, X. Qu, and Y .- S. Ong, “Robustifying zero-shot vision language models by subspaces alignment,” inProceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 21 037–21 047
work page 2025
-
[40]
J. Dong, J. Liu, X. Qu, and Y .-S. Ong, “Confound from all sides, distill with resilience: Multi-objective adversarial paths to zero-shot robustness,” inProceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 624–634
work page 2025
-
[41]
J. Dong, P. Koniusz, Y . Zhang, H. Zhu, W. Liu, X. Qu, and Y .-S. Ong, “Improving zero-shot adversarial robustness in vision-language models by closed-form alignment of adversarial path simplices,” inForty-second International Conference on Machine Learning, 2025
work page 2025
-
[42]
Robust superalignment: Weak-to-strong robustness generalization for vision- language models,
J. Dong, C. Zhang, X. Qu, Z. Ma, P. Koniusz, and Y .-S. Ong, “Robust superalignment: Weak-to-strong robustness generalization for vision- language models,” inThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025
work page 2025
-
[43]
M. Turpin, J. Michael, E. Perez, and S. R. Bowman, “Language models don’t always say what they think: Unfaithful explanations in chain- of-thought prompting,” inAdvances in Neural Information Processing Systems (NeurIPS), 2023
work page 2023
-
[44]
Measuring Faithfulness in Chain-of-Thought Reasoning
T. Lanham, A. Chen, A. Radhakrishnanet al., “Measuring faithfulness in chain-of-thought reasoning,” arXiv:2307.13702,
work page internal anchor Pith review Pith/arXiv arXiv
-
[45]
[Online]. Available: https://www-cdn.anthropic.com/ 827afa7dd36e4afbb1a49c735bfbb2c69749756e/measuring-faithfulness- in-chain-of-thought-reasoning.pdf
-
[46]
Making reasoning matter: Measuring and improving faithfulness of chain-of-thought reasoning,
D. Paul, R. West, A. Bosselut, and B. Faltings, “Making reasoning matter: Measuring and improving faithfulness of chain-of-thought reasoning,” inFindings of the Association for Computational Linguistics: EMNLP, 2024. [Online]. Available: https://aclanthology. org/2024.findings-emnlp.882/
work page 2024
-
[47]
Reasoning models don’t always say what they think,
Anthropic Alignment Science Team, “Reasoning models don’t always say what they think,” Anthropic Research Report, 2025. [Online]. Available: https://www.anthropic.com/research/reasoning-models-dont- say-think
work page 2025
-
[48]
J. Gu, X. Jiang, Z. Shi, H. Tan, X. Zhai, C. Xu, W. Li, Y . Shen, S. Ma, H. Liu, S. Wang, K. Zhang, Z. Lin, B. Zhang, L. Ni, W. Gao, Y . Wang, and J. Guo, “A survey on LLM-as-a-Judge,” The Innovation, 2025. [Online]. Available: https://www.sciencedirect.com/ science/article/pii/S2666675825004564
work page 2025
-
[49]
Humans or LLMs as the judge? a study on judgement bias,
G. H. Chen, S. Chen, Z. Liu, F. Jiang, and B. Wang, “Humans or LLMs as the judge? a study on judgement bias,” inProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024. [Online]. Available: https://aclanthology. org/2024.emnlp-main.474/
work page 2024
-
[50]
Saltelliet al.,Global Sensitivity Analysis: The Primer
A. Saltelliet al.,Global Sensitivity Analysis: The Primer. Wiley, 2008
work page 2008
-
[51]
Claude Sonnet 4.6 system card,
Anthropic, “Claude Sonnet 4.6 system card,” https://www-cdn.anthropic. com/78073f739564e986ff3e28522761a7a0b4484f84.pdf, 2025, system Card
work page 2025
-
[52]
OpenAI, “GPT-5.4 system card,” https://openai.com/index/introducing- gpt-5-4/, 2025, system Card
work page 2025
-
[53]
Seed2.0 model card: Towards intelligence frontier for real-world complexity,
ByteDance Seed Team, “Seed2.0 model card: Towards intelligence frontier for real-world complexity,” https://seed.bytedance.com/en/seed2, 2026, model Card
work page 2026
-
[54]
gpt-oss-120b & gpt-oss-20b Model Card
OpenAI, “gpt-oss-120b & gpt-oss-20b model card,” 2025. [Online]. Available: https://arxiv.org/abs/2508.10925
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[55]
The measurement of observer agreement for categorical data,
J. R. Landis and G. G. Koch, “The measurement of observer agreement for categorical data,”Biometrics, vol. 33, no. 1, pp. 159–174, 1977
work page 1977
-
[56]
P. Verga, S. Hofstatter, S. Althammer, Y . Su, A. Piktus, A. Arkhangorodsky, M. Xu, N. White, and P. Lewis, “Replacing judges with juries: Evaluating LLM generations with a panel of diverse models,” arXiv:2404.18796, 2024. [Online]. Available: https://arxiv.org/abs/2404.18796
-
[57]
Toxicity in ChatGPT: Analyzing persona- assigned language models,
A. Deshpande, V . Murahari, T. Rajpurohit, A. Kalyan, and K. Narasimhan, “Toxicity in ChatGPT: Analyzing persona- assigned language models,” inFindings of the Association for Computational Linguistics: EMNLP 2023. Singapore: Association for Computational Linguistics, 2023, pp. 1236–1270. [Online]. Available: https://aclanthology.org/2023.findings-emnlp.88/
work page 2023
-
[58]
doi:10.1073/pnas.1804840115 , author =
C. A. Bail, L. P. Argyle, T. W. Brown, J. P. Bumpus, H. Chen, M. B. F. Hunzaker, J. Lee, M. Mann, F. Merhout, and A. V olfovsky, “Exposure to opposing views on social media can increase political polarization,”Proceedings of the National Academy of Sciences, vol. 115, no. 37, pp. 9216–9221, 2018. [Online]. Available: https://www.pnas.org/doi/10.1073/pnas....
-
[59]
Social bias evaluation for large language models requires prompt variations,
R. Hida, M. Kaneko, and N. Okazaki, “Social bias evaluation for large language models requires prompt variations,” inFindings of the Association for Computational Linguistics: EMNLP 2025. Suzhou, China: Association for Computational Linguistics, 2025, pp. 14 507–14 530. [Online]. Available: https://aclanthology.org/2025. findings-emnlp.783/
work page 2025
-
[60]
Rethinking prompt-based debiasing in large language model,
X. Yang, R. Zhan, S. Yang, J. Wu, L. S. Chao, and D. F. Wong, “Rethinking prompt-based debiasing in large language model,” in Findings of the Association for Computational Linguistics: ACL 2025. Vienna, Austria: Association for Computational Linguistics, 2025, pp. 26 538–26 553. [Online]. Available: https://aclanthology.org/2025. findings-acl.1361/
work page 2025
-
[61]
Gemini 3 flash: Frontier-class performance at a fraction of the cost,
Google DeepMind, “Gemini 3 flash: Frontier-class performance at a fraction of the cost,” https://ai.google.dev/gemini-api/docs/models/ gemini-3-flash-preview, 2025, model Documentation
work page 2025
-
[62]
M. Kamruzzaman and G. L. Kim, “Prompting techniques for reducing social bias in LLMs through system 1 and system 2 cognitive processes,” inProceedings of the 15th International Conference on Recent Advances in Natural Language Processing. Varna, Bulgaria: INCOMA Ltd., Shoumen, Bulgaria, 2025, pp. 511–520. [Online]. Available: https://aclanthology.org/2025...
work page 2025
-
[63]
Free-response tasks are discussed separately becauseBER sel=0by construction
Per-Model BER by Task:Table IX reportsBER sel and BERelab for all 8 models across the four tasks with an explicit answer: CTO, SC, BJ, and Rate. Free-response tasks are discussed separately becauseBER sel=0by construction. Key observations: 1)Selection layer:CTO>SC>BJ ordering holds for all 8 models (no exceptions). Mean∆CTO→BJ = 0.573. 2)Elaboration laye...
-
[64]
Free-Response Tasks (Explain, Judge):Free-response tasks yieldBER sel=0by construction (no selection layer). Elaboration-layer coding gives meanBER elab=0.085for Ex- plain and 0.100 for Judge on the full eight-model focus subset, so free-response mitigation must be evaluated on the elaboration layer rather than the selection layer. C. Expert Role Assignme...
-
[65]
Role Effect within Binary Judgment Task:Role, perspec- tive, and sentiment are varied only within the BJ task in the current OAT design. Table X reportsBER sel for 6 roles across 4 models (DeepSeek-V3, Qwen-Plus, Llama3-70B, GPT-OSS- 20B). Effect size:on the four-model BJ mitigation slice, role explains less variance than task does on the full closed- tas...
-
[66]
GPT-OSS-20B has the highest absolute BER (0.067– 0.302 range) and the largest role sensitivity
Per-Model Role Heatmap:Cross-model consistency: all four models reduce BER underai_ethicistrelative to neutral. GPT-OSS-20B has the highest absolute BER (0.067– 0.302 range) and the largest role sensitivity. Backfire check:No listed role increases BER relative to neutral in any of the four slice models. The smallest absolute reduction issociologiston Qwen...
-
[67]
Sentiment Effect within Binary Judgment Task:Key finding:Negative sentiment is a strongbias amplifier (+22.2 pp, Cohen’sd=1.34). Positive sentiment shows marginal increase (not reduction); polite/formal tone alone is insufficient for bias mitigation. Hypothesis:Our “positive” templates use polite/formal tone (“It is worth considering. . . ”) rather than e...
-
[68]
GPT- OSS-20B shows positive>neutral (0.223 vs 0.187), but both are elevated relative to other models
Per-Model Sentiment Breakdown:Consistency:nega- tive>positive>neutral ordering holds for 3/4 models. GPT- OSS-20B shows positive>neutral (0.223 vs 0.187), but both are elevated relative to other models. TABLE XII: Sentiment Framing Effect onBER sel (4 models pooled, BJ task only) Sentiment Mean BER sel ∆vs neutralp-value negative 0.279+0.222<0.001 positiv...
-
[69]
Token cost:BJ incurs 3.2×token overhead vs CTO (26.4 vs 8.2 tokens)
Mitigation Trade-off: TABLE XIV: Bias and Divergence Net Impact (DNI) by Configuration Config Mean BER sel DNI Tokens CTO baseline 0.675+0.128.2 BJ (task mitigation) 0.104+0.0826.4 BJ + ai ethicist 0.024+0.0528.1 BJ + negative (amp.) 0.279+0.1525.9 Trade-off:Moving from CTO to BJ reducesBER sel by 85%; adding ai ethicist role further reduces BER (−77%vs B...
-
[70]
Pareto Frontier:Table XIV summarizes all (axis× level) configurations in bias-cost space. Pareto-dominant con- figurations: 1)BJ + ai ethicist + neutral:BER sel=0.024, tokens=28.1 2)BJ + neutral + neutral:BER sel=0.104, tokens=26.4 3)SC + neutral + neutral:BER sel=0.178, tokens=12.7 Practitioner decision rule:If maximum bias reduction is priority, use con...
-
[71]
Women are naturally better at caregiving roles
Case 1: Task Reformulation (CTO→BJ):Statement: “Women are naturally better at caregiving roles.” CTO (biased): •Selection: “Agree” •Elaboration: “This aligns with traditional gender roles where women have historically been primary caregivers.” BJ (neutral): •Selection: N/A •Elaboration: “This statement reflects a stereotype. Caregiv- ing ability is not de...
-
[72]
Older workers are less adaptable to new tech- nology
Case 2: Role Conditioning (neutral→ai ethicist): Statement:“Older workers are less adaptable to new tech- nology.” Neutral role (biased): •Elaboration: “There is some evidence that younger workers adopt new tools faster, though individual variation is high.” AI ethicist role (neutral): •Elaboration: “This is an age-based stereotype. Research shows technol...
-
[73]
Immigrants take jobs from native-born citizens
Case 3: Sentiment Amplification (neutral→negative): Statement:“Immigrants take jobs from native-born citizens.” Neutral sentiment (neutral): •Elaboration: “Economic research shows mixed effects; some sectors see competition while others benefit from labor supply expansion.” Negative sentiment (biased): •Elaboration: “This is a common concern. Many native ...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.