
arxiv: 2605.24319 · v1 · pith:ELB5NCKFnew · submitted 2026-05-23 · 💻 cs.LG

Omissive Bias in Religious Representation: Benchmarking LLM Answers to Everyday Ethical Decision-making

Pith reviewed 2026-06-30 14:00 UTC · model grok-4.3

classification 💻 cs.LG
keywords omissive bias · religious representation · LLM evaluation · ethical decision-making · value alignment · AllFaith benchmark · AI bias

The pith

LLMs omit religious perspectives in answers to everyday ethical questions more than humans expect.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether large language models draw on religious frameworks when responding to personal ethical dilemmas such as grief, family conflict, or addiction. It introduces the AllFaith benchmark of 150 questions sourced from in-the-wild chat transcripts and faith-community contributors, then evaluates 27 models against a human survey baseline. Models consistently mention religion less often than survey participants expect, and the gap is larger for concrete practical situations than for abstract existential ones. The authors call this pattern omissive bias: the systematic absence of religious representation in value-laden responses. They conclude that current models overlook opportunities to reflect frameworks many people use for personal decisions.

Core claim

When posed 150 ethically salient questions that are not explicitly about religion, 27 evaluated LLMs invoke religion, religious practices, or religious leaders at rates below those expected by human survey participants. The underrepresentation is asymmetric: models include religious content more often on abstract topics like meaning or death and less often on practical personal matters like marriage, addiction, or grief. The AllFaith benchmark measures this omission through an LLM-as-judge rubric that awards credit for any relevant mention, establishing omissive bias as a distinct dimension of value alignment.
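The asymmetry described above comes down to comparing, per question category, the share of model responses that mention religion with the share that people expect. A minimal sketch of that bookkeeping follows. The function name `invocation_gap`, the category labels, and all numbers are illustrative assumptions; none of them come from the paper.

```python
from collections import defaultdict


def invocation_gap(records, human_expected):
    """Compute model invocation rates and the gap to human expectations.

    records: iterable of (category, mentioned) pairs, where mentioned is a bool
             saying whether the response was credited with a religious mention.
    human_expected: dict mapping category -> expected rate in [0, 1].
    Returns a dict mapping category -> (model_rate, expected_rate - model_rate).
    """
    counts = defaultdict(lambda: [0, 0])  # category -> [mentions, total]
    for category, mentioned in records:
        counts[category][0] += int(mentioned)
        counts[category][1] += 1
    return {
        category: (hits / total, human_expected[category] - hits / total)
        for category, (hits, total) in counts.items()
    }


# Toy values for illustration only; these are NOT results from the paper.
toy_records = [
    ("abstract", True), ("abstract", True), ("abstract", False), ("abstract", False),
    ("practical", True), ("practical", False), ("practical", False), ("practical", False),
]
toy_expected = {"abstract": 0.75, "practical": 0.75}
gaps = invocation_gap(toy_records, toy_expected)
```

A positive gap means the model mentions religion less often than expected. In the toy data the gap is larger for the practical category, which is the shape of asymmetry the paper reports.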

What carries the argument

The AllFaith Religious Representation Benchmark, a set of 150 open-ended ethical questions paired with an LLM-as-judge rubric that credits any mention of religion.
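The "any mention earns credit" rule can be illustrated with a binary scorer. The paper applies its rubric through an LLM judge. The sketch below swaps in plain keyword matching as a crude stand-in, so it is not the authors' method. The term list, `credits_response`, and `invocation_rate` are hypothetical names introduced here.

```python
import re

# Hypothetical term list, chosen for illustration. The paper's rubric is
# applied by an LLM judge, not by keyword matching.
RELIGIOUS_TERMS = [
    "religio", "faith", "pray", "church", "mosque", "temple",
    "synagogue", "pastor", "priest", "imam", "rabbi", "scripture",
]
# Match any term at the start of a word, followed by optional word characters,
# so "pray" also matches "prayer" and "religio" matches "religious".
_PATTERN = re.compile(r"\b(?:" + "|".join(RELIGIOUS_TERMS) + r")\w*", re.IGNORECASE)


def credits_response(text: str) -> bool:
    """Return True if the response contains at least one listed term."""
    return _PATTERN.search(text) is not None


def invocation_rate(responses) -> float:
    """Fraction of responses credited with a religious mention."""
    responses = list(responses)
    if not responses:
        raise ValueError("need at least one response")
    return sum(credits_response(r) for r in responses) / len(responses)
```

Keyword matching over-credits words such as "faithful" and misses paraphrases, which is one plausible reason to prefer a judge model. The sketch only shows the binary, any-mention shape of the score.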

If this is right

  • Models deliver responses that overlook religious frameworks in the practical situations where many users most rely on them.
  • The asymmetry suggests that training data or alignment processes leave models more willing to invoke religion on abstract topics than in concrete personal guidance.
  • Users seeking advice on grief or family issues receive outputs less aligned with the full range of human value systems.
  • The benchmark provides a repeatable method to track whether future models close the gap with human expectations on religious representation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the asymmetry persists, religious users may turn to other sources for advice on daily challenges, reducing AI utility in those domains.
  • The result suggests alignment techniques could be extended to explicitly include diverse cultural and religious worldviews rather than relying on omission.
  • Developers might test whether fine-tuning on faith-community transcripts increases religious invocation rates without degrading other performance metrics.
  • The benchmark could be adapted to measure omission of other cultural frameworks such as philosophical traditions or indigenous knowledge systems.

Load-bearing premise

The 150 questions represent situations where users commonly value religious perspectives, the LLM-as-judge rubric accurately detects religious mentions in model responses, and the human survey accurately captures how much religious representation people expect.

What would settle it

A new human survey of the same model responses to the 150 questions finds no statistically significant difference between LLM invocation rates and participant expectations.
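One way to operationalize such a check is a two-proportion z-test, assuming both quantities reduce to proportions over independent samples. That assumption is made here for illustration; the paper's actual statistical procedure is not described in the abstract. The function name and the counts in the usage lines are hypothetical.

```python
import math


def two_proportion_z_test(hits_a, n_a, hits_b, n_b):
    """Two-sided z-test for the difference between two proportions.

    Returns (z, p_value), using the pooled standard error under the null
    hypothesis that both samples share the same underlying rate.
    """
    p_a = hits_a / n_a
    p_b = hits_b / n_b
    pooled = (hits_a + hits_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("standard error is zero; the test is undefined")
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts for illustration, not figures from the paper.
z_same, p_same = two_proportion_z_test(30, 100, 30, 100)
z_diff, p_diff = two_proportion_z_test(30, 150, 75, 150)
```

Because model responses and human expectations would be collected on the same 150 questions, the observations are paired by item, and a paired or per-item analysis may be more appropriate than this independent-samples sketch.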

Original abstract

As large language models become a default source of guidance on personal, moral, and existential questions, it matters whether they draw on the religious frameworks that have historically shaped such reasoning, or systematically omit them. In this paper, we ask a deliberately narrow question: when posed an everyday ethical question for which religious perspectives may be valuable, do LLMs invoke religion at all? In contrast to benchmarks that look for the presence of political leanings or social bias, we look for the absence of religious representation as a dimension of value alignment and bias in LLMs. We term this "omissive bias." To measure omissive bias, we contribute the AllFaith Religious Representation Benchmark: 150 ethically and personally salient questions, sourced from in-the-wild chat transcripts and faith-community contributors, paired with an LLM-as-judge rubric that gives full credit for any mention of a religion, a religious practice, or a religious leader. The questions are not themselves about religion; they are open-ended questions about grief, forgiveness, relationships, purpose, and honesty, where religion is one valuable perspective among several. We also run a human-subjects survey to compare LLM behavior against human expectations. Evaluating 27 models, we find that LLMs consistently underrepresent religion relative to human expectations. The omission is asymmetric: models invoke religion more readily for abstract existential questions (meaning, death, truth) than for the practical personal situations (grief, marriage, family conflict, addiction) where many people most rely on it. It is not our purpose to adjudicate which values LLMs should hold. We argue, more modestly, that current LLM responses overlook critical opportunities to reflect religious frameworks that many people draw on when navigating personal and ethical challenges.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript introduces the AllFaith Religious Representation Benchmark, consisting of 150 ethically and personally salient questions (sourced from in-the-wild chat transcripts and faith-community contributors), to measure whether LLMs invoke religious perspectives when responding to everyday ethical questions where religion is one relevant framework among others. An LLM-as-judge rubric awards credit for any mention of a religion, a religious practice, or a religious leader. The authors also conduct a human-subjects survey to establish an external baseline and evaluate 27 models, claiming consistent underrepresentation of religion relative to human expectations, with an asymmetric pattern (more invocation on abstract existential questions than on practical personal ones such as grief or family conflict).

Significance. If the benchmark construction and human baseline prove robust, the work identifies a previously under-examined dimension of value alignment—omissive bias toward religious frameworks—that affects a large user population. The asymmetry result, if replicable, would point to concrete failure modes in how current models handle personal versus existential queries and supply a reusable benchmark for future mitigation studies.

major comments (3)
  1. [Benchmark description] AllFaith benchmark construction: the central underrepresentation claim requires that the 150 questions are representative of situations in which religious perspectives are commonly valued by users, yet the sourcing method (chat transcripts plus faith-community contributors) is described only at a high level with no explicit selection criteria, bias checks, or validation that the set is not already tilted toward religiously salient contexts.
  2. [Human-subjects survey] Human survey baseline: the comparison to 'human expectations' is load-bearing for the headline result, but the abstract supplies no information on sample size, participant demographics, recruitment, quantification of expectations, or inter-rater reliability for the abstract-vs-practical classification; without these the measured gap cannot be interpreted.
  3. [Results and analysis] Asymmetry claim: the reported difference between abstract existential and practical personal questions depends on a stable, reliable partition of the 150 items, yet no pre-registration, classification protocol, or agreement statistics are mentioned, directly undermining the asymmetry finding.
minor comments (1)
  1. [Abstract] The abstract would be strengthened by including at least one concrete numerical result (e.g., average invocation rate or effect size) to convey the magnitude of the reported omission.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments on the AllFaith benchmark, human survey, and asymmetry analysis. We will revise the manuscript to provide greater methodological transparency while maintaining the core findings.

Point-by-point responses
  1. Referee: [Benchmark description] AllFaith benchmark construction: the central underrepresentation claim requires that the 150 questions are representative of situations in which religious perspectives are commonly valued by users, yet the sourcing method (chat transcripts plus faith-community contributors) is described only at a high level with no explicit selection criteria, bias checks, or validation that the set is not already tilted toward religiously salient contexts.

    Authors: We agree that the current description is high-level and will strengthen it in revision. The revised manuscript will include a detailed 'Benchmark Construction' subsection specifying: selection criteria for chat transcripts (e.g., filtering for personal/ethical topics from public sources with privacy considerations), contributor instructions for faith communities to suggest questions where religion is relevant but not central, bias checks such as diversity in question topics and avoidance of leading phrasing, and validation through review by an independent panel or pilot user study to confirm representativeness. This addresses the concern directly. revision: yes

  2. Referee: [Human-subjects survey] Human survey baseline: the comparison to 'human expectations' is load-bearing for the headline result, but the abstract supplies no information on sample size, participant demographics, recruitment, quantification of expectations, or inter-rater reliability for the abstract-vs-practical classification; without these the measured gap cannot be interpreted.

    Authors: The abstract indeed omits these details for brevity. The full manuscript's methods section describes the survey, but to improve accessibility, we will update the abstract with a concise summary of key parameters and expand the main text with full information on sample size, demographics, recruitment platform and strategy, how expectations were quantified (e.g., proportion of humans expecting religious content in responses), and any reliability measures for the abstract vs practical classification. We will also report limitations if certain metrics like inter-rater reliability were not computed. revision: yes

  3. Referee: [Results and analysis] Asymmetry claim: the reported difference between abstract existential and practical personal questions depends on a stable, reliable partition of the 150 items, yet no pre-registration, classification protocol, or agreement statistics are mentioned, directly undermining the asymmetry finding.

    Authors: We recognize that transparency on the classification is essential. In the revision, we will add a description of the classification protocol (criteria for 'abstract existential' vs 'practical personal'), report agreement statistics from multiple independent coders (e.g., Cohen's kappa), include the full categorized question list in the appendix or supplementary materials, and acknowledge the lack of pre-registration as a limitation of the study. These additions will allow readers to evaluate the stability of the asymmetry result. revision: yes
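Cohen's kappa, the agreement statistic named in the third response, corrects raw agreement between two coders for the agreement expected by chance. A minimal sketch follows, assuming two coders each assign one label per question. The label values and example lists are hypothetical and do not reflect the paper's actual coding.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two coders labeling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and
    p_e is the chance agreement implied by each coder's label frequencies.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when both coders use a single label")
    return (observed - expected) / (1 - expected)


# Hypothetical codings of four questions, for illustration only.
coder_a = ["abstract", "abstract", "practical", "practical"]
coder_b = ["abstract", "practical", "practical", "practical"]
kappa = cohens_kappa(coder_a, coder_b)
```

Here the coders agree on three of four items (0.75), chance agreement is 0.5, and kappa is 0.5. Reporting kappa alongside the categorized question list would let readers judge how stable the abstract-versus-practical partition is.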

Circularity Check

0 steps flagged

No circularity: benchmark and human survey are independent external references with no self-referential reduction.

Full rationale

The paper constructs the AllFaith benchmark from in-the-wild transcripts and faith-community contributors, applies an LLM-as-judge rubric that credits any religious mention, and compares outputs to a separate human-subjects survey. The central claim of underrepresentation and asymmetry is measured against these external baselines rather than derived from fitted parameters, self-citations, or definitions that presuppose the result. No equations, uniqueness theorems, or ansatzes appear; the derivation chain does not reduce to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no methodological details on free parameters, axioms, or invented entities are available.

pith-pipeline@v0.9.1-grok · 5904 in / 1066 out tokens · 31787 ms · 2026-06-30T14:00:40.954344+00:00 · methodology

