{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2021:3FY2AOQXS2OUDBRMFLBJI4NJPB","short_pith_number":"pith:3FY2AOQX","schema_version":"1.0","canonical_sha256":"d971a03a17969d41862c2ac29471a9785ed277cfbff812406694dd2deb4ed2e9","source":{"kind":"arxiv","id":"2110.08193","version":2},"attestation_state":"computed","paper":{"title":"BBQ: A Hand-Built Bias Benchmark for Question Answering","license":"http://creativecommons.org/licenses/by/4.0/","headline":"Question answering models rely on social stereotypes, showing higher accuracy when correct answers align with biases.","cross_cats":[],"primary_cat":"cs.CL","authors_text":"Alicia Parrish, Angelica Chen, Jana Thompson, Jason Phang, Nikita Nangia, Phu Mon Htut, Samuel R. Bowman, Vishakh Padmakumar","submitted_at":"2021-10-15T16:43:46Z","abstract_excerpt":"It is well documented that NLP models learn social biases, but little work has been done on how these biases manifest in model outputs for applied tasks like question answering (QA). We introduce the Bias Benchmark for QA (BBQ), a dataset of question sets constructed by the authors that highlight attested social biases against people belonging to protected classes along nine social dimensions relevant for U.S. English-speaking contexts. Our task evaluates model responses at two levels: (i) given an under-informative context, we test how strongly responses reflect social biases, and (ii) given "},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":true},"canonical_record":{"source":{"id":"2110.08193","kind":"arxiv","version":2},"metadata":{"license":"http://creativecommons.org/licenses/by/4.0/","primary_cat":"cs.CL","submitted_at":"2021-10-15T16:43:46Z","cross_cats_sorted":[],"title_canon_sha256":"b6cb76fc550908048a9e2f435ac156c2f5acb9ce960aeaea505b7cd0c834ae47","abstract_canon_sha256":"5680f90f58226f4faab044ff3bf59bd8a204c421364cd67b536c97895efb0746"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:38:46.773165Z","signature_b64":"Z7A6d1iNtMVs3+0fEenFVEcINV6sRbHYqfLN8cF1jts5YmvU5vrxaJptJ2MTEhAHaBSRBYvbWVyvdhM90qUxDg==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"d971a03a17969d41862c2ac29471a9785ed277cfbff812406694dd2deb4ed2e9","last_reissued_at":"2026-05-17T23:38:46.772647Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:38:46.772647Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"BBQ: A Hand-Built Bias Benchmark for Question Answering","license":"http://creativecommons.org/licenses/by/4.0/","headline":"Question answering models rely on social stereotypes, showing higher accuracy when correct answers align with biases.","cross_cats":[],"primary_cat":"cs.CL","authors_text":"Alicia Parrish, Angelica Chen, Jana Thompson, Jason Phang, Nikita Nangia, Phu Mon Htut, Samuel R. Bowman, Vishakh Padmakumar","submitted_at":"2021-10-15T16:43:46Z","abstract_excerpt":"It is well documented that NLP models learn social biases, but little work has been done on how these biases manifest in model outputs for applied tasks like question answering (QA). We introduce the Bias Benchmark for QA (BBQ), a dataset of question sets constructed by the authors that highlight attested social biases against people belonging to protected classes along nine social dimensions relevant for U.S. English-speaking contexts. Our task evaluates model responses at two levels: (i) given an under-informative context, we test how strongly responses reflect social biases, and (ii) given "},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"Models often rely on stereotypes when the context is under-informative, meaning the model's outputs consistently reproduce harmful biases in this setting. Though models are more accurate when the context provides an informative answer, they still rely on stereotypes and average up to 3.4 percentage points higher accuracy when the correct answer aligns with a social bias than when it conflicts, with this difference widening to over 5 points on examples targeting gender for most models tested.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"The hand-constructed questions accurately capture attested real-world social biases in U.S. English contexts without introducing artificial patterns that models exploit differently from natural text.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"BBQ is a new benchmark dataset showing that QA models often default to social stereotypes, achieving up to 3.4 points higher accuracy when the correct answer aligns with bias.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Question answering models rely on social stereotypes, showing higher accuracy when correct answers align with biases.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"048e92a8e384cc427a14308a8fcd7558bd54574c9e7f23c12a9e821594140102"},"source":{"id":"2110.08193","kind":"arxiv","version":2},"verdict":{"id":"f963f81b-196a-494d-94eb-043130986284","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-16T19:52:09.470499Z","strongest_claim":"Models often rely on stereotypes when the context is under-informative, meaning the model's outputs consistently reproduce harmful biases in this setting. Though models are more accurate when the context provides an informative answer, they still rely on stereotypes and average up to 3.4 percentage points higher accuracy when the correct answer aligns with a social bias than when it conflicts, with this difference widening to over 5 points on examples targeting gender for most models tested.","one_line_summary":"BBQ is a new benchmark dataset showing that QA models often default to social stereotypes, achieving up to 3.4 points higher accuracy when the correct answer aligns with bias.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"The hand-constructed questions accurately capture attested real-world social biases in U.S. English contexts without introducing artificial patterns that models exploit differently from natural text.","pith_extraction_headline":"Question answering models rely on social stereotypes, showing higher accuracy when correct answers align with biases."},"references":{"count":65,"sample":[{"doi":"","year":2009,"title":"Kevin Bartz. 2009. https://blogs.iq.harvard.edu/english_first_n English first names for chinese americans . Harvard University Social Science Statistics Blog. Accessed July 2021","work_id":"4bb48d73-eebb-4fa3-990a-ffbbb212015f","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2020,"title":"Su Lin Blodgett, Solon Barocas, Hal Daum \\'e III, and Hanna Wallach. 2020. https://aclanthology.org/2020.acl-main.485/ Language (technology) is power: A critical survey of\" bias\" in NLP . In Proceedin","work_id":"ba726d32-bf30-44df-8ad6-f0327957c9d3","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"10.1126/science.aal4230","year":2017,"title":"Semantics derived automatically from language corpora contain human-like biases","work_id":"5f928ccc-c92a-435c-a75a-67f4e14fecce","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2021,"title":"Jorida Cila, Richard N Lalonde, Joni Y Sasaki, Raymond A Mar, and Ronda F Lo. 2021. https://psycnet.apa.org/fulltext/2020-69298-001.html Zahra or Zoe , Arjun or Andrew ? Bicultural baby names reflect ","work_id":"4b779e1c-7a5e-4dd6-945d-655514f642b6","ref_index":7,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2018,"title":"Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge","work_id":"28ea1282-d657-4c61-a83c-f1249be6d6b1","ref_index":8,"cited_arxiv_id":"1803.05457","is_internal_anchor":true}],"resolved_work":65,"snapshot_sha256":"e6c883bd3746756bfcf318c74fb07ead25a2bc5151180bda13cc162581ad0f51","internal_anchors":3},"formal_canon":{"evidence_count":2,"snapshot_sha256":"af453d798209442993f4c73b1a6f8803233df7450bee0f27565cb71f3b64b34d"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2110.08193","created_at":"2026-05-17T23:38:46.772741+00:00"},{"alias_kind":"arxiv_version","alias_value":"2110.08193v2","created_at":"2026-05-17T23:38:46.772741+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2110.08193","created_at":"2026-05-17T23:38:46.772741+00:00"},{"alias_kind":"pith_short_12","alias_value":"3FY2AOQXS2OU","created_at":"2026-05-18T12:33:33.725879+00:00"},{"alias_kind":"pith_short_16","alias_value":"3FY2AOQXS2OUDBRM","created_at":"2026-05-18T12:33:33.725879+00:00"},{"alias_kind":"pith_short_8","alias_value":"3FY2AOQX","created_at":"2026-05-18T12:33:33.725879+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":24,"internal_anchor_count":24,"sample":[{"citing_arxiv_id":"2312.11805","citing_title":"Gemini: A Family of Highly Capable Multimodal Models","ref_index":72,"is_internal_anchor":true},{"citing_arxiv_id":"2401.04088","citing_title":"Mixtral of Experts","ref_index":24,"is_internal_anchor":true},{"citing_arxiv_id":"2412.16720","citing_title":"OpenAI o1 System Card","ref_index":2,"is_internal_anchor":true},{"citing_arxiv_id":"2504.16155","citing_title":"PRIMETIME : Limits of LLMs in Temporal Primitives","ref_index":33,"is_internal_anchor":true},{"citing_arxiv_id":"2408.00724","citing_title":"Inference Scaling Laws: An Empirical Analysis of Compute-Optimal Inference for Problem-Solving with Language Models","ref_index":258,"is_internal_anchor":true},{"citing_arxiv_id":"2511.10287","citing_title":"OutSafe-Bench: A Benchmark for Multimodal Offensive Content Detection in Large Language Models","ref_index":39,"is_internal_anchor":true},{"citing_arxiv_id":"2601.03267","citing_title":"OpenAI GPT-5 System Card","ref_index":9,"is_internal_anchor":true},{"citing_arxiv_id":"2605.12530","citing_title":"In-Situ Behavioral Evaluation for LLM Fairness, Not Standardized-Test Scores","ref_index":7,"is_internal_anchor":true},{"citing_arxiv_id":"2604.03121","citing_title":"An Independent Safety Evaluation of Kimi K2.5","ref_index":62,"is_internal_anchor":true},{"citing_arxiv_id":"2604.06233","citing_title":"Blind Refusal: Language Models Refuse to Help Users Evade Unjust, Absurd, and Illegitimate Rules","ref_index":23,"is_internal_anchor":true},{"citing_arxiv_id":"2305.10403","citing_title":"PaLM 2 Technical Report","ref_index":109,"is_internal_anchor":true},{"citing_arxiv_id":"2605.10639","citing_title":"Navigating the Sea of LLM Evaluation: Investigating Bias in Toxicity Benchmarks","ref_index":23,"is_internal_anchor":true},{"citing_arxiv_id":"2604.22749","citing_title":"Representational Harms in LLM-Generated Narratives Against Global Majority Nationalities","ref_index":59,"is_internal_anchor":true},{"citing_arxiv_id":"2605.00382","citing_title":"Social Bias in LLM-Generated Code: Benchmark and Mitigation","ref_index":154,"is_internal_anchor":true},{"citing_arxiv_id":"2604.20677","citing_title":"Intersectional Fairness in Large Language Models","ref_index":25,"is_internal_anchor":true},{"citing_arxiv_id":"2604.12946","citing_title":"Parcae: Scaling Laws For Stable Looped Language Models","ref_index":61,"is_internal_anchor":true},{"citing_arxiv_id":"2206.07682","citing_title":"Emergent Abilities of Large Language Models","ref_index":64,"is_internal_anchor":true},{"citing_arxiv_id":"2604.13103","citing_title":"Fairness in Multi-Agent Systems for Software Engineering: An SDLC-Oriented Rapid Review","ref_index":46,"is_internal_anchor":true},{"citing_arxiv_id":"2604.08963","citing_title":"Aligned Agents, Biased Swarm: Measuring Bias Amplification in Multi-Agent Systems","ref_index":14,"is_internal_anchor":true},{"citing_arxiv_id":"2604.09034","citing_title":"The nextAI Solution to the NeurIPS 2023 LLM Efficiency Challenge","ref_index":33,"is_internal_anchor":true},{"citing_arxiv_id":"2207.05221","citing_title":"Language Models (Mostly) Know What They Know","ref_index":74,"is_internal_anchor":true},{"citing_arxiv_id":"2204.05862","citing_title":"Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback","ref_index":14,"is_internal_anchor":true},{"citing_arxiv_id":"2508.10925","citing_title":"gpt-oss-120b & gpt-oss-20b Model Card","ref_index":32,"is_internal_anchor":true},{"citing_arxiv_id":"2604.17008","citing_title":"BIASEDTALES-ML: A Multilingual Dataset for Analyzing Narrative Attribute Distributions in LLM-Generated Stories","ref_index":20,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":2,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/3FY2AOQXS2OUDBRMFLBJI4NJPB","json":"https://pith.science/pith/3FY2AOQXS2OUDBRMFLBJI4NJPB.json","graph_json":"https://pith.science/api/pith-number/3FY2AOQXS2OUDBRMFLBJI4NJPB/graph.json","events_json":"https://pith.science/api/pith-number/3FY2AOQXS2OUDBRMFLBJI4NJPB/events.json","paper":"https://pith.science/paper/3FY2AOQX"},"agent_actions":{"view_html":"https://pith.science/pith/3FY2AOQXS2OUDBRMFLBJI4NJPB","download_json":"https://pith.science/pith/3FY2AOQXS2OUDBRMFLBJI4NJPB.json","view_paper":"https://pith.science/paper/3FY2AOQX","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2110.08193&json=true","fetch_graph":"https://pith.science/api/pith-number/3FY2AOQXS2OUDBRMFLBJI4NJPB/graph.json","fetch_events":"https://pith.science/api/pith-number/3FY2AOQXS2OUDBRMFLBJI4NJPB/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/3FY2AOQXS2OUDBRMFLBJI4NJPB/action/timestamp_anchor","attest_storage":"https://pith.science/pith/3FY2AOQXS2OUDBRMFLBJI4NJPB/action/storage_attestation","attest_author":"https://pith.science/pith/3FY2AOQXS2OUDBRMFLBJI4NJPB/action/author_attestation","sign_citation":"https://pith.science/pith/3FY2AOQXS2OUDBRMFLBJI4NJPB/action/citation_signature","submit_replication":"https://pith.science/pith/3FY2AOQXS2OUDBRMFLBJI4NJPB/action/replication_record"}},"created_at":"2026-05-17T23:38:46.772741+00:00","updated_at":"2026-05-17T23:38:46.772741+00:00"}