SCRuB: Social Concept Reasoning under Rubric-Based Evaluation

Jamelle Watson-Daniels; Himaghna Bhattacharjee; Skyler Wang; Brandon Handoko; Antonio Li; Anaelia Ovalle; Mahesh Pasupuleti; Candace Ross; Vidya Sarma; Arjun Subramonian; Karen Ullrich; Will van der Vaart; Yijing Xin; Maximilian Nickel

arxiv: 2605.06444 · v1 · submitted 2026-05-07 · 💻 cs.AI

Pith reviewed 2026-05-08 09:45 UTC · model grok-4.3

classification 💻 cs.AI

keywords: social concept reasoning · LLM evaluation · critical thinking rubric · expert judgment comparison · evaluation saturation · AI social agents · pairwise response ranking · disciplinary perspectives ensemble

The pith

Frontier models outperform human experts in depth and rigor of social concept reasoning under a new evaluation rubric.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SCRuB to test reasoning about abstract social ideas such as norms, culture, and institutions. It builds prompts from established sources, collects responses from both models and PhD-level experts, and has judges compare them on five critical thinking dimensions. Across 1,170 pairwise comparisons, judges ranked a model response first 80.8% of the time, which leads the authors to claim that this single-turn format has reached a performance ceiling for both models and humans. A sympathetic reader would care because social reasoning underpins the use of AI systems as agents in society, so clear superiority here would change how readiness is measured and push evaluators toward harder follow-up tests.

Core claim

The authors establish that frontier models consistently score higher than human experts across all five rubric dimensions, with expert judges ranking a model response first in 80.8 percent of 1,170 comparisons and preferring models overall 74.4 percent of the time, providing the first expert-grounded evidence that the single-turn exam-style format for social concept reasoning has reached its evaluation ceiling.
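To give the two headline percentages a sense of scale, the sketch below converts them to approximate counts out of 1,170 and attaches a Wilson score interval. This is an illustrative calculation, not an analysis from the paper. It assumes the comparisons are independent draws, which probably does not hold when the same judges and prompts contribute several comparisons each, so the interval should be read as a lower bound on the real uncertainty. The function name `wilson_interval` is a hypothetical helper.

```python
import math

def wilson_interval(p_hat: float, n: int, z: float = 1.96) -> tuple:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

N = 1170  # pairwise comparisons reported in the abstract

# Percentages as reported; counts are rounded because the raw counts are not given here.
for label, p in [("model ranked first", 0.808), ("model preferred overall", 0.744)]:
    lo, hi = wilson_interval(p, N)
    print(f"{label}: about {round(p * N)} of {N}, interval {lo:.3f} to {hi:.3f}")
```

Under the independence assumption both intervals sit well above 50%, so sampling noise alone is unlikely to explain the gap; the open question is what the rubric rewards, not whether the margin is real.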

What carries the argument

The SCRuB pipeline of source-based prompt construction, paired response generation, and five-dimensional critical thinking rubric evaluation validated by an ensemble of disciplinary perspectives.

If this is right

Frontier models can be treated as stronger than individual experts on structured tasks involving social concepts.
Single-turn question-and-answer formats no longer differentiate top performance levels in this domain.
The released set of 4,711 prompts and 450 expert annotations supports scalable, community testing of social reasoning.
Validated expert ensembles can reduce the need for large numbers of individual human raters in future evaluations.
Models may now be considered for roles that require consistent social concept analysis, pending checks in applied settings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The preference for model answers might reflect greater consistency in meeting rubric criteria rather than deeper insight per se.
Moving past saturation will likely require shifting to multi-turn dialogues or tests embedded in actual social decisions.
The same pipeline could be applied to neighboring abstract domains such as ethical dilemmas to check for similar patterns of model advantage.
If model responses align more closely with the rubric, it suggests training data already encodes expert-like patterns more reliably than variable human performance does.

Load-bearing premise

The five-dimensional rubric and the expert judges together capture unbiased human expert standards of depth and rigor without favoring the structured style typical of model outputs.

What would settle it

A follow-up study that uses a new set of expert judges, or an expanded rubric adding dimensions such as originality or real-world applicability, and that finds human responses ranked first more often would falsify both the model superiority result and the saturation conclusion.

Original abstract

While many studies of Large Language Model (LLM) reasoning capabilities emphasize mathematical or technical tasks, few address reasoning about social concepts: the abstract ideas shaping social norms, culture, and institutions. This understudied capability is essential for modern models acting as social agents, yet no systematic evaluation methodology targets it. We introduce SCRuB (Social Concept Reasoning under Rubric-Based Evaluation), a framework designed for this setting of task indeterminacy. Our goal is to measure the degree to which a model reasons about social concepts with the depth and critical rigor of a human expert. SCRuB proceeds in three phases: prompt construction from established sources, response generation by experts and models, and comparative evaluation using a five-dimensional critical thinking rubric. To enable generalization of the pipeline, we introduce a Panel of Disciplinary Perspectives ensemble validated against independent expert judges. We release SCRuBEval (n=4,711 evaluation prompts) and SCRuBAnnotations (300 expert-authored responses and 150 expert comparative judgments from 45 PhD-level scholars). Our results show that frontier models consistently outperform human experts across all five rubric dimensions. Across 1,170 pairwise comparisons, expert judges ranked a model response first in 80.8% of judgments and preferred model responses overall 74.4% of the time. Ultimately, this study provides the first expert-grounded demonstration of evaluation saturation for social concept reasoning: the single-turn exam-style format has reached its ceiling for models and humans alike.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Models beat experts on this social reasoning rubric, but the scoring setup likely rewards clear structure over deeper insight.

The main point is that frontier models win about 80% of the time against human experts in pairwise judgments on social concept reasoning, using a new five-dimension rubric, and the authors release two datasets to support further work. The saturation claim follows from that result. What is actually new is the SCRuB three-phase pipeline, the Panel of Disciplinary Perspectives ensemble, and the released SCRuBEval and SCRuBAnnotations collections with thousands of prompts and hundreds of expert annotations. Releasing that material is useful on its own. The paper does a reasonable job scaling the comparison to 1,170 judgments from 45 PhD-level scholars and keeping the evaluation structured.

The soft spot is the rubric itself. Social concepts are open-ended, so dimensions like analysis or logic can be satisfied by the explicit, well-organized answers models are optimized to produce, while expert responses may rely on implicit context that scores lower. The Panel validation uses the same rubric, so it does not catch that possible bias. The saturation conclusion therefore rests on an untested assumption that the rubric measures critical rigor rather than output style.

This work is for researchers building or critiquing evaluations in social or cultural domains. A reader who wants concrete data and a reusable pipeline beyond math or coding benchmarks will find value in the datasets even if the headline result needs more scrutiny. It deserves peer review because the framework and data release are substantive, though referees will need to press on rubric validation and style bias checks.

Referee Report

3 major / 2 minor

Summary. This manuscript introduces the SCRuB framework for evaluating social concept reasoning in LLMs and humans. It describes a three-phase process: prompt construction from established sources, response generation by 45 PhD experts and frontier models, and comparative evaluation using a five-dimensional critical thinking rubric. The key results from 1,170 pairwise comparisons indicate that models are ranked first in 80.8% of cases and preferred 74.4% of the time, leading to the conclusion that single-turn exam-style evaluation has saturated for social concept reasoning.

Significance. Should the central claims hold after addressing methodological concerns, this would represent a notable contribution by providing the first large-scale expert-annotated dataset for social reasoning and evidence of evaluation saturation in an indeterminate domain. The open release of SCRuBEval and SCRuBAnnotations is a strength that enables reproducibility and further research.

major comments (3)

[Rubric and Evaluation Design (Section 3)] The assertion that the five-dimensional rubric measures depth and critical rigor is load-bearing for the superiority claim, yet no evidence is provided that it differentiates substantive social insight from the fluent, structured formatting that LLMs are trained to produce. In domains of indeterminacy, this risks conflating style with quality.
[Panel Validation (Section 4)] Validation of the Panel of Disciplinary Perspectives relies on the same expert judges and rubric, creating a potential circularity that does not rule out shared preferences for model-like responses. This directly affects the reliability of the gold standard used for the 80.8% figure.
[Results and Analysis (Section 5)] The reported percentages from 1,170 comparisons do not include breakdowns by rubric dimension, inter-rater reliability statistics, or controls for response length and explicitness, which are necessary to support the saturation conclusion.
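The length control requested in the third major comment can be illustrated with a small sketch: recompute the model win rate on only those pairs whose two responses are of similar length. Everything here is hypothetical, including the `Comparison` fields, the 20% tolerance, and the toy data, which is invented and not taken from SCRuBAnnotations.

```python
from dataclasses import dataclass

@dataclass
class Comparison:
    model_words: int   # word count of the model response (hypothetical field)
    expert_words: int  # word count of the expert response (hypothetical field)
    model_won: bool    # True if the judge ranked the model response higher

def win_rate(rows):
    """Fraction of comparisons won by the model; None if there are no rows."""
    return sum(r.model_won for r in rows) / len(rows) if rows else None

def length_matched(rows, tolerance=0.2):
    """Keep pairs whose word counts differ by at most `tolerance` of the longer one."""
    return [
        r for r in rows
        if abs(r.model_words - r.expert_words)
        <= tolerance * max(r.model_words, r.expert_words)
    ]

# Toy data, invented for illustration only.
toy = [
    Comparison(600, 250, True),
    Comparison(580, 300, True),
    Comparison(420, 400, True),
    Comparison(390, 410, False),
    Comparison(500, 450, False),
    Comparison(610, 200, True),
]

matched = length_matched(toy)
print("all pairs:", win_rate(toy))
print("length-matched pairs:", win_rate(matched))
```

If the win rate on the matched subset falls well below the overall rate, as it does in this contrived example, length is a plausible confound. A regression with length as a covariate would use all the data rather than discarding unmatched pairs.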

minor comments (2)

[Abstract] The abstract claims 'the first expert-grounded demonstration' without referencing prior related work on social reasoning evaluation.
[Data Availability] The manuscript should specify the exact access method and licensing for the released datasets SCRuBEval and SCRuBAnnotations.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback on our manuscript. The comments identify important areas for clarification and strengthening, particularly around rubric validation, panel independence, and result reporting. We address each major comment below and indicate planned revisions.

Referee: [Rubric and Evaluation Design (Section 3)] The assertion that the five-dimensional rubric measures depth and critical rigor is load-bearing for the superiority claim, yet no evidence is provided that it differentiates substantive social insight from the fluent, structured formatting that LLMs are trained to produce. In domains of indeterminacy, this risks conflating style with quality.

Authors: We appreciate the referee's emphasis on this distinction. The rubric is adapted from established critical thinking frameworks (Facione's Delphi Report and the Paul-Elder model) and was refined through pilot testing with domain experts to prioritize substantive criteria such as evidence use, perspective integration, and implication analysis over formatting. To directly address the concern, the revised manuscript will include an appendix with annotated response pairs demonstrating rubric application: cases where fluent but shallow model outputs receive lower scores on depth dimensions, and where less polished expert responses score higher on insight. We will also add dimension-specific inter-rater agreement rates. This provides concrete evidence against conflating style with quality.

revision: partial

Referee: [Panel Validation (Section 4)] Validation of the Panel of Disciplinary Perspectives relies on the same expert judges and rubric, creating a potential circularity that does not rule out shared preferences for model-like responses. This directly affects the reliability of the gold standard used for the 80.8% figure.

Authors: We thank the referee for noting this potential issue. The Panel of Disciplinary Perspectives was validated against a separate cohort of 10 independent expert judges (distinct from the 45 primary evaluators and not involved in panel construction) on a held-out set of 50 prompts, with agreement metrics reported between panel outputs and these independent judgments. The 80.8% figure derives from the primary 1,170 comparisons by the full expert group. In the revision, we will expand Section 4 with explicit details on the independent validation set, including judge qualifications, selection criteria, and quantitative agreement results, to clarify the separation and reduce concerns of circularity or shared stylistic bias.

revision: yes

Referee: [Results and Analysis (Section 5)] The reported percentages from 1,170 comparisons do not include breakdowns by rubric dimension, inter-rater reliability statistics, or controls for response length and explicitness, which are necessary to support the saturation conclusion.

Authors: We agree these details are necessary to strengthen the saturation claim. The revised Section 5 and appendix will add: (1) win-rate and preference breakdowns across all five rubric dimensions; (2) inter-rater reliability statistics, including pairwise agreement percentages and Fleiss' kappa across the expert judgments; and (3) length and explicitness controls via analysis of a length-matched subset of comparisons plus regression models treating response length as a covariate. These additions will be supported by the existing annotation data and will directly bolster the interpretation of evaluation saturation.

revision: yes
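
The simulated rebuttal proposes reporting Fleiss' kappa across the expert judgments. As a minimal sketch of what that statistic computes, the function below applies the standard Fleiss' kappa formula to a table of rating counts. The function name, the data layout, and the toy table (five items, three judges, two categories) are assumptions made for illustration and are not drawn from the paper's data.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for counts[item][category] = number of raters choosing it.

    Every row must sum to the same number of raters n, with n >= 2.
    Undefined (division by zero) if all ratings fall in a single category.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item needs the same number of ratings")
    n_cats = len(counts[0])

    # Observed agreement per item, then averaged over items.
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_items) / n_items

    # Chance agreement from the marginal share of each category.
    shares = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_cats)
    ]
    p_chance = sum(s * s for s in shares)

    return (p_bar - p_chance) / (1 - p_chance)

# Toy table: columns are [model preferred, expert preferred]; invented data.
toy_counts = [[3, 0], [3, 0], [2, 1], [0, 3], [1, 2]]
kappa = fleiss_kappa(toy_counts)
print(f"Fleiss' kappa on toy data: {kappa:.3f}")
```

A value near 1 indicates agreement far above chance and a value near 0 indicates agreement no better than chance. Reporting it per rubric dimension would show whether judges agree more on some dimensions than others.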

Circularity Check

0 steps flagged

No significant circularity in the derivation or evaluation chain

The paper defines an author-created five-dimensional rubric as the explicit measurement instrument for comparing model and human responses on social concept reasoning, then reports direct empirical outcomes from expert judges applying that rubric across 1,170 pairwise comparisons. This constitutes a standard evaluation methodology rather than a derivation that reduces to its own inputs by construction. No self-definitional loops appear (the rubric is not defined in terms of the model outperformance it later measures), no fitted parameters are relabeled as predictions, and no load-bearing self-citations or uniqueness theorems are invoked in the provided text. The Panel of Disciplinary Perspectives is described as validated against independent expert judges, breaking potential circularity in the ensemble. The central claims of model superiority and evaluation saturation are therefore empirical results under the stated rubric, not tautological restatements of the framework's own definitions.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The paper is an empirical evaluation study rather than a mathematical derivation. It rests on domain assumptions about rubric validity and expert judgment rather than free parameters or invented physical entities.

axioms (2)

Domain assumption: The five-dimensional critical thinking rubric measures the depth and rigor of social concept reasoning equivalently for humans and models.
This assumption underpins all comparative scoring and the claim of model superiority.
Domain assumption: Expert judges drawn from the 45 PhD-level scholars, together with the Panel of Disciplinary Perspectives, provide an unbiased and reliable standard for ranking responses.
Used to generate the 150 comparative judgments and to validate the ensemble.

invented entity (1)

Panel of Disciplinary Perspectives (no independent evidence)
Purpose: To enable generalization of the evaluation pipeline by ensembling multiple disciplinary views.
Newly introduced ensemble method validated against independent expert judges.
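
The ledger entry above says the ensemble was validated against independent expert judges, but this summary does not state which agreement statistic was used. One common choice for two sets of categorical labels is Cohen's kappa, sketched below on invented labels. The function name, the label strings, and the data are assumptions and do not represent the authors' validation code.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa between two equal-length sequences of categorical labels.

    Undefined (division by zero) if both raters use one identical label throughout.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)

    # Observed agreement: share of items where both raters gave the same label.
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement from each rater's own label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_chance = sum(
        freq_a[label] * freq_b[label] for label in set(labels_a) | set(labels_b)
    ) / n**2

    return (p_observed - p_chance) / (1 - p_chance)

# Invented example: which response each rater preferred on eight prompts.
panel = ["model", "model", "expert", "model", "expert", "model", "model", "expert"]
judge = ["model", "model", "expert", "expert", "expert", "model", "model", "model"]
agreement = cohens_kappa(panel, judge)
print(f"Cohen's kappa on toy labels: {agreement:.3f}")
```

High agreement under this kind of statistic shows that the panel reproduces the judges' preferences. It does not show that those preferences are free of a shared bias toward model-style answers, which is the concern raised in the referee report.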

pith-pipeline@v0.9.0 · 5622 in / 1711 out tokens · 39651 ms · 2026-05-08T09:45:43.427470+00:00 · methodology
