arxiv: 2502.14739 · v4 · submitted 2025-02-20 · 💻 cs.CL

Recognition: 1 theorem link

· Lean Theorem

SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines

M-A-P Team , Xinrun Du , Yifan Yao , Kaijing Ma , Bingli Wang , Tianyu Zheng , King Zhu , Minghao Liu

show 89 more authors

Yiming Liang Xiaolong Jin Zhenlin Wei Chujie Zheng Kaixin Deng Shawn Gavin Shian Jia Sichao Jiang Yiyan Liao Rui Li Qinrui Li Sirun Li Yizhi Li Yunwen Li David Ma Yuansheng Ni Haoran Que Qiyao Wang Zhoufutu Wen Siwei Wu Tyshawn Hsing Ming Xu Zhenzhu Yang Zekun Moore Wang Junting Zhou Yuelin Bai Xingyuan Bu Chenglin Cai Liang Chen Yifan Chen Chengtuo Cheng Tianhao Cheng Keyi Ding Siming Huang Yun Huang Yaoru Li Yizhe Li Zhaoqun Li Tianhao Liang Chengdong Lin Hongquan Lin Yinghao Ma Tianyang Pang Zhongyuan Peng Zifan Peng Qige Qi Shi Qiu Xingwei Qu Shanghaoran Quan Yizhou Tan Zili Wang Chenqing Wang Hao Wang Yiya Wang Yubo Wang Jiajun Xu Kexin Yang Ruibin Yuan Yuanhao Yue Tianyang Zhan Chun Zhang Jinyang Zhang Xiyue Zhang Xingjian Zhang Yue Zhang Yongchi Zhao Xiangyu Zheng Chenghua Zhong Yang Gao Zhoujun Li Dayiheng Liu Qian Liu Tianyu Liu Shiwen Ni Junran Peng Yujia Qin Wenbo Su Guoyin Wang Shi Wang Jian Yang Min Yang Meng Cao Xiang Yue Zhaoxiang Zhang Wangchunshu Zhou Jiaheng Liu Qunshu Lin Wenhao Huang Ge Zhang

Authors on Pith no claims yet

Pith reviewed 2026-05-16 00:41 UTC · model grok-4.3

classification 💻 cs.CL

keywords LLM evaluationbenchmarkgraduate-level knowledgemulti-discipline assessmentHuman-LLM collaborationreasoning capabilitiesAI limitations

0 comments

The pith

SuperGPQA benchmark shows top LLMs reach only 61.82 percent accuracy across 285 graduate disciplines.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents SuperGPQA as a new evaluation set that covers graduate-level knowledge and reasoning in 285 disciplines, far beyond the usual focus on mathematics, physics, and computer science. It relies on a Human-LLM collaborative filtering process that iteratively removes trivial or ambiguous questions using both model answers and expert review. Results indicate that even the strongest current models, such as the reasoning-focused DeepSeek-R1, achieve at most 61.82 percent accuracy. This gap matters because it quantifies how far current systems remain from expert performance in specialized areas like agriculture, light industry, and service fields. The work also shares practical lessons from managing a large annotation effort involving over eighty experts.

Core claim

SuperGPQA is a benchmark spanning 285 disciplines that uses a Human-LLM collaborative filtering mechanism to produce high-quality graduate-level questions; evaluations of state-of-the-art LLMs on this set reach a maximum accuracy of 61.82 percent, demonstrating substantial room for improvement relative to artificial general intelligence.

What carries the argument

The Human-LLM collaborative filtering mechanism, which iteratively refines candidate questions by combining LLM responses with expert feedback to eliminate trivial or ambiguous items.

If this is right

LLMs still exhibit large performance gaps in specialized disciplines outside mainstream academic areas.
The benchmark supplies a concrete metric for tracking progress toward broader expert-level capabilities.
Insights from the eighty-expert annotation process can guide the design of future large-scale evaluation efforts.
Discipline-by-discipline score differences can highlight which knowledge areas require targeted model improvements.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Training data for future models will likely need to incorporate more material from underrepresented fields such as agriculture and service disciplines to raise scores.
Real-world deployment in specialized professional settings may remain unreliable until accuracy on this type of benchmark rises well above 80 percent.
The filtering approach could be adapted to create similar graduate-level tests in languages other than English or for professional certification exams.

Load-bearing premise

The Human-LLM collaborative filtering process produces questions that are genuinely graduate-level, unambiguous, and representative of each discipline without introducing selection bias.

What would settle it

Independent expert review of a random sample of SuperGPQA questions finds that a large fraction can be answered correctly by undergraduates or contain hidden ambiguities that allow multiple valid answers.

read the original abstract

Large language models (LLMs) have demonstrated remarkable proficiency in mainstream academic disciplines such as mathematics, physics, and computer science. However, human knowledge encompasses over 200 specialized disciplines, far exceeding the scope of existing benchmarks. The capabilities of LLMs in many of these specialized fields-particularly in light industry, agriculture, and service-oriented disciplines-remain inadequately evaluated. To address this gap, we present SuperGPQA, a comprehensive benchmark that evaluates graduate-level knowledge and reasoning capabilities across 285 disciplines. Our benchmark employs a novel Human-LLM collaborative filtering mechanism to eliminate trivial or ambiguous questions through iterative refinement based on both LLM responses and expert feedback. Our experimental results reveal significant room for improvement in the performance of current state-of-the-art LLMs across diverse knowledge domains (e.g., the reasoning-focused model DeepSeek-R1 achieved the highest accuracy of 61.82% on SuperGPQA), highlighting the considerable gap between current model capabilities and artificial general intelligence. Additionally, we present comprehensive insights from our management of a large-scale annotation process, involving over 80 expert annotators and an interactive Human-LLM collaborative system, offering valuable methodological guidance for future research initiatives of comparable scope.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

SuperGPQA reaches 285 disciplines via Human-LLM filtering and shows top models at 62%, but lacks the stats needed to confirm the questions are solid graduate level.

read the letter

The main thing to know is that this paper scales a GPQA-style benchmark to 285 disciplines and reports that even the strongest model, DeepSeek-R1, only reaches 61.82% accuracy. The Human-LLM iterative filtering is the key new piece, meant to drop trivial or unclear items through repeated LLM checks and expert review. They ran this with over 80 annotators, which is a real operational lift for a benchmark paper. The coverage into agriculture, light industry, and service fields is the clearest advance over prior tests that stay inside math, physics, and CS. That breadth could actually shift where people look for capability gaps if the questions hold up. The abstract gives a clean headline number and some notes on running large annotation, which is useful for anyone planning similar work. The soft spots sit in the validation details. No error bars, no inter-annotator agreement figures, no rejection-rate tables by stage or discipline, and no sample of kept versus discarded questions. The stress-test concern is fair here: if the loop mostly removes items LLMs already get right rather than fixing factual or level problems, the low scores could overstate the gap. Without those diagnostics the central claim stays plausible but not yet tight. This is for groups that track broad LLM knowledge maps or run their own large-scale eval projects. A reader focused on specialized domains or annotation pipelines will find the scale and process notes worth seeing once the full dataset and stats appear. It deserves peer review because the ambition is genuine and the method extends existing work in a straightforward way; referees can ask for the missing agreement numbers and a few example questions without killing the contribution.

Referee Report

2 major / 2 minor

Summary. The paper introduces SuperGPQA, a benchmark evaluating LLM graduate-level knowledge and reasoning across 285 disciplines. It describes a Human-LLM collaborative filtering process to remove trivial or ambiguous questions via iterative refinement with LLM responses and expert feedback, reports that the best model (DeepSeek-R1) reaches only 61.82% accuracy, and provides methodological insights from managing annotation with over 80 experts.

Significance. If the filtering successfully yields unambiguous graduate-level items representative of the 285 disciplines, the benchmark would meaningfully expand evaluation beyond mainstream fields and document a substantial capability gap. The large-scale annotation management details could also inform future benchmark construction efforts.

major comments (2)

[Abstract / Filtering mechanism] Abstract and methods description of the Human-LLM collaborative filtering: no quantitative diagnostics are supplied (rejection rates per stage or discipline, inter-expert agreement, pre/post-filtering accuracy curves, or examples of discarded vs. retained items). This is load-bearing for the central claim that retained questions are genuinely graduate-level and that the 61.82% ceiling reflects a true AGI gap rather than residual ambiguity or selection bias.
[Results] Results section reporting model accuracies: headline figures such as DeepSeek-R1 at 61.82% are given without error bars, confidence intervals, or per-discipline variance. This prevents assessment of whether observed differences across models or fields are statistically reliable.

minor comments (2)

[Abstract] Clarify the exact count of disciplines (abstract states 'over 200' while title and body use 285) and provide a breakdown of coverage by broad field.
Include at least one or two sample questions per major discipline cluster to allow readers to judge graduate-level difficulty directly.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below and describe the revisions we will incorporate to improve the manuscript.

read point-by-point responses

Referee: [Abstract / Filtering mechanism] Abstract and methods description of the Human-LLM collaborative filtering: no quantitative diagnostics are supplied (rejection rates per stage or discipline, inter-expert agreement, pre/post-filtering accuracy curves, or examples of discarded vs. retained items). This is load-bearing for the central claim that retained questions are genuinely graduate-level and that the 61.82% ceiling reflects a true AGI gap rather than residual ambiguity or selection bias.

Authors: We agree that the current description would benefit from additional quantitative support. In the revised manuscript we will add rejection rates per filtering stage and per discipline, inter-expert agreement statistics (including percentage agreement and Cohen’s kappa where multiple experts reviewed the same items), pre- and post-filtering accuracy curves for the LLMs used in the process, and representative examples of discarded versus retained questions. These additions will directly substantiate that the retained items are graduate-level and reduce concerns about residual ambiguity or selection bias. revision: yes
Referee: [Results] Results section reporting model accuracies: headline figures such as DeepSeek-R1 at 61.82% are given without error bars, confidence intervals, or per-discipline variance. This prevents assessment of whether observed differences across models or fields are statistically reliable.

Authors: We acknowledge that statistical reliability measures are important for interpreting the results. In the revision we will report 95% binomial confidence intervals for all headline accuracy figures and include error bars on the main result plots. We will also add per-discipline standard deviations and, for disciplines with sufficient question counts, per-discipline error bars. These changes will allow readers to assess the reliability of differences across models and fields. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmark with no derivations or fitted predictions

full rationale

The paper constructs SuperGPQA via Human-LLM collaborative filtering and reports direct empirical accuracies (e.g., DeepSeek-R1 at 61.82%). No equations, parameters, or predictions exist that reduce to inputs by construction. The filtering process is presented as a methodological contribution without self-referential loops, uniqueness theorems, or ansatzes. Central claims rest on the benchmark results themselves rather than any tautological reduction. This matches the expected non-circular outcome for a pure empirical dataset paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The paper introduces no new physical or mathematical entities. It relies on standard assumptions that expert annotators can reliably judge graduate-level difficulty and that LLM responses provide useful signals for filtering.

axioms (1)

domain assumption Expert annotators can consistently identify and remove trivial or ambiguous graduate-level questions
Invoked in the description of the Human-LLM collaborative filtering mechanism

pith-pipeline@v0.9.0 · 5894 in / 1228 out tokens · 48499 ms · 2026-05-16T00:41:53.807346+00:00 · methodology

discussion (0)

Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Scaling Latent Reasoning via Looped Language Models
cs.CL 2025-10 unverdicted novelty 7.0

Looped language models with latent iterative computation and entropy-regularized depth allocation achieve performance matching up to 12B standard LLMs through superior knowledge manipulation.
SlimQwen: Exploring the Pruning and Distillation in Large MoE Model Pre-training
cs.LG 2026-05 unverdicted novelty 6.0

Pruning pretrained MoE models outperforms training from scratch, different compression methods converge after continued pretraining, and combining KD with language modeling loss plus progressive schedules yields a com...
Rotation-Preserving Supervised Fine-Tuning
cs.LG 2026-05 unverdicted novelty 6.0

RPSFT improves the in-domain versus out-of-domain performance trade-off during LLM supervised fine-tuning by penalizing rotations in pretrained singular subspaces as a proxy for loss-sensitive directions.
Implicit Compression Regularization: Concise Reasoning via Internal Shorter Distributions in RL Post-Training
cs.AI 2026-05 unverdicted novelty 6.0

ICR creates a virtual shorter distribution from shortest correct on-policy responses to regularize RL post-training toward concise yet accurate reasoning, improving the accuracy-length Pareto frontier on math and know...
SciResearcher: Scaling Deep Research Agents for Frontier Scientific Reasoning
cs.AI 2026-05 unverdicted novelty 6.0

SciResearcher automates creation of diverse scientific reasoning tasks from academic evidence to train an 8B model that sets new SOTA at 19.46% on HLE-Bio/Chem-Gold and gains 13-15% on SuperGPQA-Hard-Biology and TRQA-...
Agent-World: Scaling Real-World Environment Synthesis for Evolving General Agent Intelligence
cs.AI 2026-04 unverdicted novelty 6.0

Agent-World autonomously synthesizes verifiable real-world tasks and uses continuous self-evolution to train 8B and 14B agents that outperform proprietary models on 23 benchmarks.
SafeSci: Safety Evaluation of Large Language Models in Science Domains and Beyond
cs.LG 2026-03 conditional novelty 6.0

SafeSci creates a large objective benchmark and training resource that reveals safety weaknesses in current LLMs for science and demonstrates measurable improvement through targeted fine-tuning.
SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture
cs.CV 2026-05 unverdicted novelty 5.0

SenseNova-U1 presents native unified multimodal models that match top understanding VLMs while delivering strong performance in image generation, infographics, and interleaved tasks via the NEO-unify architecture.
Heterogeneous Scientific Foundation Model Collaboration
cs.AI 2026-04 unverdicted novelty 5.0

Eywa enables language-based agentic AI systems to collaborate with specialized scientific foundation models for improved performance on structured data tasks.
Qwen3.5-Omni Technical Report
cs.CL 2026-04 unverdicted novelty 5.0

Qwen3.5-Omni scales an omnimodal model to hundreds of billions of parameters with 256k context, introduces ARIA for stable speech synthesis, and reports SOTA performance on 215 audio-visual benchmarks while adding mul...
Seed1.8 Model Card: Towards Generalized Real-World Agency
cs.AI 2026-03 unverdicted novelty 5.0

Seed1.8 is a new foundation model that adds unified agentic capabilities for search, code execution, and GUI interaction to existing LLM and vision strengths.
MiMo-V2-Flash Technical Report
cs.CL 2026-01 unverdicted novelty 5.0

MiMo-V2-Flash is a 309B/15B MoE model trained on 27T tokens with hybrid attention and multi-teacher on-policy distillation that matches larger models like DeepSeek-V3.2 while enabling 2.6x faster decoding via repurpos...
Kimi K2: Open Agentic Intelligence
cs.LG 2025-07 unverdicted novelty 5.0

Kimi K2 is a 1-trillion-parameter MoE model that leads open-source non-thinking models on agentic benchmarks including 65.8 on SWE-Bench Verified and 66.1 on Tau2-Bench.
Qwen3 Technical Report
cs.CL 2025-05 unverdicted novelty 5.0

Pith review generated a malformed one-line summary.
Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models
cs.AI 2025-03 unverdicted novelty 5.0

The paper unifies perspectives on Long CoT in reasoning LLMs by introducing a taxonomy, detailing characteristics of deep reasoning and reflection, and discussing emergence phenomena and future directions.
Qwen-Scope: Turning Sparse Features into Development Tools for Large Language Models
cs.CL 2026-05 unverdicted novelty 4.0

Qwen-Scope provides open-source sparse autoencoders for Qwen models that function as practical interfaces for steering, evaluating, data workflows, and optimizing large language models.
Supplement Generation Training for Enhancing Agentic Task Performance
cs.LG 2026-04 unverdicted novelty 4.0

SGT trains a lightweight model to generate task-specific supplemental text that improves performance of a larger frozen LLM on agentic tasks without modifying the large model.
MedXIAOHE: A Comprehensive Recipe for Building Medical MLLMs
cs.CL 2026-02 unverdicted novelty 4.0

MedXIAOHE is a medical MLLM that claims state-of-the-art benchmark performance through specialized pretraining to cover long-tail diseases and RL-based reasoning training.

Reference graph

Works this paper leans on

121 extracted references · 121 canonical work pages · cited by 18 Pith papers · 1 internal anchor

[1]

URL https://api.semanticscholar.org/CorpusID:274656307. AI-MO. Aimo validation aime, 2024. URL https://huggingface.co/datasets/AI -MO/aimo-validation-aime . Validation set containing 90 AIME problems from 2022-2024 contests. AI@Meta. Llama 3 model card. 2024. URL https://github.com/meta-llama/lla ma3/blob/main/MODEL_CARD.md. Anthropic. Claude 3.5 sonnet m...

work page doi:10.48550/arxiv.2412.03205 2024
[2]

Yi: Open Foundation Models by 01.AI

doi: 10.18653/v1/d18-1259. URL https://doi.org/10.18653/v1/d18-1 259. A. A. Young, B. Chen, C. Li, C. Huang, G. Zhang, G. Zhang, H. Li, J. Zhu, J. Chen, J. Chang, K. Yu, P . Liu, Q. Liu, S. Yue, S. Yang, S. Yang, T. Yu, W. Xie, W. Huang, X. Hu, X. Ren, X. Niu, P . Nie, Y. Xu, Y. Liu, Y. Wang, Y. Cai, Z. Gu, Z. Liu, and Z. Dai. Yi: Open foundation models b...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.18653/v1/d18-1259 2024
[3]

According to Danto’s definition, context is an art world with modern aspects

work page
[4]

La Bayadère

“La Bayadère” is a ballet created during the French July Revolution

work page
[5]

The ballet “Sylvia” is a dance drama created during the Paris Commune period in 1871

work page
[6]

Korean court dance, when calculated according to temporal principles, does not belong to secondary civilization. Options: A) 1, 3 B) 1, 4 C) 1,2,4 D) 2,3 E) 1,2,3 F) 3 G) 4 H) 3,4 I) 1,2,3,4 J) 2,4 Answer: 3, 4 Answer letter: H Discipline: Literature and Arts Field: Art Studies Subfield: Dance Studies Difficulty: easy Middle Sample Question: A deck of pla...

work page
[7]

Material Review: • Ensure that the materials meet the aforementioned requirements, including accuracy, diversity, neutrality (free from regional bias), non-image-based content, and compliance with copyright regulations

work page
[8]

Transcribe the question content word-for-word, ensuring no key information is omitted, with particular attention to the accurate transcription of formulas

Question Transcription: • Use OCR tools to recognize the text in the materials or directly paste the origi- nal content. Transcribe the question content word-for-word, ensuring no key information is omitted, with particular attention to the accurate transcription of formulas

work page
[9]

Transcribe the options word-for-word, ensuring no key information is omitted, with particular attention to the accurate transcription of formulas

Option Transcription: • Use OCR tools to recognize the text in the materials or directly paste the original content. Transcribe the options word-for-word, ensuring no key information is omitted, with particular attention to the accurate transcription of formulas

work page
[11]

The number of options should be between 4 and 10, with 42 more options being preferred to increase the difficulty and discriminatory power of the question

Distractor Addition: • Add distractors while maintaining the quality of the options (avoiding mean- ingless or excess correct answers) until the annotator deems the level of confusion sufficient. The number of options should be between 4 and 10, with 42 more options being preferred to increase the difficulty and discriminatory power of the question

work page
[12]

discipline

Field Completion: • Complete the category fields ("discipline","field","subfield") and select the appropriate difficulty level to ensure the integrity and accuracy of the question information. B.2.2. Non-Choice Conversion Method ὒ7Non-Choice Conversion Description: • Convert non-multiple-choice questions (such as calculation questions, fill-in-the- blank ...

work page
[13]

• Ensure that the conditions of the question are complete and do not cause ambiguity

Material Review: • Ensure that the materials meet the aforementioned requirements, including accuracy, diversity, neutrality (free from regional bias), non-image-based content, and compliance with copyright regulations. • Ensure that the conditions of the question are complete and do not cause ambiguity. Add supplementary explanations if necessary. • Ensu...

work page
[14]

Question Transcription: • Use OCR tools to recognize the text in the material or directly paste the original text. Transcribe the content of the original question word by word, ensuring that all numbers, formulas, or condition information in the question are accurate and complete to avoid ambiguity or incorrect answers. • Add supplementary information if ...

work page
[15]

item Use OCR tools to recognize results in the material and confirm that all numerical and formula information in the answer is accurate

Answer Transcription: • Identify and select the correct result for the appropriate question based on the answer analysis process. item Use OCR tools to recognize results in the material and confirm that all numerical and formula information in the answer is accurate. 43

work page
[16]

• For numerical results, the setting of distractors should consider reasonable ranges of calculation errors to ensure differentiation between options

Distractor Addition: • On the basis of the correct answer, consider setting common calculation errors (such as rounding errors, sign mistakes, digit errors, etc.) as distractors. • For numerical results, the setting of distractors should consider reasonable ranges of calculation errors to ensure differentiation between options

work page
[18]

discipline

Field Completion: • Complete the category fields ("discipline","field","subfield") and select the appropriate difficulty level to ensure the integrity and accuracy of the question information. B.2.3. Statement Combination Method ὒ7Statement Combination Description: • By integrating stated expressions related to multiple concepts or knowledge points, a mul...

work page
[19]

Statement Extraction: • Extract core concepts, definitions, relationships between concepts, application cases, and common misconceptions from textbooks and related learning resources, ensuring that the statements are representative and comprehensive. • Extract important statements from multiple-choice questions, covering multi- ple knowledge points to avo...

work page
[20]

• Incorrect statements should be somewhat misleading, avoiding obvious or easily dismissible errors

Statement Adaptation: • Adapt the extracted statements to include both correct and incorrect versions. • Incorrect statements should be somewhat misleading, avoiding obvious or easily dismissible errors. 44 • Number the adapted statements (e.g., I, II, III, etc.). Arrange the statements in a reasonable order, avoiding bias towards any direction (e.g., alw...

work page
[21]

Which of the following statements about [knowledge point] is correct?

Question Design: • Depending on the need, questions may limit the scope or conditions being tested. • "Which of the following statements about [knowledge point] is correct?" • "Which of the following statements about [knowledge point] is incorrect?"

work page
[22]

• Ensure that there is exactly one correct answer among the options

Combination Design: • Combine the numbered statements to create options, such as I and II; II and III; III, VI, and VII. • Ensure that there is exactly one correct answer among the options

work page
[23]

Answer Identification: • Clearly mark the correct answer within the list of options, ensuring its accu- racy and uniqueness

work page
[24]

discipline

Field Completion: • Complete the category fields ("discipline","field","subfield") and select the appropriate difficulty level to ensure the integrity and accuracy of the question information. B.2.4. Confusion-options Generation During the annotation process, we use large models to assist in generating distractors. We select Claude-3.5, GPT-4, Doubao, and...

work page
[25]

• The distractor should introduce a subtle mathematical error while maintaining the formula structure

Generate Distractor: • The distractor must be incorrect (not the correct answer). • The distractor should introduce a subtle mathematical error while maintaining the formula structure. • It must be distinct from all existing options, including the correct answer. • Avoid any repetition or overlap with the existing answer options in terms of value or meani...

work page
[26]

• If the distractor matches any existing option, regenerate it

Uniqueness Check: • Thoroughly check all existing options (including the correct one) to ensure the distractor is unique. • If the distractor matches any existing option, regenerate it. • The distractor must not create ambiguity; the correct answer must remain the sole valid choice

work page
[27]

{}", Options: {}, Correct Answer:

Formatting: • Ensure the distractor matches the format of the existing options (e.g., fractions, exponents, etc.). • Avoid formatting inconsistencies such as misplaced symbols or spaces. Output Format: <distractor> your generated distractor here </distractor> Do not include explanations or extra information. Do not include numbering. Input: Question: "{}"...

work page
[28]

Rule-Based Pre-Check subsection C.1: Data undergoes initial screening based on predefined code rules, quickly identifying and eliminating clearly invalid data points, thereby enhancing efficiency and reducing subsequent workload

work page
[29]

LLM-Based Quality Inspection subsection C.2: Large language models are uti- lized to perform in-depth quality assessments of the data, identifying potential errors and inconsistencies, thereby improving the accuracy and completeness of the data

work page
[30]

all-MiniLM-L6- v2

Manual Quality Review subsection C.3 : Experienced data analysts conduct a final review to ensure the high quality and usability of the data, correcting potential misjudgments and incorporating domain knowledge for a comprehensive evaluation. C.1. Rule-Based Pre-Check Table 8 presents the rules for data filtering and pre-checking, including text normaliza...

work page
[31]

What is the value of X?

The question must explicitly pose a specific problem or ask a clear question that can be answered or calculated. If the question does not clearly ask something (e.g., if it is vague, incomplete, or does not directly ask for a specific answer), it should be considered invalid. A valid question should directly require an answer, such as asking for a numeric...

work page
[32]

If any part is missing or unclear (e.g., if the answer does not match any of the listed options), the question should be deemed invalid

The question, options, and answer must be fully defined. If any part is missing or unclear (e.g., if the answer does not match any of the listed options), the question should be deemed invalid. Note that the number of options is not a factor in determining completeness—having only one option is acceptable as long as the content is complete and coherent

work page
[33]

Option A is true because Option B is false

The question, options, and answer must be directly relevant to each other. The op- tions should not reference each other, and there should be no circular dependencies or inter-referencing between options (e.g., “Option A is true because Option B is false”). Each option must stand independently, with no cross-references to other options. 49

work page
[34]

The following options are incorrect

Negative phrasing such as “The following options are incorrect” or “None of the above” is not allowed in the question. The question should use positive phrasing (e.g., “Which of the following is correct?”) to avoid confusion. If negative phrasing is detected in the question, it should be deemed invalid

work page
[35]

is_valid

The question must be in a valid multiple-choice format. If the question is not suitable for a multiple-choice format (e.g., it is a free-form answer question like a problem or essay), it should be deemed invalid. Ensure that the question is designed specifically for multiple-choice answering, rather than being a question that requires an open-ended respon...

work page
[36]

Which of the following is NOT...?

The question must not be a negation (e.g., "Which of the following is NOT...?" or "What is not...?")

work page
[37]

Which of the following is the best/worst...?

The question should avoid vague or ambiguous phrasing, such as "Which of the following is the best/worst...?" or other uncertain expressions

work page
[38]

All of the above

The answer choices should not be all affirmations or all negations, such as "All of the above" or "None of the above," as these are considered inappropriate. Please return the result in the following JSON format: { "is_valid": true/false } Ensure that the output is valid JSON and do not return any unrelated information. Example input 1: { question: Which ...

work page
[39]

incorrect

Analyze the final part of the question (the actual question being asked) to check for specific phrases: • "incorrect" in a context like "which of the following is incorrect" • "most" in a context like"which is most likely"

work page
[40]

If either of these phrases appears in the question’s final asking part, outputtrue, followed by a brief explanation of why the phrase is present

work page
[41]

A new technology is introduced in a manufacturing plant. Which of the following is most likely to occur?

If neither phrase is found in the question’s final part, outputfalse, followed by a brief explanation. Output Format: Provide a response with: • true orfalse as the first part of the output • A brief explanation that justifies your answer Example Input: Question: "A new technology is introduced in a manufacturing plant. Which of the following is most like...

work page
[42]

• If discipline is appropriate, evaluate if thefield is relevant

Evaluate the classification level by level: • First, check if the discipline is relevant to the question. • If discipline is appropriate, evaluate if thefield is relevant. • If field is appropriate, evaluate if thesubfield is relevant

work page
[43]

Special rule: • If thesubfield is found to be highly relevant to the question, the entire clas- sification (discipline, field, and subfield) is considered appropriate, regardless of minor mismatches indiscipline orfield

work page
[44]

If no levels meet the above criteria, or if any level other thansubfield is deemed inappropriate withoutsubfield being highly relevant, the classification is consid- ered not relevant

work page
[45]

is_relevant

Output strictly and exclusively in the following JSON format and don’t output any explanation, just output JSON: { "is_relevant": true/false 55 } C.2.5. Completeness Assessment ☼ Purpose • Determine if the question is solvable, considering the following two dimensions: • Confidence: The level of confidence in the answer, categorized as "High," "Medium," o...

work page
[46]

Attempt to solve the problem and assess your confidence level in the answer

work page
[47]

If you cannot provide an accurate answer, evaluate whether it is due to missing information (e.g., diagrams, formulas, known conditions) that makes the problem unsolvable, or if it’s due to other reasons (e.g., high difficulty or uncertainty in solving)

work page
[48]

final_answer_letter

Output format: { "final_answer_letter": "A", "confidence": "High", "missing_info": false } Rules: • If the problem is missing a "diagram" (such as a geometry question requiring visual input), mark it as missing_info: true and explain that the absence of the diagram or visual representation makes the question unsolvable. • If the problem contains sufficien...

work page 2000
[49]

DeepSeek-R1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100

work page
[50]

DeepSeek-R1-Zero . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102

work page
[51]

o1-2024-12-17 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104

work page 2024
[52]

o3-mini-2025-01-31-high . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106

work page 2025
[53]

Doubao-1.5-pro-32k-250115 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108

work page
[54]

o3-mini-2025-01-31-medium . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110

work page 2025
[55]

Doubao-1.5-pro-32k-241225 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112

work page
[56]

qwen-max-2025-01-25 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114

work page 2025
[57]

claude-3-5-sonnet-20241022 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116

work page
[58]

o3-mini-2025-01-31-low . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118

work page 2025
[59]

gemini-2.0-flash . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120

work page
[60]

DeepSeek-V3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122

work page
[61]

o1-mini-2024-09-12 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124

work page 2024
[62]

MiniMax-Text-01 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126

work page
[63]

gpt-4o-2024-11-20 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128

work page 2024
[64]

QwQ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130

work page
[65]

Llama-3.1-405B-Instruct . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132

work page
[66]

gpt-4o-2024-08-06 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134

work page 2024
[67]

Qwen2.5-72B-Instruct . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136

work page
[68]

Mistral-Large-Instruct-2411 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138 96

work page
[69]

qwen-max-2024-09-19 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140

work page 2024
[70]

gpt-4o-2024-05-13 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142

work page 2024
[71]

Qwen2.5-32B-Instruct . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144

work page
[72]

Llama-3.3-70B-Instruct . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146

work page
[73]

phi-4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148

work page
[74]

Qwen2.5-14B-Instruct . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150

work page
[75]

Llama-3.1-70B-Instruct . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152

work page
[76]

Qwen2.5-72B . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154

work page
[77]

Yi-Lighting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156

work page
[78]

Qwen2.5-32B . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158

work page
[79]

DeepSeek-V3-Base . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160

work page
[80]

Qwen2.5-14B . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162

work page
[81]

Mixtral-8x22B-Instruct-v0.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164

work page
[82]

Qwen2.5-7B-Instruct . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166

work page

Showing first 80 references.