pith. machine review for the scientific record.

arxiv: 2502.14739 · v4 · submitted 2025-02-20 · 💻 cs.CL

Recognition: 1 theorem link · Lean Theorem

SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 00:41 UTC · model grok-4.3

classification 💻 cs.CL
keywords LLM evaluation · benchmark · graduate-level knowledge · multi-discipline assessment · Human-LLM collaboration · reasoning capabilities · AI limitations

The pith

The SuperGPQA benchmark shows that top LLMs reach at most 61.82 percent accuracy across 285 graduate disciplines.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents SuperGPQA as a new evaluation set that covers graduate-level knowledge and reasoning in 285 disciplines, far beyond the usual focus on mathematics, physics, and computer science. It relies on a Human-LLM collaborative filtering process that iteratively removes trivial or ambiguous questions using both model answers and expert review. Results indicate that even the strongest current models, such as the reasoning-focused DeepSeek-R1, achieve at most 61.82 percent accuracy. This gap matters because it quantifies how far current systems remain from expert performance in specialized areas like agriculture, light industry, and service fields. The work also shares practical lessons from managing a large annotation effort involving over eighty experts.

Core claim

SuperGPQA is a benchmark spanning 285 disciplines that uses a Human-LLM collaborative filtering mechanism to produce high-quality graduate-level questions; evaluations of state-of-the-art LLMs on this set reach a maximum accuracy of 61.82 percent, demonstrating substantial room for improvement relative to artificial general intelligence.

What carries the argument

The Human-LLM collaborative filtering mechanism, which iteratively refines candidate questions by combining LLM responses with expert feedback to eliminate trivial or ambiguous items.
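
To make the mechanism concrete, the loop can be sketched in a few lines of Python. This is a minimal, hypothetical rendering, not the authors' pipeline: `llm_answers` and `expert_verdict` are placeholder callables standing in for the model queries and expert review the paper describes, and the 0.8 triviality threshold is an assumption made for the sketch. It also tallies per-round rejection counts, the kind of stage diagnostic the referee report below asks for.

```python
from dataclasses import dataclass, field

@dataclass
class Question:
    text: str
    options: list[str]
    answer: str
    history: list[str] = field(default_factory=list)  # audit trail across rounds

def is_trivial(llm_votes: list[str], answer: str, threshold: float = 0.8) -> bool:
    """Flag a question as trivial when most sampled LLM answers are already correct."""
    correct = sum(v == answer for v in llm_votes)
    return correct / max(len(llm_votes), 1) >= threshold

def collaborative_filter(pool, llm_answers, expert_verdict, max_rounds=3):
    """Iteratively drop trivial or ambiguous items.

    Returns the surviving questions plus per-round rejection counts
    (a hypothetical diagnostic, not one the paper reports).
    """
    rejections_per_round = []
    for round_idx in range(max_rounds):
        survivors, rejected = [], 0
        for q in pool:
            votes = llm_answers(q)          # model signal: sampled answers
            if is_trivial(votes, q.answer):
                rejected += 1               # too easy: most models get it
                continue
            verdict = expert_verdict(q)     # human signal: "keep" or "ambiguous"
            if verdict == "ambiguous":
                rejected += 1               # underspecified or multi-answer
                continue
            q.history.append(f"round {round_idx}: kept")
            survivors.append(q)
        rejections_per_round.append(rejected)
        pool = survivors
        if rejected == 0:                   # converged: nothing left to remove
            break
    return pool, rejections_per_round
```

In a real pipeline, `llm_answers` would fan out to several models and `expert_verdict` would queue the item for human review; both are deliberately left abstract here.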

If this is right

  • LLMs still exhibit large performance gaps in specialized disciplines outside mainstream academic areas.
  • The benchmark supplies a concrete metric for tracking progress toward broader expert-level capabilities.
  • Insights from the eighty-expert annotation process can guide the design of future large-scale evaluation efforts.
  • Discipline-by-discipline score differences can highlight which knowledge areas require targeted model improvements.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Training data for future models will likely need to incorporate more material from underrepresented fields such as agriculture and service disciplines to raise scores.
  • Real-world deployment in specialized professional settings may remain unreliable until accuracy on this type of benchmark rises well above 80 percent.
  • The filtering approach could be adapted to create similar graduate-level tests in languages other than English or for professional certification exams.

Load-bearing premise

The Human-LLM collaborative filtering process produces questions that are genuinely graduate-level, unambiguous, and representative of each discipline without introducing selection bias.

What would settle it

Independent expert review of a random sample of SuperGPQA questions, testing whether a large fraction can be answered correctly by undergraduates or contains hidden ambiguities that admit multiple valid answers.

read the original abstract

Large language models (LLMs) have demonstrated remarkable proficiency in mainstream academic disciplines such as mathematics, physics, and computer science. However, human knowledge encompasses over 200 specialized disciplines, far exceeding the scope of existing benchmarks. The capabilities of LLMs in many of these specialized fields, particularly in light industry, agriculture, and service-oriented disciplines, remain inadequately evaluated. To address this gap, we present SuperGPQA, a comprehensive benchmark that evaluates graduate-level knowledge and reasoning capabilities across 285 disciplines. Our benchmark employs a novel Human-LLM collaborative filtering mechanism to eliminate trivial or ambiguous questions through iterative refinement based on both LLM responses and expert feedback. Our experimental results reveal significant room for improvement in the performance of current state-of-the-art LLMs across diverse knowledge domains (e.g., the reasoning-focused model DeepSeek-R1 achieved the highest accuracy of 61.82% on SuperGPQA), highlighting the considerable gap between current model capabilities and artificial general intelligence. Additionally, we present comprehensive insights from our management of a large-scale annotation process, involving over 80 expert annotators and an interactive Human-LLM collaborative system, offering valuable methodological guidance for future research initiatives of comparable scope.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces SuperGPQA, a benchmark evaluating LLM graduate-level knowledge and reasoning across 285 disciplines. It describes a Human-LLM collaborative filtering process to remove trivial or ambiguous questions via iterative refinement with LLM responses and expert feedback, reports that the best model (DeepSeek-R1) reaches only 61.82% accuracy, and provides methodological insights from managing annotation with over 80 experts.

Significance. If the filtering successfully yields unambiguous graduate-level items representative of the 285 disciplines, the benchmark would meaningfully expand evaluation beyond mainstream fields and document a substantial capability gap. The large-scale annotation management details could also inform future benchmark construction efforts.

major comments (2)
  1. [Abstract / Filtering mechanism] Abstract and methods description of the Human-LLM collaborative filtering: no quantitative diagnostics are supplied (rejection rates per stage or discipline, inter-expert agreement, pre/post-filtering accuracy curves, or examples of discarded vs. retained items). This is load-bearing for the central claim that retained questions are genuinely graduate-level and that the 61.82% ceiling reflects a true AGI gap rather than residual ambiguity or selection bias.
  2. [Results] Results section reporting model accuracies: headline figures such as DeepSeek-R1 at 61.82% are given without error bars, confidence intervals, or per-discipline variance. This prevents assessment of whether observed differences across models or fields are statistically reliable.
minor comments (2)
  1. [Abstract] Clarify the exact count of disciplines (abstract states 'over 200' while title and body use 285) and provide a breakdown of coverage by broad field.
  2. Include at least one or two sample questions per major discipline cluster to allow readers to judge graduate-level difficulty directly.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below and describe the revisions we will incorporate to improve the manuscript.

read point-by-point responses
  1. Referee: [Abstract / Filtering mechanism] Abstract and methods description of the Human-LLM collaborative filtering: no quantitative diagnostics are supplied (rejection rates per stage or discipline, inter-expert agreement, pre/post-filtering accuracy curves, or examples of discarded vs. retained items). This is load-bearing for the central claim that retained questions are genuinely graduate-level and that the 61.82% ceiling reflects a true AGI gap rather than residual ambiguity or selection bias.

    Authors: We agree that the current description would benefit from additional quantitative support. In the revised manuscript we will add rejection rates per filtering stage and per discipline, inter-expert agreement statistics (including percentage agreement and Cohen’s kappa where multiple experts reviewed the same items), pre- and post-filtering accuracy curves for the LLMs used in the process, and representative examples of discarded versus retained questions. These additions will directly substantiate that the retained items are graduate-level and reduce concerns about residual ambiguity or selection bias; a minimal sketch of such an agreement computation follows these responses. revision: yes

  2. Referee: [Results] Results section reporting model accuracies: headline figures such as DeepSeek-R1 at 61.82% are given without error bars, confidence intervals, or per-discipline variance. This prevents assessment of whether observed differences across models or fields are statistically reliable.

    Authors: We acknowledge that statistical reliability measures are important for interpreting the results. In the revision we will report 95% binomial confidence intervals for all headline accuracy figures and include error bars on the main result plots. We will also add per-discipline standard deviations and, for disciplines with sufficient question counts, per-discipline error bars. These changes will allow readers to assess the reliability of differences across models and fields; a minimal interval computation is also sketched after these responses. revision: yes
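
To make the agreement statistic in response 1 concrete, here is a minimal Cohen's kappa computation for two annotators. It is a sketch: the keep/reject verdicts are toy data invented for the example, not figures from the paper.

```python
from collections import Counter

def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_chance = sum((freq_a[l] / n) * (freq_b[l] / n) for l in set(freq_a) | set(freq_b))
    return (p_observed - p_chance) / (1 - p_chance)  # assumes p_chance < 1

# Toy keep/reject verdicts from two experts on the same ten questions.
a = ["keep", "keep", "reject", "keep", "keep", "reject", "keep", "reject", "keep", "keep"]
b = ["keep", "reject", "reject", "keep", "keep", "reject", "keep", "keep", "keep", "keep"]
print(f"kappa = {cohens_kappa(a, b):.2f}")  # 0.47 on this toy data
```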
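
For response 2, a 95% binomial interval around the 61.82% headline follows from the Wilson score formula. The question count `n` below is a placeholder assumption, not the benchmark's actual size; with an n in the tens of thousands the half-width comes out well under a percentage point.

```python
import math

def wilson_interval(p_hat: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 gives ~95%)."""
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2))
    return center - half, center + half

# n here is a hypothetical question count, not the benchmark's actual size.
lo, hi = wilson_interval(0.6182, n=20000)
print(f"61.82% accuracy, 95% CI ~ [{lo:.4f}, {hi:.4f}]")
```

Under that placeholder n the interval is roughly [61.1%, 62.5%], i.e. about ±0.7 percentage points.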

Circularity Check

0 steps flagged

No circularity: empirical benchmark with no derivations or fitted predictions

full rationale

The paper constructs SuperGPQA via Human-LLM collaborative filtering and reports direct empirical accuracies (e.g., DeepSeek-R1 at 61.82%). No equations, parameters, or predictions exist that reduce to inputs by construction. The filtering process is presented as a methodological contribution without self-referential loops, uniqueness theorems, or ansatzes. Central claims rest on the benchmark results themselves rather than any tautological reduction. This matches the expected non-circular outcome for a pure empirical dataset paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper introduces no new physical or mathematical entities. It relies on standard assumptions that expert annotators can reliably judge graduate-level difficulty and that LLM responses provide useful signals for filtering.

axioms (1)
  • domain assumption: Expert annotators can consistently identify and remove trivial or ambiguous graduate-level questions.
    Invoked in the description of the Human-LLM collaborative filtering mechanism.

pith-pipeline@v0.9.0 · 5894 in / 1228 out tokens · 48499 ms · 2026-05-16T00:41:53.807346+00:00 · methodology

discussion (0)


Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Scaling Latent Reasoning via Looped Language Models

    cs.CL 2025-10 unverdicted novelty 7.0

    Looped language models with latent iterative computation and entropy-regularized depth allocation achieve performance matching up to 12B standard LLMs through superior knowledge manipulation.

  2. SlimQwen: Exploring the Pruning and Distillation in Large MoE Model Pre-training

    cs.LG 2026-05 unverdicted novelty 6.0

    Pruning pretrained MoE models outperforms training from scratch, different compression methods converge after continued pretraining, and combining KD with language modeling loss plus progressive schedules yields a com...

  3. Rotation-Preserving Supervised Fine-Tuning

    cs.LG 2026-05 unverdicted novelty 6.0

    RPSFT improves the in-domain versus out-of-domain performance trade-off during LLM supervised fine-tuning by penalizing rotations in pretrained singular subspaces as a proxy for loss-sensitive directions.

  4. Implicit Compression Regularization: Concise Reasoning via Internal Shorter Distributions in RL Post-Training

    cs.AI 2026-05 unverdicted novelty 6.0

    ICR creates a virtual shorter distribution from shortest correct on-policy responses to regularize RL post-training toward concise yet accurate reasoning, improving the accuracy-length Pareto frontier on math and know...

  5. SciResearcher: Scaling Deep Research Agents for Frontier Scientific Reasoning

    cs.AI 2026-05 unverdicted novelty 6.0

    SciResearcher automates creation of diverse scientific reasoning tasks from academic evidence to train an 8B model that sets new SOTA at 19.46% on HLE-Bio/Chem-Gold and gains 13-15% on SuperGPQA-Hard-Biology and TRQA-...

  6. Agent-World: Scaling Real-World Environment Synthesis for Evolving General Agent Intelligence

    cs.AI 2026-04 unverdicted novelty 6.0

    Agent-World autonomously synthesizes verifiable real-world tasks and uses continuous self-evolution to train 8B and 14B agents that outperform proprietary models on 23 benchmarks.

  7. SafeSci: Safety Evaluation of Large Language Models in Science Domains and Beyond

    cs.LG 2026-03 conditional novelty 6.0

    SafeSci creates a large objective benchmark and training resource that reveals safety weaknesses in current LLMs for science and demonstrates measurable improvement through targeted fine-tuning.

  8. SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture

    cs.CV 2026-05 unverdicted novelty 5.0

    SenseNova-U1 presents native unified multimodal models that match top understanding VLMs while delivering strong performance in image generation, infographics, and interleaved tasks via the NEO-unify architecture.

  9. Heterogeneous Scientific Foundation Model Collaboration

    cs.AI 2026-04 unverdicted novelty 5.0

    Eywa enables language-based agentic AI systems to collaborate with specialized scientific foundation models for improved performance on structured data tasks.

  10. Qwen3.5-Omni Technical Report

    cs.CL 2026-04 unverdicted novelty 5.0

    Qwen3.5-Omni scales an omnimodal model to hundreds of billions of parameters with 256k context, introduces ARIA for stable speech synthesis, and reports SOTA performance on 215 audio-visual benchmarks while adding mul...

  11. Seed1.8 Model Card: Towards Generalized Real-World Agency

    cs.AI 2026-03 unverdicted novelty 5.0

    Seed1.8 is a new foundation model that adds unified agentic capabilities for search, code execution, and GUI interaction to existing LLM and vision strengths.

  12. MiMo-V2-Flash Technical Report

    cs.CL 2026-01 unverdicted novelty 5.0

    MiMo-V2-Flash is a 309B/15B MoE model trained on 27T tokens with hybrid attention and multi-teacher on-policy distillation that matches larger models like DeepSeek-V3.2 while enabling 2.6x faster decoding via repurpos...

  13. Kimi K2: Open Agentic Intelligence

    cs.LG 2025-07 unverdicted novelty 5.0

    Kimi K2 is a 1-trillion-parameter MoE model that leads open-source non-thinking models on agentic benchmarks including 65.8 on SWE-Bench Verified and 66.1 on Tau2-Bench.

  14. Qwen3 Technical Report

    cs.CL 2025-05 unverdicted novelty 5.0

    Pith review generated a malformed one-line summary.

  15. Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models

    cs.AI 2025-03 unverdicted novelty 5.0

    The paper unifies perspectives on Long CoT in reasoning LLMs by introducing a taxonomy, detailing characteristics of deep reasoning and reflection, and discussing emergence phenomena and future directions.

  16. Qwen-Scope: Turning Sparse Features into Development Tools for Large Language Models

    cs.CL 2026-05 unverdicted novelty 4.0

    Qwen-Scope provides open-source sparse autoencoders for Qwen models that function as practical interfaces for steering, evaluating, data workflows, and optimizing large language models.

  17. Supplement Generation Training for Enhancing Agentic Task Performance

    cs.LG 2026-04 unverdicted novelty 4.0

    SGT trains a lightweight model to generate task-specific supplemental text that improves performance of a larger frozen LLM on agentic tasks without modifying the large model.

  18. MedXIAOHE: A Comprehensive Recipe for Building Medical MLLMs

    cs.CL 2026-02 unverdicted novelty 4.0

    MedXIAOHE is a medical MLLM that claims state-of-the-art benchmark performance through specialized pretraining to cover long-tail diseases and RL-based reasoning training.

Reference graph

Works this paper leans on

121 extracted references · 121 canonical work pages · cited by 18 Pith papers · 1 internal anchor
