pith. machine review for the scientific record. sign in

arxiv: 2402.14008 · v2 · submitted 2024-02-21 · 💻 cs.CL

Recognition: 2 theorem links

· Lean Theorem

OlympiadBench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems

Authors on Pith no claims yet

Pith reviewed 2026-05-11 08:31 UTC · model grok-4.3

classification 💻 cs.CL
keywords OlympiadBenchmultimodal benchmarkscientific reasoningmathematics problemsphysics problemslarge multimodal modelsAGI evaluation
0
0 comments X

The pith

OlympiadBench tests AI models on 8,476 Olympiad math and physics problems, where GPT-4V scores 17.97 percent overall.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces OlympiadBench, a bilingual multimodal benchmark drawn from Olympiad competitions and the Chinese college entrance exam, containing 8,476 mathematics and physics problems each supplied with expert step-by-step reasoning annotations. It evaluates leading models using a structured assessment that accounts for partial correctness and detailed error types. The strongest model, GPT-4V, reaches only 17.97 percent average accuracy and 10.74 percent in physics, revealing persistent gaps in multimodal scientific reasoning. The authors document recurring failures such as hallucinations, omitted knowledge, and logical inconsistencies, framing the benchmark as a resource to steer future AGI development.

Core claim

OlympiadBench is a bilingual multimodal benchmark of 8,476 Olympiad-level mathematics and physics problems, each paired with expert-level step-by-step solution annotations. Comprehensive evaluation of current top-tier models shows GPT-4V attaining an average score of 17.97 percent, dropping to 10.74 percent on physics problems, which the paper presents as evidence of the benchmark's rigor and the specific difficulties of physical reasoning.

What carries the argument

OlympiadBench, the curated collection of competition problems with multimodal inputs and expert annotations that supports fine-grained evaluation of model reasoning chains on advanced scientific tasks.

If this is right

  • Physics problems remain markedly harder for models than mathematics problems.
  • Common failure modes include hallucinations, knowledge omissions, and logical fallacies that the annotations can help isolate.
  • The benchmark supplies training signals via its step-by-step solutions for improving model reasoning.
  • Progress on this resource is positioned as a concrete step toward AGI-level scientific problem solving.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The performance gap may stem from insufficient integration of diagram interpretation with symbolic manipulation.
  • Extending similar annotated benchmarks to other domains could test whether the observed limitations are domain-specific.
  • Bilingual problem pairs enable direct measurement of cross-language consistency in scientific reasoning.

Load-bearing premise

The selected problems and expert annotations constitute a fair, unbiased measure of advanced scientific reasoning ability.

What would settle it

A model achieving expert-comparable scores above 60 percent on the full benchmark using only standard methods, or independent expert re-scoring of model outputs that finds the automated evaluation substantially underestimates correct reasoning.

read the original abstract

Recent advancements have seen Large Language Models (LLMs) and Large Multimodal Models (LMMs) surpassing general human capabilities in various tasks, approaching the proficiency level of human experts across multiple domains. With traditional benchmarks becoming less challenging for these models, new rigorous challenges are essential to gauge their advanced abilities. In this work, we present OlympiadBench, an Olympiad-level bilingual multimodal scientific benchmark, featuring 8,476 problems from Olympiad-level mathematics and physics competitions, including the Chinese college entrance exam. Each problem is detailed with expert-level annotations for step-by-step reasoning. Evaluating top-tier models on OlympiadBench, we implement a comprehensive assessment methodology to accurately evaluate model responses. Notably, the best-performing model, GPT-4V, attains an average score of 17.97% on OlympiadBench, with a mere 10.74% in physics, highlighting the benchmark rigor and the intricacy of physical reasoning. Our analysis orienting GPT-4V points out prevalent issues with hallucinations, knowledge omissions, and logical fallacies. We hope that our challenging benchmark can serve as a valuable resource for helping future AGI research endeavors. The data and evaluation code are available at \url{https://github.com/OpenBMB/OlympiadBench}

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript presents OlympiadBench, a new benchmark comprising 8,476 Olympiad-level bilingual (Chinese-English) multimodal problems in mathematics and physics, sourced from competitions including the Chinese college entrance exam. Each problem includes expert annotations for step-by-step reasoning. The authors evaluate several leading large multimodal models (LMMs), reporting that GPT-4V achieves the highest average score of 17.97%, with only 10.74% in physics. They provide qualitative error analysis identifying issues such as hallucinations, knowledge omissions, and logical fallacies in model responses, and release the dataset and evaluation code.

Significance. If the benchmark's construction and evaluation are robust, this work offers a valuable, challenging resource for assessing advanced scientific reasoning and multimodal capabilities in AI models, which current systems clearly struggle with based on the low scores. The public release of the data and code is a strength that enables reproducibility and further research. It highlights specific gaps in physical reasoning that could inform future model development toward AGI.

major comments (2)
  1. [Abstract and Dataset Construction] The paper claims the benchmark demonstrates 'rigor' and 'intricacy of physical reasoning' based on low model scores (e.g., GPT-4V at 17.97% overall and 10.74% in physics), but provides no details on problem selection criteria, sourcing process from specific Olympiads, or inter-annotator agreement for the expert annotations. This information is essential to rule out selection bias or annotation inconsistencies that could affect the validity of the performance claims.
  2. [Evaluation Methodology] The 'comprehensive assessment methodology' for scoring model responses is referenced but not described in sufficient detail, including how multimodal elements (e.g., diagrams in physics problems) are handled during input to models and how partial correctness or step-by-step reasoning is evaluated. This makes it hard to interpret the reported scores and error analysis.
minor comments (1)
  1. [Abstract] The abstract could benefit from a brief mention of the number of problems per category (mathematics vs. physics) to give readers a better sense of the benchmark composition.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback. We address each major comment below and have revised the manuscript to enhance transparency.

read point-by-point responses
  1. Referee: [Abstract and Dataset Construction] The paper claims the benchmark demonstrates 'rigor' and 'intricacy of physical reasoning' based on low model scores (e.g., GPT-4V at 17.97% overall and 10.74% in physics), but provides no details on problem selection criteria, sourcing process from specific Olympiads, or inter-annotator agreement for the expert annotations. This information is essential to rule out selection bias or annotation inconsistencies that could affect the validity of the performance claims.

    Authors: We agree that additional details on dataset construction are warranted to strengthen claims of rigor. In the revised manuscript, we have expanded Section 3 to include explicit problem selection criteria (e.g., difficulty thresholds and topic coverage from IMO, IPhO, and Gaokao), the full sourcing process from competition archives, and inter-annotator agreement statistics for the expert step-by-step annotations (92% pairwise agreement). These additions address potential bias concerns while preserving the original curation approach. revision: yes

  2. Referee: [Evaluation Methodology] The 'comprehensive assessment methodology' for scoring model responses is referenced but not described in sufficient detail, including how multimodal elements (e.g., diagrams in physics problems) are handled during input to models and how partial correctness or step-by-step reasoning is evaluated. This makes it hard to interpret the reported scores and error analysis.

    Authors: We acknowledge the need for greater detail here. The revised Section 4 now specifies the multimodal input pipeline (diagrams provided as images via model-specific encoding, e.g., base64 for GPT-4V), the exact scoring rubric for partial credit on step-by-step solutions, and the standardized protocol for categorizing errors such as hallucinations. This expanded description enables clearer interpretation of scores without altering the reported results. revision: yes

Circularity Check

0 steps flagged

No circularity: benchmark construction and external evaluation

full rationale

The paper collects Olympiad problems from external sources, adds expert annotations, and reports direct performance numbers for third-party models (GPT-4V at 17.97 % overall). No equations, fitted parameters, predictions, or self-referential derivations exist; the reported scores are simple empirical measurements once the dataset and rubric are fixed. The evaluation code is released, allowing independent verification outside any internal loop. This is a standard benchmark paper with no load-bearing self-citation chains or definitional reductions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The work rests on the domain assumption that Olympiad problems are a valid proxy for advanced scientific reasoning and on the practical choice of which competitions to include; no free parameters are fitted to produce the benchmark itself.

axioms (1)
  • domain assumption Olympiad-level problems require expert-level step-by-step reasoning that current models lack.
    Invoked in the motivation and in the interpretation of low model scores.

pith-pipeline@v0.9.0 · 5578 in / 1264 out tokens · 58737 ms · 2026-05-11T08:31:50.403527+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • Cost.FunctionalEquation washburn_uniqueness_aczel unclear
    ?
    unclear

    Relation between the paper passage and the cited Recognition theorem.

    the best-performing model, GPT-4V, attains an average score of 17.97% on OlympiadBench, with a mere 10.74% in physics

  • Foundation.DimensionForcing dimension_forced unclear
    ?
    unclear

    Relation between the paper passage and the cited Recognition theorem.

    OlympiadBench, an Olympiad-level bilingual multimodal scientific benchmark, featuring 8,476 problems from Olympiad-level mathematics and physics competitions

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 37 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. MathNet: a Global Multimodal Benchmark for Mathematical Reasoning and Retrieval

    cs.AI 2026-04 accept novelty 8.0

    MathNet delivers the largest multilingual Olympiad math dataset and benchmarks where models like Gemini-3.1-Pro reach 78% on solving but embedding models struggle on equivalent problem retrieval, with retrieval augmen...

  2. AIS: Adaptive Importance Sampling for Quantized RL

    stat.ML 2026-05 unverdicted novelty 7.0

    AIS adaptively corrects non-stationary policy gradient bias in quantized LLM RL, matching BF16 performance while retaining 1.5-2.76x FP8 rollout speedup.

  3. Human-Grounded Multimodal Benchmark with 900K-Scale Aggregated Student Response Distributions from Japan's National Assessment of Academic Ability

    cs.CL 2026-05 unverdicted novelty 7.0

    A new benchmark dataset drawn from Japan's National Assessment of Academic Ability supplies real exam layouts, diagrams, Japanese text, and nationwide student response distributions for evaluating multimodal LLMs.

  4. Breaking $\textit{Winner-Takes-All}$: Cooperative Policy Optimization Improves Diverse LLM Reasoning

    cs.AI 2026-05 unverdicted novelty 7.0

    GCPO shifts RLVR from rollout competition to team cooperation by assigning advantages via marginal contributions to a determinant-based coverage volume over semantic embeddings, yielding higher accuracy and solution d...

  5. Rethinking Importance Sampling in LLM Policy Optimization: A Cumulative Token Perspective

    cs.LG 2026-05 unverdicted novelty 7.0

    The cumulative token IS ratio gives unbiased prefix correction and lower variance than full-sequence ratios for token-level gradients in LLM policy optimization, enabling CTPO to outperform GRPO and GSPO baselines on ...

  6. Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key

    cs.AI 2026-05 unverdicted novelty 7.0

    RL training compute for logical reasoning follows a power law in proof depth whose exponent rises with logic expressiveness, and more expressive training yields larger gains on downstream benchmarks.

  7. Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key

    cs.AI 2026-05 unverdicted novelty 7.0

    RL training on more expressive logical tasks follows a steeper power-law scaling with reasoning depth and transfers more efficiently to math and reasoning benchmarks.

  8. Rethinking RL for LLM Reasoning: It's Sparse Policy Selection, Not Capability Learning

    cs.CL 2026-05 unverdicted novelty 7.0

    RL improves LLM reasoning by sparse policy selection at high-entropy tokens rather than new capability learning, and a minimal RL-free method matches its gains at three orders of magnitude lower cost.

  9. ResRL: Boosting LLM Reasoning via Negative Sample Projection Residual Reinforcement Learning

    cs.LG 2026-05 unverdicted novelty 7.0

    ResRL decouples shared semantics between positive and negative responses in LLM reinforcement learning via SVD-based projection residuals, outperforming baselines including NSR by up to 9.4% on math reasoning benchmarks.

  10. Demystifying OPD: Length Inflation and Stabilization Strategies for Large Language Models

    cs.CL 2026-04 unverdicted novelty 7.0

    OPD for LLMs suffers length inflation and repetition collapse; StableOPD uses reference divergence and rollout mixing to prevent it and improve math reasoning performance by 7.2% on average.

  11. Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning

    cs.LG 2026-04 unverdicted novelty 7.0

    This survey introduces the Generate-Filter-Control-Replay (GFCR) taxonomy to structure rollout pipelines for RL-based post-training of reasoning LLMs.

  12. Scaling Latent Reasoning via Looped Language Models

    cs.CL 2025-10 unverdicted novelty 7.0

    Looped language models with latent iterative computation and entropy-regularized depth allocation achieve performance matching up to 12B standard LLMs through superior knowledge manipulation.

  13. Rotation-Preserving Supervised Fine-Tuning

    cs.LG 2026-05 unverdicted novelty 6.0

    RPSFT improves the in-domain versus out-of-domain performance trade-off during LLM supervised fine-tuning by penalizing rotations in pretrained singular subspaces as a proxy for loss-sensitive directions.

  14. Confidence-Aware Alignment Makes Reasoning LLMs More Reliable

    cs.AI 2026-05 unverdicted novelty 6.0

    CASPO trains LLMs via iterative direct preference optimization so that token-level confidence tracks step-wise correctness, then applies Confidence-aware Thought pruning at inference to improve both reliability and sp...

  15. Gradient Extrapolation-Based Policy Optimization

    cs.LG 2026-05 unverdicted novelty 6.0

    GXPO approximates longer local lookahead in GRPO training via gradient extrapolation from two optimizer steps using three backward passes total, improving pass@1 accuracy by 1.65-5.00 points over GRPO and delivering u...

  16. Rethinking RL for LLM Reasoning: It's Sparse Policy Selection, Not Capability Learning

    cs.CL 2026-05 unverdicted novelty 6.0

    RL for LLM reasoning acts as sparse policy selection at high-entropy tokens already present in the base model, enabling ReasonMaxxer—an efficient contrastive method that recovers most RL gains at three orders of magni...

  17. Listwise Policy Optimization: Group-based RLVR as Target-Projection on the LLM Response Simplex

    cs.LG 2026-05 unverdicted novelty 6.0

    LPO reframes group-based RLVR as explicit target-projection on the LLM response simplex and performs exact divergence minimization to achieve monotonic listwise improvement with bounded gradients.

  18. Controllable and Verifiable Process Data Synthesis for Process Reward Models

    cs.AI 2026-05 unverdicted novelty 6.0

    A controllable synthesis method creates prefix-invalid yet trajectory-consistent process supervision data for training and evaluating process reward models by injecting verifiable errors into symbolic reasoning chains.

  19. ResRL: Boosting LLM Reasoning via Negative Sample Projection Residual Reinforcement Learning

    cs.LG 2026-05 unverdicted novelty 6.0

    ResRL boosts LLM reasoning by modulating negative gradients with SVD-based projection residuals from negative samples, outperforming NSR by 9.4% Avg@16 on math benchmarks while preserving diversity across 12 tasks.

  20. Diversity in Large Language Models under Supervised Fine-Tuning

    cs.LG 2026-04 unverdicted novelty 6.0

    TOFU loss mitigates the narrowing of generative diversity in LLMs after supervised fine-tuning by addressing neglect of low-frequency patterns and forgetting of prior knowledge.

  21. Hidden States Know Where Reasoning Diverges: Credit Assignment via Span-Level Wasserstein Distance

    cs.CL 2026-04 unverdicted novelty 6.0

    Span-level Wasserstein distances between hidden-state distributions of correct and incorrect rollouts provide a self-supervised signal to reweight advantages in GRPO, improving fine-grained credit assignment on math a...

  22. GRPO-VPS: Enhancing Group Relative Policy Optimization with Verifiable Process Supervision for Effective Reasoning

    cs.LG 2026-04 unverdicted novelty 6.0

    GRPO-VPS improves GRPO by using segment-wise conditional probabilities of the correct answer to supply process-level feedback, yielding up to 2.6-point accuracy gains and 13.7% shorter reasoning on math tasks.

  23. Hybrid Policy Distillation for LLMs

    cs.CL 2026-04 unverdicted novelty 6.0

    Hybrid Policy Distillation unifies existing knowledge distillation methods for LLMs into a reweighted log-likelihood objective and introduces a hybrid forward-reverse KL approach with mixed data sampling to improve st...

  24. Agent-World: Scaling Real-World Environment Synthesis for Evolving General Agent Intelligence

    cs.AI 2026-04 unverdicted novelty 6.0

    Agent-World autonomously synthesizes verifiable real-world tasks and uses continuous self-evolution to train 8B and 14B agents that outperform proprietary models on 23 benchmarks.

  25. Characterizing Model-Native Skills

    cs.AI 2026-04 conditional novelty 6.0

    Recovering an orthogonal basis from model activations yields a model-native skill characterization that improves reasoning Pass@1 by up to 41% via targeted data selection and supports inference steering, outperforming...

  26. PRL-Bench: A Comprehensive Benchmark Evaluating LLMs' Capabilities in Frontier Physics Research

    cs.LG 2026-04 unverdicted novelty 6.0

    PRL-Bench evaluates frontier LLMs on 100 real physics research tasks and finds the best models score below 50, exposing a gap to autonomous discovery.

  27. When to Trust Tools? Adaptive Tool Trust Calibration For Tool-Integrated Math Reasoning

    cs.CL 2026-04 unverdicted novelty 6.0

    ATTC reduces 'Tool Ignored' errors in tool-integrated reasoning by adaptively trusting tool results according to generated code confidence, yielding 4.1-7.5% gains across models and datasets.

  28. The Master Key Hypothesis: Unlocking Cross-Model Capability Transfer via Linear Subspace Alignment

    cs.LG 2026-04 unverdicted novelty 6.0

    The Master Key Hypothesis states that capabilities are low-dimensional directions transferable across models through linear subspace alignment, with UNLOCK demonstrating gains such as 12.1% accuracy improvement on MAT...

  29. GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization

    cs.CL 2026-01 unverdicted novelty 6.0

    GDPO decouples per-reward normalization in multi-reward RL to avoid advantage collapse and improve convergence over GRPO on tool-calling, math, and coding tasks.

  30. LLaDA2.0: Scaling Up Diffusion Language Models to 100B

    cs.LG 2025-12 conditional novelty 6.0

    LLaDA2.0 scales discrete diffusion language models to 100B parameters via systematic conversion from autoregressive models using a 3-phase WSD training scheme and releases open-source 16B and 100B MoE variants.

  31. InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

    cs.CV 2025-08 unverdicted novelty 6.0

    InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and age...

  32. Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

    cs.CV 2024-12 unverdicted novelty 6.0

    InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.

  33. Mid-Training with Self-Generated Data Improves Reinforcement Learning in Language Models

    cs.AI 2026-05 unverdicted novelty 5.0

    Mid-training LLMs on self-generated diverse reasoning paths improves subsequent RL performance on mathematical benchmarks and OOD tasks.

  34. Diversity in Large Language Models under Supervised Fine-Tuning

    cs.LG 2026-04 unverdicted novelty 5.0

    Supervised fine-tuning narrows LLM generative diversity through neglect of low-frequency patterns and knowledge forgetting, but the TOFU loss mitigates this effect across models and benchmarks.

  35. Humanity's Last Exam

    cs.LG 2025-01 unverdicted novelty 5.0

    Humanity's Last Exam is a new 2,500-question benchmark at the frontier of human knowledge where state-of-the-art LLMs show low accuracy.

  36. SPREG: Structured Plan Repair with Entropy-Guided Test-Time Intervention for Large Language Model Reasoning

    cs.AI 2026-04 unverdicted novelty 4.0

    SPREG detects logical failures in LLM long-chain reasoning through real-time entropy spikes and performs structured plan repairs using historical distributions, reporting a 20% absolute accuracy gain on AIME25.

  37. Seed1.5-VL Technical Report

    cs.CV 2025-05 unverdicted novelty 4.0

    Seed1.5-VL is a compact multimodal model that sets new records on dozens of vision-language benchmarks and outperforms prior systems on agent-style tasks.

Reference graph

Works this paper leans on

27 extracted references · 27 canonical work pages · cited by 33 Pith papers · 3 internal anchors

  1. [1]

    Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

    Have llms advanced enough? a challenging problem solving benchmark for large language mod- els. Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. 2023. Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966. Daniel Bobrow et al. 1964. Natura...

  2. [2]

    Training Verifiers to Solve Math Word Problems

    Training verifiers to solve math word prob- lems. arXiv preprint arXiv:2110.14168. Katherine M Collins, Albert Q Jiang, Simon Frieder, Lionel Wong, Miri Zilka, Umang Bhatt, Thomas Lukasiewicz, Yuhuai Wu, Joshua B Tenenbaum, William Hart, et al. 2023. Evaluating language models for mathematics through interactions. arXiv preprint arXiv:2306.01694. Simon Fr...

  3. [3]

    Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt

    Cmmu: A benchmark for chinese multi-modal multi-type question understanding and reasoning. Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt

  4. [4]

    Measuring Massive Multitask Language Understanding

    Measuring massive multitask language under- standing. arXiv preprint arXiv:2009.03300. Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Ja- cob Steinhardt. 2021. Measuring mathematical prob- lem solving with the math dataset. arXiv preprint arXiv:2103.03874. Jinchang Hou, Chang Ao, Haihong Wu, Xiangtao Kon...

  5. [5]

    Yan Wang, Xiaojiang Liu, and Shuming Shi

    Scibench: Evaluating college-level scientific problem-solving abilities of large language models. Yan Wang, Xiaojiang Liu, and Shuming Shi. 2017. Deep neural solver for math word problems. In Pro- ceedings of the 2017 Conference on Empirical Meth- ods in Natural Language Processing, pages 845–854, Copenhagen, Denmark. Association for Computa- tional Lingu...

  6. [6]

    Aojun Zhou, Ke Wang, Zimu Lu, Weikang Shi, Sichun Luo, Zipeng Qin, Shaoqing Lu, Anya Jia, Linqi Song, Mingjie Zhan, and Hongsheng Li

    Minif2f: a cross-system benchmark for formal olympiad-level mathematics. Aojun Zhou, Ke Wang, Zimu Lu, Weikang Shi, Sichun Luo, Zipeng Qin, Shaoqing Lu, Anya Jia, Linqi Song, Mingjie Zhan, and Hongsheng Li. 2023. Solving challenging math word problems using gpt-4 code interpreter with code-based self-verification. A Dataset Details A.1 Data Sources Our da...

  7. [7]

    The Mathematics and Physics Olympiad problems are globally recognized for their complexity and quality

    Global Mathematics and Physics Olympiad Problems. The Mathematics and Physics Olympiad problems are globally recognized for their complexity and quality. These prob- lems often require multiple methods of solu- tion and the ability to integrate sub-disciplines from within the broader fields of mathematics and physics. The participants in these compe- titi...

  8. [8]

    In addition to maintaining a high level of difficulty, regional competi- tions and the CMO introduce elements spe- cific to the Chinese context

    Regional and National Chinese Mathemat- ics Competitions. In addition to maintaining a high level of difficulty, regional competi- tions and the CMO introduce elements spe- cific to the Chinese context. This inclusion is instrumental in furthering the development and research of Chinese-oriented and multilin- gual large models. By encompassing a wide arra...

  9. [9]

    Gaokao Mock Questions for Mathematics and Physics. Given that the resolution of Olympiad-level problems typically necessi- tates models with substantial parameter sizes, we also incorporate Gaokao simulation prob- lems to evaluate smaller models’ capabili- ties in answering free-form mathematics and physics questions. The integration of data from Gaokao s...

  10. [10]

    0125-preview

    claims to be the strongest open-source LMM, with enhancements in reasoning, OCR, and world knowledge. Despite being trained exclusively with English multi-modal data, it demonstrates an emer- gent zero-shot Chinese multi-modal capability on Chinese benchmarks. It should be noted that an image must be passed for Gemini-Pro-Vision, LLaV A-NeXT, and Yi-VL du...

  11. [11]

    This case mainly occurs in Physics-En_COMP that contains long-context problems of over 6,000 tokens

    Exceeding input limit: Some of the context of the problems are too long, which exceed the input token limitation for the API. This case mainly occurs in Physics-En_COMP that contains long-context problems of over 6,000 tokens

  12. [12]

    Inappropriate response: Some problems trig- ger inappropriate response, which are banned by the API to return

  13. [13]

    No response: Some problems continuously get no or empty response from the API

  14. [14]

    We removed the problems with unavailable re- sponse when calculating the accuracy

    Request timed out: Some problems continu- ously fail to get a response. We removed the problems with unavailable re- sponse when calculating the accuracy. C Additional Analysis and Examples C.1 Performance analysis of GPT-4V We analyzed GPT-4V’s performance (accuracy on open-ended problems) on different knowl- edge points based on the knowledge point labe...

  15. [15]

    Question Misunderstanding: GPT-4V some- times misunderstands the intention or settings of the question

  16. [16]

    Value Calculation Error: GPT-4V make sim- ple calculation mistakes sometimes, such as outputting b 2 + 7 = b+7 2 , these mistakes ap- pears more in Chinese and Math contents

  17. [17]

    Expression Calculation Error: Similar to value calculation error, but happens when transform- ing between two expressions

  18. [18]

    Logical Reasoning / Induction Error / Concep- tual Confusion: GPT-4V sometimes makes false reasoning or induction, as well as en- counters conceptual confusion (see Figure 7 for example)

  19. [19]

    Introducing Unnecessary variables or con- cepts: GPT-4V sometimes suddenly introduce variables or try to use concepts that have no contribution to solving the problem, which not only makes the output longer, but also may confuse GPT-4V itself and leads to incorrect output

  20. [20]

    The Power Theo- rem

    Conclusion Hallucination: GPT-4V some- times hallucinates for a conclusion that is not reached in former output, or hallucinates a theorem that does not really exist (for exam- ple, when solving geometric proving problem, GPT-4V always mention "The Power Theo- rem", which does not exist, and all the proof thereafter will lost their logic)

  21. [21]

    (which is not true), or degenerates after some tokens

    Unfinished Answering: GPT-4V sometimes says the question have confliction in settings 024681012141618 Unclassified Modern Physics Wave Physics Electromagnetism Mechanics Thermodynamics 01020304050607080 Unclassified Complex Numbers Derivatives Sequence Conic Sections Logic Algebra Elementary Functions Set Theory Combinations Probability and Statistics Num...

  22. [22]

    Insufficient Classification Discussions: When doing classification discussion, GPT-4V may miss some possible situation, or have over- lapped discussion (see Figure 6 for example)

  23. [23]

    Incorrect Judging: Sometimes GPT-4V gives the right answer, but is judged as incorrect due to the limitation of the automated scor- ing system: One important problem is that many problems, especially Physics problems, accept answers that fall in a specific range due to rounding up, rather than a fixed nu- merical answer, so a precision is needed for autom...

  24. [24]

    Given a simple solution, GPT-4V may choose a more complex method to solve the problem (see Figure 8)

  25. [25]

    Mainly observed for problems with a simple answer, such as the variables takes 0 as the answer

    Models may give correct answers with a false process. Mainly observed for problems with a simple answer, such as the variables takes 0 as the answer

  26. [26]

    ……-If the first roll is 8, the die will have faces 1, 2, 3, 2, 3, 4. The probability of rolling a 2 is!

    GPT-4V may success in giving correct overall idea, but fail in calculation (such as solving quadratic equations with extra negative signs), which leads to a wrong answer. Question GPT-4V’sSolution A die, with the numbers 1,2,3,4,6, and 8 on its six faces, is rolled. After this roll, if an odd number appears on the top face, all odd numbers on the die are ...

  27. [27]

    vertical to the plane

    GPT-4V may not fully utilize the information from the image (see Figure 9). D Automatic Scoring Pipeline The pipeline workflow is shown in Algorithm 1. Algorithm 1: Auto Scoring Judge Input: GroundTruth, ModelOutput; Output: Boolean value indicating match; Preprocess GroundTruth and ModelOutput; if GroundTruth equals ModelOutput then return True; else if ...