
arxiv: 2503.21380 · v3 · submitted 2025-03-27 · 💻 cs.CL

Challenging the Boundaries of Reasoning: An Olympiad-Level Math Benchmark for Large Language Models

Pith reviewed 2026-05-22 22:10 UTC · model grok-4.3

classification 💻 cs.CL
keywords olympiad math benchmark · large language models · mathematical reasoning evaluation · lean formalization · dual evaluation · data contamination prevention · numerical answer assessment

The pith

OlymMATH unifies numerical answer evaluation and Lean formal verification in a 350-problem Olympiad math benchmark.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents OlymMATH, a new benchmark for evaluating large language models on Olympiad-level math problems that are harder than those in existing benchmarks. It provides parallel English and Chinese versions of 350 problems drawn from printed publications. The benchmark splits into 200 problems for natural language evaluation based on numerical answers and 150 problems formalized in Lean 4 for checking the reasoning process. Experiments show that models struggle on the benchmark, sometimes guess heuristically instead of reasoning rigorously, and perform differently on the English and Chinese versions.

Core claim

OlymMATH is the first benchmark to unify dual evaluation paradigms within a single suite: natural language evaluation through OlymMATH-EASY and OlymMATH-HARD, comprising 200 computational problems with numerical answers for objective rule-based assessment, and formal verification through OlymMATH-LEAN, offering 150 problems formalized in Lean 4 for rigorous process-level evaluation.
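The "objective rule-based assessment" of numerical answers can be pictured with a small checker. The sketch below is an illustration only, not the authors' grader: it assumes the model puts its final answer in a LaTeX `\boxed{...}` and that answers are integers, decimals, or simple `\frac{a}{b}` fractions. The function names `extract_boxed`, `to_fraction`, and `is_correct` are hypothetical.

```python
import re
from fractions import Fraction

BOXED = "\\boxed{"  # assumed answer marker; the real grader may differ


def extract_boxed(text):
    # Return the content of the last boxed answer, honoring nested braces.
    start = text.rfind(BOXED)
    if start == -1:
        return None
    depth, chars = 1, []
    for ch in text[start + len(BOXED):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars).strip()
        chars.append(ch)
    return None  # unbalanced braces


def to_fraction(s):
    # Parse "12", "0.75", "3/4" or "\frac{3}{4}" exactly; None if unsupported.
    s = s.replace(" ", "")
    m = re.fullmatch(r"\\frac\{(-?\d+)\}\{(-?\d+)\}", s)
    try:
        if m:
            return Fraction(int(m.group(1)), int(m.group(2)))
        return Fraction(s)
    except (ValueError, ZeroDivisionError):
        return None


def is_correct(model_output, reference):
    # Rule: compare exact rational values, else fall back to string equality.
    answer = extract_boxed(model_output)
    if answer is None:
        return False
    a, b = to_fraction(answer), to_fraction(reference)
    if a is None or b is None:
        return answer.replace(" ", "") == reference.replace(" ", "")
    return a == b


print(is_correct(r"Hence the answer is \boxed{\frac{3}{4}}.", "0.75"))  # True
```

A checker of this kind scores only the final value, which is why a correct guess and a correct derivation are indistinguishable to it.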

What carries the argument

The OlymMATH benchmark with EASY and HARD subsets for natural language numerical evaluation plus the LEAN subset for formal verification in Lean 4.

If this is right

  • Models exhibit significant performance drops on this benchmark compared to prior ones.
  • Consistent performance gaps appear between English and Chinese versions of the same problems.
  • Analysis identifies cases where models employ heuristic guessing rather than rigorous reasoning.
  • The released 582k reasoning trajectories and visualization tool enable further study of model behavior.
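The cross-language gap in the second bullet can be made concrete with a small calculation. The sketch below assumes Pass@1 is estimated as the per-problem fraction of correct sampled responses, averaged over problems; the summary does not state the exact estimator. The function names and the toy data are hypothetical and are not results from the paper.

```python
from statistics import mean


def pass_at_1(results):
    # results: problem id -> list of booleans, one per sampled response.
    per_problem = [sum(samples) / len(samples) for samples in results.values()]
    return mean(per_problem)


def language_gap(en_results, zh_results):
    # Compare only problems present in both language versions.
    shared = sorted(set(en_results) & set(zh_results))
    en = pass_at_1({p: en_results[p] for p in shared})
    zh = pass_at_1({p: zh_results[p] for p in shared})
    return en - zh  # positive: the model does better in English


# Toy data for illustration only.
en_toy = {"p1": [True, True, False, False], "p2": [True, True, True, True]}
zh_toy = {"p1": [True, False, False, False], "p2": [True, True, False, False]}

print(pass_at_1(en_toy))             # 0.75
print(language_gap(en_toy, zh_toy))  # 0.375
```

A positive gap corresponds to a point above the parity line in an EN-versus-ZH scatter plot, and a negative gap to a point below it.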

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The dual evaluation setup could be applied to other domains to test whether models perform genuine reasoning or pattern matching.
  • Integration of formal verification tools with language models might become a standard way to improve proof reliability.
  • If models show steady gains on OlymMATH over time, it would indicate measurable progress in handling complex mathematical tasks.
  • Releasing parallel language versions allows direct study of how training data language affects reasoning consistency.

Load-bearing premise

Problems manually sourced from printed publications are absent from model training data, and expert verification plus Lean formalization yields reliable test items that distinguish reasoning from guessing.

What would settle it

A model achieving high scores on the Lean subset while producing invalid or incomplete formal proofs would show the benchmark does not enforce process-level evaluation.
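What "incomplete" means for a formal proof can be illustrated in Lean 4. The toy statement below is not taken from OlymMATH-LEAN. It only shows that a proof containing `sorry` compiles with a warning while leaving the theorem unproven, so a process-level grader (assumed here) would have to reject it. The second proof is fully checked by the kernel.

```lean
-- Toy statement, not from the benchmark.
-- Incomplete: `sorry` closes the goal without proof; Lean emits a warning.
theorem toy_comm_incomplete (a b : Nat) : a + b = b + a := by
  sorry

-- Complete: the kernel checks this proof term against the statement.
theorem toy_comm (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```

Under this reading, a scoring pipeline that counted the first theorem as solved would be exactly the failure the paragraph above describes.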

Figures

Figures reproduced from arXiv: 2503.21380 by Haoxiang Sun, Ji-Rong Wen, Wayne Xin Zhao, Yingqian Min, Zhipeng Chen.

Figure 1. Performance comparisons of mainstream reasoning models between our OlymMATH …
Figure 2. Examples from the MATH dataset and our OlymMATH dataset.
Figure 3. Pass@1 accuracy on OlymMATH EN (y) vs. ZH (x); the dashed line shows parity. Points above favor English, below favor Chinese. Solid circles (local dense models, colored by size) indicate larger models trend towards higher accuracy. Hollow diamonds are MoE or API evaluated models.
Figure 4. Correlation of Pass@1 performance: OlymMATH-EN vs. AIME24. Dashed lines indicate linear trends per dataset. Solid shapes are local dense models (size = model size, color = release date). Hollow shapes denote MoE or API evaluated models. Stars mark the best overall model.
Figure 5. The OlymMATH-demo interface. It is currently being maintained on HuggingFace Spaces.
Figure 6. A geometry problem described precisely in text from OlymMATH.
Figure 7. An OlymMATH-HARD example testing model’s identification of all possible answers.
Figure 8. This boxplot shows that our EASY dataset has AIME-level difficulty with a wider distribu…
Figure 9. An example during our data collection. o3-mini (high) found the correct answer without …
Figure 10. An example from AIME 2025. o3-mini (high) forgot to prove that …
Figure 11. An example from Omni-MATH. The solution provided by Omni-MATH itself is flawed …
Figure 12. An example from OlymMATH-EN-HARD subset. o3-mini (high) attempted to “guess” …
Abstract

The rapid advancement of large reasoning models has saturated existing math benchmarks, underscoring the urgent need for more challenging evaluation frameworks. To address this, we introduce OlymMATH, a rigorously curated, Olympiad-level math benchmark comprising 350 problems, each with parallel English and Chinese versions. OlymMATH is the first benchmark to unify dual evaluation paradigms within a single suite: (1) natural language evaluation through OlymMATH-EASY and OlymMATH-HARD, comprising 200 computational problems with numerical answers for objective rule-based assessment, and (2) formal verification through OlymMATH-LEAN, offering 150 problems formalized in Lean 4 for rigorous process-level evaluation. All problems are manually sourced from printed publications to minimize data contamination, verified by experts, and span four core domains. Extensive experiments reveal the benchmark's significant challenge, and our analysis also uncovers consistent performance gaps between languages and identifies cases where models employ heuristic "guessing" rather than rigorous reasoning. To further support community research, we release 582k+ reasoning trajectories, a visualization tool, and expert solutions at https://github.com/RUCAIBox/OlymMATH.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The paper introduces OlymMATH, a benchmark of 350 Olympiad-level math problems with parallel English and Chinese versions. It unifies two evaluation paradigms in one suite: natural-language evaluation on OlymMATH-EASY and OlymMATH-HARD (200 computational problems with numerical answers for rule-based scoring) and formal verification on OlymMATH-LEAN (150 problems formalized in Lean 4 for process-level checking). All items were manually sourced from printed publications, expert-verified, and drawn from four core domains. Experiments are reported to demonstrate that the benchmark remains challenging for current models, to document consistent cross-language performance gaps, and to identify instances of heuristic guessing rather than rigorous reasoning. The authors additionally release 582k+ reasoning trajectories, a visualization tool, and expert solutions.

Significance. If the reported results and sourcing claims hold, OlymMATH supplies a timely, higher-difficulty resource that combines outcome-based and process-based evaluation within a single, contamination-resistant collection. The open release of trajectories and tooling directly supports reproducibility and follow-on analysis of reasoning failures. These elements address saturation in existing math benchmarks and provide a concrete testbed for distinguishing genuine reasoning from surface heuristics.

minor comments (2)
  1. The abstract states that problems 'span four core domains' but does not enumerate them; the introduction or §2 should list the domains explicitly for clarity.
  2. The claim that OlymMATH is 'the first benchmark to unify dual evaluation paradigms' would be strengthened by a concise comparison table against the closest prior suites (e.g., those that separately offer Lean formalizations or numerical-answer sets).

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive evaluation of OlymMATH and for recommending minor revision. The recognition that the benchmark addresses saturation in existing math evaluations through its bilingual design, dual evaluation paradigms, and contamination-resistant sourcing is appreciated. We note the recommendation and will incorporate any minor clarifications in the revised manuscript.

Circularity Check

0 steps flagged

No significant circularity


This is an empirical benchmark paper that introduces OlymMATH by manually sourcing 350 problems from printed publications, followed by expert verification and Lean formalization. No derivations, equations, fitted parameters, predictions, or self-citations are used to justify any result; performance gaps and contamination claims rest on external sourcing and testing rather than internal construction. The central claim of unifying evaluation paradigms is realized through the benchmark's construction itself and does not reduce to any self-referential step.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical benchmark introduction paper rather than a theoretical derivation; therefore the ledger contains no free parameters, axioms, or invented entities.



Forward citations

Cited by 13 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. MathNet: a Global Multimodal Benchmark for Mathematical Reasoning and Retrieval

    cs.AI 2026-04 accept novelty 8.0

    MathNet delivers the largest multilingual Olympiad math dataset and benchmarks where models like Gemini-3.1-Pro reach 78% on solving but embedding models struggle on equivalent problem retrieval, with retrieval augmen...

  2. Beyond Accuracy: Evaluating Strategy Diversity in LLM Mathematical Reasoning

    cs.AI 2026-05 unverdicted novelty 7.0

    Frontier LLMs achieve 95-100% accuracy on AMC/AIME problems but recover far fewer distinct valid strategies than human references, while collectively generating 50 novel strategies.

  3. MathDuels: Evaluating LLMs as Problem Posers and Solvers

    cs.CL 2026-04 unverdicted novelty 7.0

    Self-play between LLMs for problem authoring and solving, scored via Rasch modeling, shows that authoring and solving skills are partially decoupled and that the benchmark difficulty evolves with new models.

  4. Low-rank Optimization Trajectories Modeling for LLM RLVR Acceleration

    cs.LG 2026-04 unverdicted novelty 7.0

    NExt accelerates RLVR training for LLMs by nonlinearly extrapolating low-rank parameter trajectories extracted from LoRA runs.

  5. Stress-Testing the Reasoning Competence of LLMs With Proofs Under Minimal Formalism

    cs.LO 2026-04 unverdicted novelty 7.0

    ProofGrid is a new benchmark for LLM reasoning that uses machine-checkable proofs in minimal formal notation, revealing progress on basic tasks but major gaps in complex combinatorial and synthesis reasoning.

  6. MathArena: Evaluating LLMs on Uncontaminated Math Competitions

    cs.AI 2025-05 unverdicted novelty 7.0

    MathArena evaluates over 50 LLMs on 162 fresh competition problems across seven contests, detects contamination in AIME 2024, and reports top models scoring below 40 percent on IMO 2025 proof tasks.

  7. Unified Data Selection for LLM Reasoning

    cs.CL 2026-05 unverdicted novelty 6.0

    High-Entropy Sum (HES) selects high-quality reasoning data for LLMs by summing entropy of the top highest-entropy tokens, matching full-dataset performance with top 20% in SFT and outperforming baselines in RFT and RL.

  8. Addressing Performance Saturation for LLM RL via Precise Entropy Curve Control

    cs.LG 2026-04 unverdicted novelty 6.0

    Entrocraft uses rejection sampling to enforce custom entropy curves in LLM RL, sustaining longer training, better generalization, and higher output diversity than prior regularization approaches.

  9. Addressing Performance Saturation for LLM RL via Precise Entropy Curve Control

    cs.LG 2026-04 unverdicted novelty 6.0

    Entrocraft uses rejection sampling to enforce precise entropy schedules in LLM RL by biasing advantages, enabling longer training, better generalization, and higher performance than baselines.

  10. OMIBench: Benchmarking Olympiad-Level Multi-Image Reasoning in Large Vision-Language Model

    cs.CV 2026-04 unverdicted novelty 6.0

    OMIBench benchmark reveals that current LVLMs achieve at most 50% on Olympiad problems requiring reasoning across multiple images.

  11. TEMPO: Scaling Test-time Training for Large Reasoning Models

    cs.LG 2026-04 unverdicted novelty 6.0

    TEMPO scales test-time training for large reasoning models by interleaving policy refinement on unlabeled data with critic recalibration on labeled data via an EM formulation, yielding large gains on AIME tasks.

  12. PRL-Bench: A Comprehensive Benchmark Evaluating LLMs' Capabilities in Frontier Physics Research

    cs.LG 2026-04 unverdicted novelty 6.0

    PRL-Bench evaluates frontier LLMs on 100 real physics research tasks and finds the best models score below 50, exposing a gap to autonomous discovery.

  13. Riemann-Bench: A Benchmark for Moonshot Mathematics

    cs.AI 2026-04 conditional novelty 5.0

    Riemann-Bench is a private benchmark of 25 research-level math problems on which all tested frontier AI models score below 10%.
