KCSAT-ML: Probing Reasoning Models with Nationwide-Cohort Human Difficulty

Geewook Kim; Kee-Eung Kim; Sanghee Park

arxiv: 2606.10403 · v2 · pith:7DIL3M67new · submitted 2026-06-09 · 💻 cs.CL

Pith reviewed 2026-06-27 13:06 UTC · model grok-4.3

classification 💻 cs.CL

keywords: math reasoning benchmark, human difficulty, test-time scaling, error alignment, VLMs, LLMs, KCSAT, reasoning evaluation

The pith

At low token budgets, model accuracy collapses on the items humans find hardest; test-time scaling raises token use roughly linearly with human error rate but yields non-monotonic accuracy gains.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents KCSAT-ML, a collection of real math problems from Korean college entrance exams that include official per-item error rates drawn from nationwide cohorts of hundreds of thousands of students. It pairs this benchmark with the Difficulty-aligned Reasoning Gain metric, which checks whether a model's mistakes fall on the same items that humans missed rather than simply counting total correct answers. Across vision-language and language models, low-budget performance drops sharply on the tail of items with high human error rates. Test-time scaling uses more tokens in rough proportion to human error rates, yet accuracy improvements do not rise steadily and can reverse direction within the same model family. Models that post nearly identical overall accuracy scores can still differ sharply in whether their errors match or diverge from human difficulty patterns.

Core claim

KCSAT-ML supplies 664 math problems with official per-item error rates from large human cohorts, together with the DRG metric that quantifies how well model errors match human difficulty patterns. This combination shows low-budget accuracy collapses on the high-human-error tail at every model size, test-time scaling raises token use roughly linearly with cohort error rate while accuracy gains follow a non-monotonic curve, and within a single family TTS flips between anti-scaling on the hardest items and overthinking on easier ones. Models with near-identical accuracy can occupy opposite positions on the DRG scale.

What carries the argument

The KCSAT-ML benchmark of exam problems carrying nationwide human error rates, paired with the Difficulty-aligned Reasoning Gain metric that measures alignment between model and human mistake patterns.
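
To make the idea of difficulty alignment concrete, here is a minimal sketch. It does not implement the paper's DRG formula, which this page does not reproduce. It uses a hypothetical stand-in score, `miss_difficulty_gap`: the mean cohort error rate over the items a model missed, minus the mean over all items. The item values and both toy models are invented for illustration.

```python
from statistics import mean

def miss_difficulty_gap(human_error, model_correct):
    """Hypothetical stand-in for an alignment score (NOT the paper's DRG).

    human_error: per-item cohort error rates in [0, 1].
    model_correct: per-item booleans, True if the model solved the item.
    Returns mean cohort error over missed items minus mean over all items;
    positive means the misses sit on items humans also found hard.
    """
    if len(human_error) != len(model_correct):
        raise ValueError("inputs must have the same length")
    missed = [h for h, ok in zip(human_error, model_correct) if not ok]
    if not missed:
        return 0.0  # no misses: nothing to align
    return mean(missed) - mean(human_error)

# Invented toy data: ten items ordered from easy to hard for humans.
human_error = [0.05, 0.10, 0.15, 0.20, 0.30, 0.40, 0.55, 0.70, 0.85, 0.95]

# Both toy models solve 7 of 10 items.
model_a = [True] * 7 + [False] * 3   # misses the three hardest items
model_b = [False] * 3 + [True] * 7   # misses the three easiest items

gap_a = miss_difficulty_gap(human_error, model_a)
gap_b = miss_difficulty_gap(human_error, model_b)
print(f"accuracy A={sum(model_a)/10:.1f}, B={sum(model_b)/10:.1f}")
print(f"gap A={gap_a:+.3f}, gap B={gap_b:+.3f}")
```

Both toy models have the same accuracy, yet the stand-in score is positive for the first and negative for the second, which is the kind of contrast an accuracy-only comparison cannot show.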

If this is right

Low-budget accuracy collapses on the high-human-error tail at every model size.
Test-time scaling raises token use roughly linearly with cohort error rate while accuracy gains follow a non-monotonic curve.
Within a single family, TTS flips between anti-scaling on the hardest items and overthinking on easier ones.
Models with near-identical accuracy can sit at near-opposite DRG values.
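
The first consequence can be checked by grouping items into human error-rate bins and reporting accuracy per bin. The sketch below assumes a hypothetical input of `(human_error_pct, model_correct)` pairs and a 10-point bin width; the bin edges and the toy items are illustrative assumptions, not the paper's exact binning.

```python
from collections import defaultdict

def accuracy_by_error_bin(items, bin_width=10):
    """Group items by cohort error rate (percent) and return accuracy per bin.

    items: iterable of (human_error_pct, model_correct) pairs.
    Bins are labelled by their lower edge, e.g. 80 covers [80, 90);
    a rate of exactly 100 is folded into the top bin.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    top = 100 - bin_width
    for error_pct, correct in items:
        if not 0 <= error_pct <= 100:
            raise ValueError(f"error rate out of range: {error_pct}")
        lower = min(int(error_pct // bin_width) * bin_width, top)
        totals[lower] += 1
        hits[lower] += int(correct)
    return {b: hits[b] / totals[b] for b in sorted(totals)}

# Invented toy items: (cohort error rate in percent, model solved it?).
toy_items = [(12, True), (18, True), (35, True), (38, False),
             (62, True), (67, False), (85, False), (100, False)]
per_bin = accuracy_by_error_bin(toy_items)
for lower, acc in per_bin.items():
    print(f"{lower:3d}-{lower + 10:3d}%: accuracy {acc:.2f}")
```

A collapse on the high-human-error tail would appear here as per-bin accuracy dropping toward zero in the top bins while the low bins stay high.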

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Training objectives could reward alignment between model errors and human difficulty distributions in addition to raw accuracy.
Similar cohort-based difficulty signals could be collected for exams in other subjects to test whether the scaling patterns generalize.
Inference procedures might incorporate early estimates of item difficulty to reduce overthinking on easy problems while sustaining effort on hard ones.
Aggregate accuracy scores alone are insufficient for comparing reasoning behavior across models.

Load-bearing premise

Nationwide cohort error rates from the Korean exam provide an unbiased, model-independent measure of item difficulty that can be directly compared to model error patterns.

What would settle it

A model achieving high accuracy on the high-human-error items at low token budgets, without exposure to the exam data, would challenge the reported collapse pattern.

Figures

Figures reproduced from arXiv: 2606.10403 by Geewook Kim, Kee-Eung Kim, Sanghee Park.

**Figure 1.** Same score, opposite reasoning. Two models both solve 7/10, but their mistakes land on opposite ends of the human-difficulty axis: Model A fails where humans also fail (DRG +18), Model B fails on items humans find easy (DRG −4). Accuracy alone cannot tell them apart; DRG (Sec. 4.7) can.

**Figure 2.** Cross-model temporal accuracy on KCSAT-ML. Per-year mean accuracy under wo. TTS (dashed) and w. TTS (solid) for GPT-5, Claude-Sonnet-4.5, and Gemini-3-Pro (2014–2025); grey diamonds: human cohort. Shaded: post-cutoff window anchored on GPT-5.

**Figure 3.** Model scaling vs. test-time scaling (TTS) by human difficulty. KCSAT-ML accuracy across human error-rate bins for Qwen3-VL (4B vs. 32B) and the top-10 closed APIs, with and without TTS.

**Figure 4.** When scaling helps: Qwen3-VL-8B vs. 32B. Accuracy vs. estimated FLOPs on Easy and Hard subsets (split at 50% cohort error).

**Figure 5.** Cost–accuracy trade-off of TTS. Output tokens (blue), inference time (orange), and accuracy (purple) by human error rate, under wo. TTS (top) and w. TTS (bottom).

**Figure 6.** Cohort error rate predicts model behaviour more strongly than examiner-assigned tiers. Per-item mean accuracy (wo. TTS) vs. (a) examiner points and (b) cohort error rate; N=339.

**Figure 7.** Three regimes of TTS difficulty alignment. Per-item model error rate (y) vs. cohort error rate (x), with wo. TTS (dashed) and w. TTS (solid) regression lines; s marks the fitted slope (intercepts in Appendix C). Top: weak/mid-tier VLMs; bottom: frontier closed-source APIs.

**Figure 8.** DRG vs. accuracy. For 22 model families with both wo. TTS and w. TTS runs, accuracy at max TTS effort (x) against DRG at first TTS turn-on (y); dashed: linear fit. Vertical gold bands highlight clusters with similar accuracy but wide DRG spread. Closed-weight (red) and open-weight (blue) distribute similarly along y (Mann–Whitney p=0.95).

**Figure 9.** Example of a geometry-based KCSAT-ML item. A representative geometry problem (2025, #28) with sample wo. TTS and w. TTS responses from Gemini-2.5-Flash. The accompanying diagram (right) is omitted in this example and shown only as a placeholder.

**Figure 10.** Example of an algebraic KCSAT-ML item. A representative algebra problem (2025, #6) with sample wo. TTS and w. TTS responses from HyperCLOVAX-SEED-Think-14B. The item (2025 KCSAT, Common #6, 3 points): given $\cos\left(\frac{\pi}{2} + \theta\right) = -\frac{1}{5}$, find the value of $\frac{\sin\theta}{1 - \cos^2\theta}$. Choices: ① $-5$ ② $-\sqrt{5}$ ③ $0$ ④ $\sqrt{5}$ ⑤ …

**Figure 11.** Human vs. model error rates across question numbers. Error rates over 30 questions in KCSAT-ML, grouped by question format (multiple-choice vs. short-answer) and period: (A) 2014–2021 and (B) 2022–2025. Periods are split by the 2022 introduction of the common-plus-elective structure.

**Figure 12.** Estimated wall-clock time to complete a full KCSAT math exam (6000 s limit). Per-tier average inference latencies weighted by expected question counts (Easy/Medium/Hard plus a “Very Easy” bin for items without published statistics), with a fixed pipeline overhead. (A) Base models; (B) Thinking models. Models left of the dashed line finish within time.

**Figure 13.** Prompt templates used in experiments. VLM and LLM settings under wo. TTS and w. TTS, plus the equivalence-judge prompt for answer verification.

**Figure 14.** Sample of structured metadata. Each KCSAT-ML item links a schematic of its exam page (left; our own depiction, not the original exam) to structured annotation (right): attributes, page/bounding-box references, answers, and nationwide-cohort human-error statistics. Sample annotation (excerpt): { "meta": { "year": "2025", "subject": "Mathematics", "domain": "Geometry", "question_number": "28", "problem_type": "multiple-choice", "points": "4", "choices": ["$\sqrt{43}$", …

Original abstract

Math reasoning benchmarks have proliferated, yet most lack a per-item difficulty signal grounded in actual human performance. We introduce KCSAT-ML, a decade (2014-2025) of Korean College Scholastic Ability Test (KCSAT; Suneung) mathematics: 664 problems with a 339-item core set carrying official per-item error rates from nationwide cohorts of hundreds of thousands of examinees. We pair the benchmark with Difficulty-aligned Reasoning Gain (DRG): a score-orthogonal metric that asks whether a model's mistakes concentrate on the items humans found hard, or on items humans found easy. Together they expose, across a wide range of VLMs (and LLMs via OCR), three patterns: (i) low-budget accuracy collapses on the high-human-error tail at every model size; (ii) test-time scaling (TTS) raises token use roughly linearly with cohort error rate, while accuracy gains follow a non-monotonic curve; (iii) within a single family, TTS flips between anti-scaling on the hardest items and overthinking on easier ones -- two faces of the same alignment failure. On DRG, models with near-identical accuracy can sit at near-opposite values: one model gets wrong what humans also find hard, while another solves the hardest items yet fails on items humans find easy -- a contrast that aggregate accuracy hides. Our code and dataset builder will be open-sourced at https://github.com/naver-ai/KCSAT-ML.
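
Pattern (ii) combines two separate measurements: a slope of token use against cohort error rate, and a shape check on accuracy gains. The sketch below illustrates both with a hand-written least-squares fit. The function name `ols_slope_intercept` and every number in the toy data are assumptions made for illustration, not values from the paper.

```python
def ols_slope_intercept(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with >= 2 points")
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values are all identical")
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

# Invented toy data: cohort error rate (percentage points) per bin,
# and mean output tokens with TTS in that bin.
error_pct = [20, 40, 60, 80]
tokens = [500, 1500, 2500, 3500]
slope, intercept = ols_slope_intercept(error_pct, tokens)

# Invented accuracy gains (pp) that rise and then fall with difficulty.
gain_pp = [5, 20, 25, 10]
diffs = [b - a for a, b in zip(gain_pp, gain_pp[1:])]
is_monotonic = all(d >= 0 for d in diffs) or all(d <= 0 for d in diffs)

print(f"tokens per pp of cohort error: {slope:.1f}")
print(f"accuracy gain monotonic in difficulty: {is_monotonic}")
```

Keeping the two checks separate matters for the claim: the token curve can be close to a straight line while the gain curve peaks at moderate difficulty.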

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

KCSAT-ML supplies real per-item human error rates from large cohorts plus a DRG metric, but contamination from public exam data remains unaddressed.

The paper's core contribution is the KCSAT-ML set of 664 Korean math problems, including 339 with official nationwide error rates from hundreds of thousands of students across a decade. Paired with DRG, which checks whether model errors track human difficulty rather than just overall score, it surfaces patterns like accuracy drops on the hardest items and non-monotonic gains under test-time scaling.

This is genuinely new. Prior math benchmarks rarely attach per-item human performance data at this scale, so the dataset itself and the DRG construction give evaluators a concrete alternative to synthetic difficulty measures.

The reported patterns on low-budget collapse and within-family TTS flips are worth seeing if the data hold. DRG also usefully separates models that look identical on accuracy alone.

The main soft spot is contamination. These exact problems and solutions have circulated on Korean web forums, textbooks, and prep sites for years. Any model trained on large web crawls could have encountered them, which would produce exactly the observed error patterns through memorization rather than reasoning alignment. The abstract gives no decontamination steps or overlap checks, so the central claims rest on an assumption that needs explicit testing.
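
One simple form an overlap check could take is character n-gram matching between a problem statement and candidate web documents. The sketch below is an assumption about how such an audit might start, not a procedure described in the paper; the function names, the n-gram length of 13, and the example strings are all hypothetical. Character n-grams are used because they need no tokenizer and so apply to Korean text as well.

```python
def char_ngrams(text, n=13):
    """Return the set of character n-grams after whitespace normalisation."""
    norm = " ".join(text.split()).lower()
    return {norm[i:i + n] for i in range(len(norm) - n + 1)}

def overlap_ratio(problem, corpus_doc, n=13):
    """Fraction of the problem's n-grams that also occur in corpus_doc."""
    grams = char_ngrams(problem, n)
    if not grams:
        return 0.0  # problem shorter than n characters
    return len(grams & char_ngrams(corpus_doc, n)) / len(grams)

# Hypothetical example strings.
problem = "Given cos(pi/2 + theta) = -1/5, find sin(theta) / (1 - cos^2(theta))."
leaked_page = "Forum post: " + problem + " Answer inside."
unrelated_page = "A recipe for kimchi stew with tofu and pork."

leak_score = overlap_ratio(problem, leaked_page)
clean_score = overlap_ratio(problem, unrelated_page)
print(f"leaked page overlap:    {leak_score:.2f}")
print(f"unrelated page overlap: {clean_score:.2f}")
```

Exact matching of this kind only catches verbatim copies. Paraphrased or translated solutions would need fuzzier matching, and a real audit would also need access to the training corpus or a proxy for it.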

DRG orthogonality also requires the full definition and any post-hoc choices to be shown clearly; without that, it is hard to confirm it adds information beyond accuracy.

This is for groups working on reasoning evaluation and test-time compute who want human-grounded signals. It deserves peer review because the dataset and metric idea are fresh enough to warrant referee time, provided the contamination and validation issues get addressed.

Referee Report

2 major / 2 minor

Summary. The paper introduces KCSAT-ML, a benchmark of 664 KCSAT mathematics problems (339-item core) with official per-item error rates from large nationwide human cohorts (2014-2025), and proposes the Difficulty-aligned Reasoning Gain (DRG) metric, which it claims is score-orthogonal. It reports three patterns across VLMs/LLMs: (i) low-budget accuracy collapses on the high-human-error tail at every model size; (ii) test-time scaling increases token use roughly linearly with cohort error rate, but accuracy gains are non-monotonic; (iii) within a model family, TTS flips between anti-scaling on the hardest items and overthinking on easier ones. Models with similar accuracy can show opposite DRG values.

Significance. If the nationwide cohort error rates provide an exogenous, model-independent difficulty axis and DRG is verifiably orthogonal without contamination artifacts, the benchmark would offer a useful human-grounded probe for reasoning alignment that aggregate accuracy metrics miss, with potential to expose scaling limitations on difficult items.

major comments (2)

[Methods / Data construction] The central claims (accuracy collapse on high-error tail, linear TTS token scaling, within-family sign flip, and DRG orthogonality) all treat KCSAT error rates as an unbiased exogenous difficulty measure. No validation is provided that problems/solutions have not appeared in Common Crawl-scale training data via Korean web forums, textbooks, or exam-prep sites; differential memorization could produce the exact observed patterns without any reasoning alignment. This assumption is load-bearing and requires explicit checks (e.g., contamination audits or exclusion criteria) in the methods.
[DRG definition (likely §3 or §4)] DRG is presented as score-orthogonal by construction, yet the abstract provides no equations, exclusion criteria, or validation that it is independent of accuracy; the reader's note indicates full methods are needed to confirm it does not reduce to fitted parameters or self-referential definitions.

minor comments (2)

[Abstract] The abstract states the dataset builder will be open-sourced but does not specify the exact release contents (e.g., raw cohort statistics, OCR pipeline, or per-item metadata) needed for reproducibility.
[Results] No table or figure numbers are referenced in the provided abstract for the three reported patterns; cross-referencing to specific results would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments emphasizing methodological rigor. We address each major point below and will incorporate revisions to strengthen the claims regarding data validity and metric transparency.

Referee: [Methods / Data construction] The central claims (accuracy collapse on high-error tail, linear TTS token scaling, within-family sign flip, and DRG orthogonality) all treat KCSAT error rates as an unbiased exogenous difficulty measure. No validation is provided that problems/solutions have not appeared in Common Crawl-scale training data via Korean web forums, textbooks, or exam-prep sites; differential memorization could produce the exact observed patterns without any reasoning alignment. This assumption is load-bearing and requires explicit checks (e.g., contamination audits or exclusion criteria) in the methods.

Authors: We agree that explicit contamination validation is necessary to confirm the error rates function as an exogenous, model-independent difficulty axis. The per-item error rates derive directly from official nationwide human cohorts (hundreds of thousands of examinees per year), independent of model training. However, the original submission did not include dedicated audits for leakage via Common Crawl, Korean forums, textbooks, or prep sites. In revision we will add a data-construction subsection reporting systematic searches for problem/solution overlaps, explicit exclusion criteria, and re-computed results on the decontaminated core set to verify that the reported patterns persist. revision: yes
Referee: [DRG definition (likely §3 or §4)] DRG is presented as score-orthogonal by construction, yet the abstract provides no equations, exclusion criteria, or validation that it is independent of accuracy; the reader's note indicates full methods are needed to confirm it does not reduce to fitted parameters or self-referential definitions.

Authors: Section 3 of the manuscript already contains the full DRG definition and the construction that enforces score-orthogonality via normalization against human error rates. To address the concern about transparency, we will (i) add a concise equation and orthogonality statement to the abstract, (ii) expand the methods with explicit exclusion criteria and empirical validation (correlation tables across models showing near-zero dependence on aggregate accuracy), and (iii) include additional checks confirming DRG does not collapse to fitted or self-referential parameters. revision: yes
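
The promised correlation check could look roughly like the following: compute Pearson's r between per-model accuracy and per-model alignment score, and see whether it is near zero. This is a sketch under stated assumptions. The `pearson_r` helper is written out by hand, and the per-model numbers are invented placeholders rather than reported results.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with >= 2 points")
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation undefined for a constant sequence")
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sqrt(sxx * syy)

# Invented placeholders: accuracy (%) and an alignment score per model.
accuracy = [60, 70, 80, 90]
score_independent = [5, -5, -5, 5]   # constructed to be uncorrelated
score_redundant = [1, 2, 3, 4]       # constructed to track accuracy exactly

r_independent = pearson_r(accuracy, score_independent)
r_redundant = pearson_r(accuracy, score_redundant)
print(f"independent score: r = {r_independent:+.2f}")
print(f"redundant score:   r = {r_redundant:+.2f}")
```

A near-zero r across many models would support the orthogonality claim only for linear dependence; a rank-based or nonlinear check would be a reasonable complement.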

Circularity Check

0 steps flagged

No significant circularity; empirical benchmark with definitional metric

The paper introduces KCSAT-ML as an external benchmark grounded in nationwide human cohort error rates (2014-2025 KCSAT data) and defines DRG explicitly as a score-orthogonal metric. Reported patterns are direct observational comparisons of model accuracy and token usage against these human rates, with no equations, fitted parameters, or derivations that reduce to inputs by construction. No self-citations are invoked as load-bearing premises, and the orthogonality of DRG is stated upfront rather than derived as a result. The analysis remains self-contained against the provided benchmark data without any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central observations rest on treating official KCSAT error rates as ground-truth difficulty and on the assumption that DRG isolates alignment independent of raw accuracy.

axioms (1)

Domain assumption: Nationwide KCSAT per-item error rates constitute an unbiased proxy for human difficulty independent of model training distributions.
Invoked when defining DRG and when interpreting the three patterns as alignment failures.

invented entities (1)

Difficulty-aligned Reasoning Gain (DRG): no independent evidence
purpose: Score-orthogonal metric that quantifies whether model errors concentrate on high-human-error items.
Newly introduced in the paper; no independent evidence supplied in abstract.

Reference graph

Works this paper leans on

19 extracted references · 9 canonical work pages · 5 internal anchors

[1] Zhenni Bi, Kai Han, Chuanjian Liu, Yehui Tang, and Yunhe Wang. 2024. Forest-of-Thought: Scaling Test-Time Compute for Enhancing LLM Reasoning. In International Conference on Machine Learning. https://doi.org/10.48550/arXiv.2412.09078

[2] Xingyu Chen, Jiahao Xu, Tian Liang, Zhiwei He, Jianhui Pang, Dian Yu, Linfeng Song, Qiuzhi Liu, Mengfei Zhou, Zhuosheng Zhang, Rui Wang, Zhaopeng Tu, Haitao Mi, and Dong Yu. 2024. Do Not Think That Much for 2+3=? On the Overthinking of o1-Like LLMs. arXiv preprint arXiv:2412.21187.

[3] DeepSeek-AI. 2025. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv preprint arXiv:2501.12948.

[4] Bofei Gao, Feifan Song, Zhe Yang, Zefan Cai, Yibo Miao, Qingxiu Dong, Lei Li, Chenghao Ma, Liang Chen, Runxin Xu, Zhengyang Tang, Benyou Wang, Daoguang Zan, Shanghaoran Quan, Ge Zhang, Lei Sha, Yichang Zhang, Xuancheng Ren, Tianyu Liu, and Baobao Chang. 2024. Omni-MATH: A Universal Olympiad Level Mathematic Benchm... https://doi.org/10.48550/arXiv.2410.07985

[5] Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Z. Thai, Junhao Shen, Jinyi Hu, Xu Han, Yujie Huang, Yuxiang Zhang, Jie Liu, Lei Qi, Zhiyuan Liu, and Maosong Sun. 2024. OlympiadBench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems. In Annual Meeting of the... https://doi.org/10.48550/arXiv.2402.14008

[6] Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. 2021. Measuring Mathematical Problem Solving With the MATH Dataset. In NeurIPS.

[7] Baohao Liao, Yuhui Xu, Hanze Dong, Junnan Li, C. Monz, Silvio Savarese, Doyen Sahoo, and Caiming Xiong. 2025. Reward-Guided Speculative Decoding for Efficient LLM Reasoning. In International Conference on Machine Learning. https://doi.org/10.48550/arXiv.2501.19324

[8] Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. 2024. Let's Verify Step by Step. In International Conference on Learning Representations.

[9] Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chun yue Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. 2023. MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts. In International Conference on Learning Representations.

[10] Ian R. McKenzie, Alexander Lyzhov, Michael Pieler, Alicia Parrish, Aaron Mueller, Ameya Prabhu, Euan McLean, Aaron Kirtland, Alexis Ross, Alisa Liu, and 1 others. 2023. Inverse Scaling: When Bigger Isn't Better. Transactions on Machine Learning Research (TMLR).

[11] Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Fei-Fei Li, Hanna Hajishirzi, Luke S. Zettlemoyer, Percy Liang, Emmanuel J. Candès, and Tatsunori Hashimoto. 2025. s1: Simple test-time scaling. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. https://doi.org/10.48550/arXiv.2501.19393

[12] Sanghee Park and Geewook Kim. 2025. Evaluating multimodal generative AI with Korean educational standards. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers), pages 671--688, Alb... https://doi.org/10.18653/v1/2025.naacl-short.56

[13] Runqi Qiao, Qiuna Tan, Guanting Dong, Minhui Wu, Chong Sun, Xiaoshuai Song, Zhuoma Gongque, Shanglin Lei, Zhe Wei, Miaoxuan Zhang, Runfeng Qiao, Yifan Zhang, Xiao Zong, Yida Xu, Muxi Diao, Zhimin Bao, Chen Li, and Honggang Zhang. 2024. We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reaso... https://doi.org/10.48550/arXiv.2407.01284

[14] C. Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. 2024. Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters. arXiv preprint arXiv:2408.03314. https://doi.org/10.48550/arXiv.2408.03314

[15] Guijin Son, Seungone Kim, Catherine Arnett, and 1 others. 2026. Soohak: A Mathematician-Curated Benchmark for Evaluating Research-level Math Capabilities of LLMs. arXiv preprint arXiv:2605.09063.

[16] Xuezhi Wang, Jason Wei, D. Schuurmans, Quoc Le, Ed H. Chi, and Denny Zhou. 2022. Self-Consistency Improves Chain of Thought Reasoning in Language Models. In International Conference on Learning Representations.

[17] Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, Ed H. Chi, Tatsunori Hashimoto, Oriol Vinyals, Percy Liang, Jeff Dean, and William Fedus. 2022a. Emergent Abilities of Large Language Models. Transactions on Machine Learning Research.

[18] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed H. Chi, F. Xia, Quoc Le, and Denny Zhou. 2022b. Chain of Thought Prompting Elicits Reasoning in Large Language Models. In Neural Information Processing Systems.

[19] Jiajie Zhang, Nianyi Lin, Lei Hou, Ling Feng, and Juanzi Li. 2025. AdaptThink: Reasoning Models Can Learn When to Think. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. https://doi.org/10.48550/arXiv.2505.13417
