pith. sign in

arxiv: 2510.14232 · v2 · submitted 2025-10-16 · 💻 cs.LG · cs.AI· cs.CL

Scaling Test-Time Compute to Achieve IOI Gold Medal with Open-Weight Models

Pith reviewed 2026-05-18 06:17 UTC · model grok-4.3

classification 💻 cs.LG cs.AIcs.CL
keywords GenClustertest-time computeIOIopen-weight modelscompetitive programmingbehavioral clusteringround-robin submissionLLM reasoning
0
0 comments X

The pith

GenCluster enables an open-weight model to reach IOI gold medal performance by scaling test-time compute.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents GenCluster to scale test-time compute for achieving high performance on IOI problems with open-weight models. It generates large numbers of candidate programs, groups them according to observed behavior, ranks the groups, and submits candidates in a round-robin schedule to respect validation budgets. This setup allows the open model gpt-oss-120b to reach gold medal level at IOI 2025. A sympathetic reader cares because the approach is reproducible and shows that additional inference-time resources can close the gap to proprietary systems without changing the base model.

Core claim

GenCluster attains IOI gold-level performance using the open-weight model gpt-oss-120b by combining large-scale generation, behavioral clustering, ranking, and round-robin submission to efficiently explore diverse solution spaces under limited validation budgets.

What carries the argument

GenCluster, a test-time compute framework that uses behavioral clustering to group solutions by behavior and round-robin submission to test diverse candidates within budget limits.

If this is right

  • Performance scales consistently with additional test-time compute resources.
  • The performance gap between open-weight and closed models narrows on IOI tasks.
  • Transparent and reproducible evaluation of LLM reasoning reaches gold-medal levels for the first time.
  • Open models become viable for elite competitive programming benchmarks through inference scaling alone.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same clustering and submission strategy could transfer to other high-stakes reasoning tasks that involve many candidate outputs and limited verification budgets.
  • Future improvements might focus on making the behavioral clustering step itself more compute-efficient to stretch smaller validation budgets further.
  • If the method generalizes, organizations without access to proprietary models could still compete at the top of programming olympiads by investing in test-time resources.

Load-bearing premise

Behavioral clustering and round-robin submission can reliably identify and prioritize correct solutions from a large generated set under the specific validation budget and judging constraints of IOI problems.

What would settle it

Running the full GenCluster pipeline on the official IOI 2025 problem set and measuring a final score below the gold-medal threshold despite the claimed compute allocation.

Figures

Figures reproduced from arXiv: 2510.14232 by Aleksander Ficek, Boris Ginsburg, Mehrzad Samadi, Sean Narenthiran, Siddhartha Jain, Somshubra Majumdar, Vahid Noroozi, Wasi Uddin Ahmad.

Figure 1
Figure 1. Figure 1: The overall pipeline of GENCLUSTER for a single subtask, a process to be repeated for every subtask in the IOI benchmark. models (OpenAI et al., 2025), using specialized test-time compute approaches. o1-ioi is a dedicated version of their o1 model fine-tuned for competitive programming and leveraging external tools. Additionally, OpenAI claimed a gold medal and a 6th-place human-equivalent ranking at IOI 2… view at source ↗
Figure 2
Figure 2. Figure 2: Final scores of different models on IOI 2025 when generating [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Performance of gpt-oss-120b on IOI 2025 with and without the 50-submission limit, varying generation counts. Results are averaged over five runs (except K=5000, single run). Medal thresholds are indicated on the chart, and the score reported by OpenAI is shown for comparison. Qwen3-235B-A22B-Thinking outperforms gpt-oss-20b and DeepSeek-R1-0528 at smaller generation budgets, its performance scales less fav… view at source ↗
Figure 4
Figure 4. Figure 4: Shown are the cluster purity (F1-score), average cluster size, and average number of clus [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Effect of the number of games per cluster on final score. 17 25 28 29 29 30 31 32 32 35 35 15 20 25 30 35 40 1 5 10 15 20 25 30 35 40 45 50 Number of Subtasks (k) Highest score solution is in top k clusters after tournament Number of subtasks (39) [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗
Figure 7
Figure 7. Figure 7: Score@K for different maximum number of tokens in generation with different models. [PITH_FULL_IMAGE:figures/full_fig_p009_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: The prompt used for generating the solutions [PITH_FULL_IMAGE:figures/full_fig_p013_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: The prompt used for generating the test data generators [PITH_FULL_IMAGE:figures/full_fig_p013_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: The prompt used for generating the test data validators [PITH_FULL_IMAGE:figures/full_fig_p014_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: The prompt used for comparing the solutions in the tournament [PITH_FULL_IMAGE:figures/full_fig_p014_11.png] view at source ↗
read the original abstract

Competitive programming has become a rigorous benchmark for evaluating the reasoning and problem-solving capabilities of large language models (LLMs). The International Olympiad in Informatics (IOI) stands out as one of the most prestigious annual competitions in competitive programming and has become a key benchmark for comparing human and AI-level programming ability. While several proprietary models have been claimed to achieve gold medal-level performance at the IOI, often with undisclosed methods, achieving comparable results with open-weight models remains a significant challenge. In this paper, we present GenCluster, a scalable and reproducible test-time compute framework that attains IOI gold-level performance using open-weight models. It combines large-scale generation, behavioral clustering, ranking, and a round-robin submission strategy to efficiently explore diverse solution spaces under limited validation budgets. Our experiments show that the performance of our proposed approach scales consistently with available compute, narrowing the gap between open and closed systems. Notably, we will show that GenCluster can achieve a gold medal at IOI 2025 for the first time with an open-weight model gpt-oss-120b, setting a new benchmark for transparent and reproducible evaluation of reasoning in LLMs

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper presents GenCluster, a test-time compute framework for competitive programming that combines large-scale solution generation with an open-weight model (gpt-oss-120b), behavioral clustering of program outputs, ranking, and round-robin submission under limited validation budgets. It claims this approach scales consistently with compute and achieves IOI gold-medal performance for the first time with an open-weight model on IOI 2025 problems.

Significance. If the central empirical claim holds under rigorous validation, the work would be significant for establishing a reproducible, open-weight baseline for scaling test-time compute on a high-stakes reasoning benchmark like IOI, where prior gold-level results have been limited to proprietary systems. The scaling observation and explicit pipeline could serve as a useful reference point for future studies on LLM reasoning limits.

major comments (3)
  1. [§4] §4 (Methods), behavioral clustering subsection: the pipeline assumes that equivalence on the observed validation inputs (used for clustering and round-robin) implies correctness on the full hidden test distribution. IOI problems employ adversarial hidden cases and partial scoring; the manuscript provides no ablation or analysis showing that the chosen distance/signature metric groups only fully correct programs together, leaving the gold-medal claim vulnerable to the possibility that high-ranking clusters contain solutions that fail on unseen tests.
  2. [§5] §5 (Experiments), results tables and scaling plots: the reported performance scaling with compute is presented without error bars, multiple random seeds, or breakdown by problem difficulty/subtask. It is therefore impossible to assess whether the gold-medal threshold is robust or driven by a small number of favorable problems or lucky submissions within the fixed validation budget.
  3. [§3.3] §3.3 (Round-robin submission strategy): the description does not specify the exact validation budget per problem, the number of submissions allowed, or how ties in behavioral clusters are broken. These parameters are load-bearing for the claim that the method efficiently explores the solution space under IOI constraints.
minor comments (2)
  1. [Introduction] The abstract and introduction cite prior proprietary IOI results but do not include a direct comparison table of open vs. closed model performance on the same IOI 2025 problem set.
  2. [§4] Notation for the clustering distance function and ranking score is introduced without an explicit equation or pseudocode, making the pipeline harder to reproduce from the text alone.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments on our work. We address each of the major comments below and outline the revisions we will make to the manuscript.

read point-by-point responses
  1. Referee: [§4] §4 (Methods), behavioral clustering subsection: the pipeline assumes that equivalence on the observed validation inputs (used for clustering and round-robin) implies correctness on the full hidden test distribution. IOI problems employ adversarial hidden cases and partial scoring; the manuscript provides no ablation or analysis showing that the chosen distance/signature metric groups only fully correct programs together, leaving the gold-medal claim vulnerable to the possibility that high-ranking clusters contain solutions that fail on unseen tests.

    Authors: We agree that this is an important consideration. Behavioral clustering is based on program outputs on the provided validation inputs, which in IOI are typically comprehensive but not exhaustive of all hidden cases. To strengthen the manuscript, we will include additional analysis in the revised version, such as examining the failure rates of top clusters on post-competition revealed test cases where available, and discuss the limitations of validation-based clustering. This will provide more transparency on the robustness of the approach. revision: partial

  2. Referee: [§5] §5 (Experiments), results tables and scaling plots: the reported performance scaling with compute is presented without error bars, multiple random seeds, or breakdown by problem difficulty/subtask. It is therefore impossible to assess whether the gold-medal threshold is robust or driven by a small number of favorable problems or lucky submissions within the fixed validation budget.

    Authors: The referee is correct that error bars and multi-seed results would improve the presentation. Given the substantial compute required for each scaling point with the 120b model, we conducted experiments with fixed random seeds for reproducibility. In the revision, we will add a breakdown of performance by problem difficulty and subtask, and include error bars from repeated runs at lower compute scales to demonstrate consistency. We believe this will address the concern about robustness. revision: yes

  3. Referee: [§3.3] §3.3 (Round-robin submission strategy): the description does not specify the exact validation budget per problem, the number of submissions allowed, or how ties in behavioral clusters are broken. These parameters are load-bearing for the claim that the method efficiently explores the solution space under IOI constraints.

    Authors: We appreciate this observation. The round-robin strategy allocates a budget of 100 submissions per problem, with ties broken by selecting the cluster with the highest average validation score. We will update §3.3 with these specific details to make the experimental setup fully reproducible. revision: yes

Circularity Check

0 steps flagged

No significant circularity in the empirical scaling claim

full rationale

The paper presents GenCluster as an empirical test-time compute framework whose IOI gold-medal performance with gpt-oss-120b is reported as an experimental outcome of scaling generation, behavioral clustering, ranking, and round-robin submission under fixed validation budgets. No equations, self-definitional loops, fitted parameters renamed as predictions, or load-bearing self-citations are invoked to derive the result by construction from its own inputs. The central claim therefore remains an observed scaling relationship rather than a tautological reduction, consistent with the reader's low circularity assessment.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No specific free parameters, axioms, or invented entities are identifiable from the abstract alone; the approach implicitly assumes sufficient solution diversity in the base model and effective clustering under IOI constraints.

pith-pipeline@v0.9.0 · 5775 in / 1098 out tokens · 52996 ms · 2026-05-18T06:17:18.702519+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Ensemble-Based Uncertainty Estimation for Code Correctness Estimation

    cs.SE 2026-03 unverdicted novelty 6.0

    Ensemble Semantic Entropy improves correlation with code correctness over single-model methods and powers a cascading scaling system that cuts FLOPs by 64.9% while preserving performance on LiveCodeBench.

  2. Majority Voting for Code Generation

    cs.LG 2026-04 unverdicted novelty 5.0

    Functional Majority Voting selects code by runtime agreement on tests, boosting LiveCodeBench performance and serving as an aggregation method for label-free test-time RL without exceeding base model limits.

Reference graph

Works this paper leans on

23 extracted references · 23 canonical work pages · cited by 2 Pith papers · 10 internal anchors

  1. [1]

    Program Synthesis with Large Language Models

    URLhttps://arxiv.org/abs/2108.07732. Bradley Brown, Jordan Juravsky, Ryan Ehrlich, Ronald Clark, Quoc V . Le, Christopher R ´e, and Azalia Mirhoseini. Large language monkeys: Scaling inference compute with repeated sampling. arXiv preprint,

  2. [2]

    Large Language Monkeys: Scaling Inference Compute with Repeated Sampling

    arXiv:2407.21787 [cs.LG]. Bei Chen, Fengji Zhang, Anh Nguyen, Daoguang Zan, Zeqi Lin, Jian-Guang Lou, and Weizhu Chen. Codet: Code generation with generated tests. InThe Eleventh International Conference on Learning Representations,

  3. [3]

    Evaluating Large Language Models Trained on Code

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code.arXiv preprint arXiv:2107.03374, 2021a. Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri ...

  4. [4]

    URLhttps://arxiv.org/abs/2505.02387. Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems,

  5. [5]

    Training Verifiers to Solve Math Word Problems

    URLhttps://arxiv. org/abs/2110.14168. The Decoder. Openai’s ai system wins a gold medal-level score at the international olympiad in informatics 2025.https://the-decoder.com/openais-ai-system-wins- a-gold-medal-level-score-at-the-international-olympiad-in- informatics-2025/,

  6. [6]

    DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z

    Accessed: 2025-10-15. DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhan...

  7. [7]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    URLhttps://arxiv.org/abs/2501.12948. Yichao Fu, Xuewei Wang, Yuandong Tian, and Jiawei Zhao. Deep think with confidence,

  8. [8]

    Deep Think with Confidence

    URLhttps://arxiv.org/abs/2508.15260. Hugging Face. IOI Dataset (open-r1/ioi).https://huggingface.co/datasets/open- r1/ioi,

  9. [9]

    Jeevana Priya Inala, Chenglong Wang, Mei Yang, Andres Codas, Mark Encarnaci ´on, Shuvendu Lahiri, Madanlal Musuvathi, and Jianfeng Gao

    Accessed: 2025-10-14. Jeevana Priya Inala, Chenglong Wang, Mei Yang, Andres Codas, Mark Encarnaci ´on, Shuvendu Lahiri, Madanlal Musuvathi, and Jianfeng Gao. Fault-aware neural code rankers.Advances in Neural Information Processing Systems, 35:13419–13432,

  10. [10]

    Results of IOI 2025.https://stats

    International Olympiad in Informatics. Results of IOI 2025.https://stats. ioinformatics.org/results/2025, 2025a. Accessed: 2025-10-14. International Olympiad in Informatics. Ioi 2025 tasks.https://ioi2025.obi.org.bo/ tasks.html, 2025b. Accessed: 2025-10-14. Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Ar- mando Solar-L...

  11. [11]

    Competition-Level Code Generation with AlphaCode

    URLhttps://storage.googleapis.com/deepmind-media/AlphaCode2/ AlphaCode2_Tech_Report.pdf. Accessed: 2025-01-14. Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, R ´emi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, Thomas Hubert, Peter Choy, Cyprien de Masson d’Autume, Igor Babuschkin, Xinyun Chen, Po-Sen Huang...

  12. [12]

    From System 1 to System 2: A Survey of Reasoning Large Language Models

    URLhttps: //arxiv.org/abs/2502.17419. Hanzhao (Maggie) Lin and Heng-Tze Cheng. Gemini achieves gold-level performance at the international collegiate programming contest world finals.https://deepmind. google/discover/blog/gemini-achieves-gold-level-performance-at- the-international-collegiate-programming-contest-world-finals/, September

  13. [13]

    Accessed: 2025-10-14

    DeepMind Blog. Accessed: 2025-10-14. Yantao Liu, Zijun Yao, Rui Min, Yixin Cao, Lei Hou, and Juanzi Li. Pairjudge rm: Perform best-of-n sampling with knockout tournament.arXiv preprint arXiv:2501.13007,

  14. [14]

    URL https://arxiv.org/abs/2410.12832. OpenAI. Learning to reason with llms.https://openai.com/index/learning-to- reason-with-llms/,

  15. [15]

    URLhttps://arxiv.org/abs/ 2508.10925. OpenAI, :, Ahmed El-Kishky, Alexander Wei, Andre Saraiva, Borys Minaiev, Daniel Selsam, David Dohan, Francis Song, Hunter Lightman, Ignasi Clavera, Jakub Pachocki, Jerry Tworek, Lorenz Kuhn, Lukasz Kaiser, Mark Chen, Max Schwarzer, Mostafa Rohaninejad, Nat McAleese, o3 contributors, Oleg M ¨urk, Rhythm Garg, Rui Shu, ...

  16. [16]

    URLhttps: //arxiv.org/abs/2502.06807. Guilherme Penedo, Anton Lozhkov, Hynek Kydl ´ıˇcek, Loubna Ben Allal, Edward Beeching, Agust´ın Piqueres Lajar ´ın, Quentin Gallou ´edec, Nathan Habib, Lewis Tunstall, and Lean- dro von Werra. CodeForces Dataset (open-r1/codeforces).https://huggingface.co/ datasets/open-r1/codeforces,

  17. [17]

    Accessed: 2025-10-

    Hugging Face repository. Accessed: 2025-10-

  18. [19]
  19. [20]

    Hung To, Minh Nguyen, and Nghi Bui

    URLhttps://qwenlm.github.io/blog/qwen3/. Hung To, Minh Nguyen, and Nghi Bui. Functional overlap reranking for neural code generation. In Findings of the Association for Computational Linguistics: ACL 2024, pp. 3686–3704,

  20. [21]

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H

    URLhttps://arxiv.org/abs/2507.17797. Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V . Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. InAdvances in Neural Information Processing Systems (NeurIPS) 35, pp. 24824–24837. Curran Associates, Inc.,

  21. [22]

    URLhttps://arxiv.org/abs/2408.00724. An Yang, Beichen Zhang, Binyuan Hui, Bofei Gao, Bowen Yu, Chengpeng Li, Dayiheng Liu, Jianhong Tu, Jingren Zhou, Junyang Lin, Keming Lu, Mingfeng Xue, Runji Lin, Tianyu Liu, Xingzhang Ren, and Zhenru Zhang. Qwen2.5-math technical report: Toward mathematical ex- pert model via self-improvement.arXiv preprint arXiv:2409.12122,

  22. [23]

    Lunjun Zhang, Arian Hosseini, Hritik Bansal, Mehran Kazemi, Aviral Kumar, and Rishabh Agar- wal

    URLhttps://arxiv.org/ abs/2305.14591. Lunjun Zhang, Arian Hosseini, Hritik Bansal, Mehran Kazemi, Aviral Kumar, and Rishabh Agar- wal. Generative verifiers: Reward modeling as next-token prediction, 2025a. URLhttps: //arxiv.org/abs/2408.15240. Qiyuan Zhang, Fuyuan Lyu, Zexu Sun, Lei Wang, Weixu Zhang, Wenyue Hua, Haolun Wu, Zhihan Guo, Yufei Wang, Niklas ...

  23. [24]

    passed” if the input is valid, otherwise “failed

    12 A PROMPTS Solution Generation Prompt You are an expert competitive programmer. You will be given a problem statement, test case con- straints and example test inputs and outputs. Please reason step by step about the solution, then provide a complete implementation in C++17. You should correctly implement the routine(s) described in Implementation Detai...