Scaling Test-Time Compute to Achieve IOI Gold Medal with Open-Weight Models
Pith reviewed 2026-05-18 06:17 UTC · model grok-4.3
The pith
GenCluster enables an open-weight model to reach IOI gold medal performance by scaling test-time compute.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
GenCluster attains IOI gold-level performance using the open-weight model gpt-oss-120b by combining large-scale generation, behavioral clustering, ranking, and round-robin submission to efficiently explore diverse solution spaces under limited validation budgets.
What carries the argument
GenCluster, a test-time compute framework that uses behavioral clustering to group solutions by behavior and round-robin submission to test diverse candidates within budget limits.
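As an illustration of grouping solutions by behavior, here is a minimal sketch that hashes each candidate's outputs on a shared set of test inputs and groups candidates with identical hashes. The function names (`behavior_signature`, `cluster_by_behavior`), the whitespace normalization, the size-based ordering, and the use of Python callables as stand-ins for compiled C++ submissions are all assumptions made for this example, not details confirmed by the paper.

```python
import hashlib
from collections import defaultdict
from typing import Callable, Dict, List, Sequence


def behavior_signature(outputs: Sequence[str]) -> str:
    """Hash the ordered outputs of one program into a single signature."""
    h = hashlib.sha256()
    for out in outputs:
        # Assumption: strip surrounding whitespace so formatting noise
        # does not split otherwise identical behaviors.
        h.update(out.strip().encode("utf-8"))
        # Separator so that ("ab", "c") and ("a", "bc") hash differently.
        h.update(b"\x00")
    return h.hexdigest()


def cluster_by_behavior(
    programs: Dict[str, Callable[[str], str]],
    test_inputs: Sequence[str],
) -> List[List[str]]:
    """Group program names whose outputs agree on every test input."""
    groups = defaultdict(list)
    for name, run in programs.items():
        outputs = [run(x) for x in test_inputs]
        groups[behavior_signature(outputs)].append(name)
    # Assumption: order clusters largest first, ties broken by name.
    return sorted((sorted(g) for g in groups.values()),
                  key=lambda g: (-len(g), g))


# Hypothetical candidates: two compute a sum, one computes a maximum.
candidates = {
    "sol_a": lambda x: str(sum(map(int, x.split()))),
    "sol_b": lambda x: str(sum(int(t) for t in x.split())),
    "sol_c": lambda x: str(max(map(int, x.split()))),
}
clusters = cluster_by_behavior(candidates, ["1 2", "3 4 5", "7"])
print(clusters)
```

Note that the third input, `"7"`, does not separate the sum from the maximum; only the first two inputs do. Clusters are only as fine-grained as the test inputs allow.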
If this is right
- Performance scales consistently with additional test-time compute resources.
- The performance gap between open-weight and closed models narrows on IOI tasks.
- Transparent and reproducible evaluation of LLM reasoning reaches gold-medal levels for the first time.
- Open models become viable for elite competitive programming benchmarks through inference scaling alone.
Where Pith is reading between the lines
- The same clustering and submission strategy could transfer to other high-stakes reasoning tasks that involve many candidate outputs and limited verification budgets.
- Future improvements might focus on making the behavioral clustering step itself more compute-efficient to stretch smaller validation budgets further.
- If the method generalizes, organizations without access to proprietary models could still compete at the top of programming olympiads by investing in test-time resources.
Load-bearing premise
Behavioral clustering and round-robin submission can reliably identify and prioritize correct solutions from a large generated set under the specific validation budget and judging constraints of IOI problems.
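One way to picture round-robin submission under a fixed budget is the sketch below: take one candidate from each ranked cluster in turn, then cycle back for second candidates, until the budget runs out. The function name `round_robin_order`, the cluster contents, and the assumption that candidates inside a cluster are already ordered are hypothetical; the paper's actual procedure may differ.

```python
from typing import List, Sequence


def round_robin_order(
    ranked_clusters: Sequence[Sequence[str]],
    budget: int,
) -> List[str]:
    """Pick one candidate per cluster in rank order, cycling through
    the clusters until the budget is spent or candidates run out."""
    order: List[str] = []
    max_depth = max((len(c) for c in ranked_clusters), default=0)
    depth = 0
    while len(order) < budget and depth < max_depth:
        for cluster in ranked_clusters:
            if depth < len(cluster):
                order.append(cluster[depth])
                if len(order) == budget:
                    break
        depth += 1
    return order


# Hypothetical ranked clusters, best cluster first.
ranked = [["a1", "a2", "a3"], ["b1"], ["c1", "c2"]]
print(round_robin_order(ranked, budget=5))
```

With a budget of 5, the order is `a1, b1, c1, a2, c2`: every cluster gets one attempt before any cluster gets a second. This is what lets a limited budget cover behaviorally distinct candidates rather than near-duplicates from the top cluster.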
What would settle it
Running the full GenCluster pipeline on the official IOI 2025 problem set and measuring a final score below the gold-medal threshold despite the claimed compute allocation.
Original abstract
Competitive programming has become a rigorous benchmark for evaluating the reasoning and problem-solving capabilities of large language models (LLMs). The International Olympiad in Informatics (IOI) stands out as one of the most prestigious annual competitions in competitive programming and has become a key benchmark for comparing human and AI-level programming ability. While several proprietary models have been claimed to achieve gold medal-level performance at the IOI, often with undisclosed methods, achieving comparable results with open-weight models remains a significant challenge. In this paper, we present GenCluster, a scalable and reproducible test-time compute framework that attains IOI gold-level performance using open-weight models. It combines large-scale generation, behavioral clustering, ranking, and a round-robin submission strategy to efficiently explore diverse solution spaces under limited validation budgets. Our experiments show that the performance of our proposed approach scales consistently with available compute, narrowing the gap between open and closed systems. Notably, we will show that GenCluster can achieve a gold medal at IOI 2025 for the first time with an open-weight model gpt-oss-120b, setting a new benchmark for transparent and reproducible evaluation of reasoning in LLMs.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents GenCluster, a test-time compute framework for competitive programming that combines large-scale solution generation with an open-weight model (gpt-oss-120b), behavioral clustering of program outputs, ranking, and round-robin submission under limited validation budgets. It claims this approach scales consistently with compute and achieves IOI gold-medal performance for the first time with an open-weight model on IOI 2025 problems.
Significance. If the central empirical claim holds under rigorous validation, the work would be significant for establishing a reproducible, open-weight baseline for scaling test-time compute on a high-stakes reasoning benchmark like IOI, where prior gold-level results have been limited to proprietary systems. The scaling observation and explicit pipeline could serve as a useful reference point for future studies on LLM reasoning limits.
major comments (3)
- [§4] §4 (Methods), behavioral clustering subsection: the pipeline assumes that equivalence on the observed validation inputs (used for clustering and round-robin) implies correctness on the full hidden test distribution. IOI problems employ adversarial hidden cases and partial scoring; the manuscript provides no ablation or analysis showing that the chosen distance/signature metric groups only fully correct programs together, leaving the gold-medal claim vulnerable to the possibility that high-ranking clusters contain solutions that fail on unseen tests.
- [§5] §5 (Experiments), results tables and scaling plots: the reported performance scaling with compute is presented without error bars, multiple random seeds, or breakdown by problem difficulty/subtask. It is therefore impossible to assess whether the gold-medal threshold is robust or driven by a small number of favorable problems or lucky submissions within the fixed validation budget.
- [§3.3] §3.3 (Round-robin submission strategy): the description does not specify the exact validation budget per problem, the number of submissions allowed, or how ties in behavioral clusters are broken. These parameters are load-bearing for the claim that the method efficiently explores the solution space under IOI constraints.
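The error-bar request in the second major comment can be made concrete with a short sketch that reports the mean score and standard error across seeds for each compute level. The function name `summarize` is hypothetical, and the scores below are placeholder values chosen for easy arithmetic. They are not results from the paper.

```python
import math
import statistics
from typing import Dict, List, Tuple


def summarize(
    scores_by_budget: Dict[int, List[float]],
) -> Dict[int, Tuple[float, float]]:
    """Return (mean, standard error of the mean) per compute budget."""
    summary = {}
    for budget, scores in sorted(scores_by_budget.items()):
        mean = statistics.mean(scores)
        if len(scores) > 1:
            sem = statistics.stdev(scores) / math.sqrt(len(scores))
        else:
            # A single seed gives no estimate of run-to-run variation.
            sem = float("nan")
        summary[budget] = (mean, sem)
    return summary


# PLACEHOLDER data: generations per problem -> total score per seed.
# These numbers are invented for illustration only.
placeholder_scores = {
    100: [200.0, 210.0, 220.0],
    1000: [300.0, 300.0, 300.0],
}
summary = summarize(placeholder_scores)
for budget, (mean, sem) in summary.items():
    print(f"budget={budget}: {mean:.1f} +/- {sem:.2f}")
```

A single run per compute level, as the referee notes, corresponds to the `nan` branch: the mean is reported but its spread is unknown.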
minor comments (2)
- [Introduction] The abstract and introduction cite prior proprietary IOI results but do not include a direct comparison table of open vs. closed model performance on the same IOI 2025 problem set.
- [§4] Notation for the clustering distance function and ranking score is introduced without an explicit equation or pseudocode, making the pipeline harder to reproduce from the text alone.
Simulated Author's Rebuttal
We thank the referee for their constructive comments on our work. We address each of the major comments below and outline the revisions we will make to the manuscript.
-
Referee: [§4] §4 (Methods), behavioral clustering subsection: the pipeline assumes that equivalence on the observed validation inputs (used for clustering and round-robin) implies correctness on the full hidden test distribution. IOI problems employ adversarial hidden cases and partial scoring; the manuscript provides no ablation or analysis showing that the chosen distance/signature metric groups only fully correct programs together, leaving the gold-medal claim vulnerable to the possibility that high-ranking clusters contain solutions that fail on unseen tests.
Authors: We agree that this is an important consideration. Behavioral clustering is based on program outputs on the provided validation inputs, which in IOI are typically comprehensive but not exhaustive of all hidden cases. To strengthen the manuscript, we will include additional analysis in the revised version, such as examining the failure rates of top clusters on post-competition revealed test cases where available, and discuss the limitations of validation-based clustering. This will provide more transparency on the robustness of the approach.
Revision: partial
-
Referee: [§5] §5 (Experiments), results tables and scaling plots: the reported performance scaling with compute is presented without error bars, multiple random seeds, or breakdown by problem difficulty/subtask. It is therefore impossible to assess whether the gold-medal threshold is robust or driven by a small number of favorable problems or lucky submissions within the fixed validation budget.
Authors: The referee is correct that error bars and multi-seed results would improve the presentation. Given the substantial compute required for each scaling point with the 120b model, we conducted experiments with fixed random seeds for reproducibility. In the revision, we will add a breakdown of performance by problem difficulty and subtask, and include error bars from repeated runs at lower compute scales to demonstrate consistency. We believe this will address the concern about robustness.
Revision: yes
-
Referee: [§3.3] §3.3 (Round-robin submission strategy): the description does not specify the exact validation budget per problem, the number of submissions allowed, or how ties in behavioral clusters are broken. These parameters are load-bearing for the claim that the method efficiently explores the solution space under IOI constraints.
Authors: We appreciate this observation. The round-robin strategy allocates a budget of 100 submissions per problem, with ties broken by selecting the cluster with the highest average validation score. We will update §3.3 with these specific details to make the experimental setup fully reproducible.
Revision: yes
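The analysis promised in the first response, failure rates of top clusters on revealed test cases, could be computed along the following lines. This is a sketch under stated assumptions: the function name `cluster_failure_rate`, the data layout (one pass/fail list per program), and the example values are all hypothetical.

```python
from typing import Dict, List, Sequence


def cluster_failure_rate(
    cluster: Sequence[str],
    hidden_results: Dict[str, List[bool]],
) -> float:
    """Fraction of cluster members that fail at least one hidden test."""
    if not cluster:
        raise ValueError("cluster must not be empty")
    failures = sum(1 for name in cluster if not all(hidden_results[name]))
    return failures / len(cluster)


# Hypothetical top cluster: all four members agreed on the validation
# inputs, but one of them fails a revealed hidden test.
top_cluster = ["s1", "s2", "s3", "s4"]
hidden = {
    "s1": [True, True],
    "s2": [True, False],
    "s3": [True, True],
    "s4": [True, True],
}
rate = cluster_failure_rate(top_cluster, hidden)
print(rate)
```

A non-zero rate for a top-ranked cluster would be direct evidence for the referee's concern that agreement on validation inputs does not guarantee correctness on hidden tests.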
Circularity Check
No significant circularity in the empirical scaling claim
The paper presents GenCluster as an empirical test-time compute framework whose IOI gold-medal performance with gpt-oss-120b is reported as an experimental outcome of scaling generation, behavioral clustering, ranking, and round-robin submission under fixed validation budgets. No equations, self-definitional loops, fitted parameters renamed as predictions, or load-bearing self-citations are invoked to derive the result by construction from its own inputs. The central claim therefore remains an observed scaling relationship rather than a tautological reduction, consistent with the reader's low circularity assessment.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
GENCLUSTER consists of four stages: (1) parallel generation, (2) behavioral clustering, (3) ranking with tournament, and (4) submission... cluster solutions based on their outputs... hash values of the outputs... tournament between clusters... round-robin submission strategy
-
IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
We demonstrate... gold-level performance at IOI 2025... with open-weight model gpt-oss-120b
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 2 Pith papers
-
Ensemble-Based Uncertainty Estimation for Code Correctness Estimation
Ensemble Semantic Entropy improves correlation with code correctness over single-model methods and powers a cascading scaling system that cuts FLOPs by 64.9% while preserving performance on LiveCodeBench.
-
Majority Voting for Code Generation
Functional Majority Voting selects code by runtime agreement on tests, boosting LiveCodeBench performance and serving as an aggregation method for label-free test-time RL without exceeding base model limits.