pith. machine review for the scientific record.

arxiv: 2603.27844 · v2 · submitted 2026-03-29 · 💻 cs.CL

Recognition: 2 theorem links · Lean Theorem

Model Capability Dominates: Inference-Time Optimization Lessons from AIMO 3

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 21:36 UTC · model grok-4.3

classification 💻 cs.CL
keywords LLM reasoning · mathematical reasoning · majority voting · prompt engineering · inference optimization · model capability · IMO problems

The pith

Model capability dominates all tested inference-time optimizations for mathematical reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether assigning different reasoning strategies to multiple LLM attempts can improve performance beyond standard majority voting on hard math problems. Using three models of varying capability on 50 IMO-level problems from AIMO 3, it runs over 23 experiments with variants of a Diverse Prompt Mixer. Every prompt-level change fails to help, with high-temperature sampling already reducing error correlation enough that additional diversity hurts accuracy. Even at equal sample sizes, an 8-point gap persists between models, showing that inherent model strength outweighs optimization tricks. The remaining shortfall from majority voting to the pass@20 upper bound is due to selection rather than prompting, pointing to verifiers as a potential fix instead.
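
A minimal sketch of the binomial-voting model behind these numbers (it reappears in Figure 1), assuming attempts are i.i.d. Bernoulli(p̂) per problem and a problem counts as solved only on a strict majority; treating ties as failures is our assumption, chosen because it reproduces the figure's values:

```python
from math import comb

def expected_majority_score(p_hat: float, n: int, n_problems: int = 50) -> float:
    """Expected majority-vote score when each of n attempts is independently
    correct with probability p_hat and a problem is solved only if a strict
    majority (> n/2) of attempts is correct; ties count as failures (assumed)."""
    p_win = sum(comb(n, k) * p_hat**k * (1 - p_hat) ** (n - k)
                for k in range(n // 2 + 1, n + 1))
    return n_problems * p_win

# Figure 1's N=8 points: gpt-oss-120b (p_hat=0.69) vs gpt-oss-20b (p_hat=0.61).
print(round(expected_majority_score(0.69, 8), 1))  # ~39.4 (caption: 39.3)
print(round(expected_majority_score(0.61, 8), 1))  # ~30.9 (caption: 31.0)
```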

Core claim

Model capability dominates inference-time optimization. At equal N=8 an 8-point capability gap persists, and no tested optimization, including any Diverse Prompt Mixer variant, narrows it. The gap between the best majority-vote score of 42/50 and pass@20 of approximately 45.5 is selection loss, not prompt loss: a verifier-based selector could close it, but prompt engineering cannot.
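
For the pass@20 side of the comparison, the standard unbiased estimator (introduced with Codex and used by the repeated-sampling line of work cited as reference [1]) is pass@k = E[1 − C(n−c, k)/C(n, k)] over problems. A minimal sketch, noting that the paper's ≈45.5 figure is attributed to a host-posted analysis rather than computed in-house:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021): with n sampled attempts on
    a problem, c of them correct, this is the probability that a random size-k
    subset contains at least one correct attempt. Averaging over the 50 problems
    and scaling by 50 gives a score comparable to the ~45.5 pass@20 cited here."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct attempt
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g., 30 attempts on a problem with 3 correct: chance a 20-attempt draw hits.
print(round(pass_at_k(30, 3, 20), 2))  # ~0.97
```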

What carries the argument

The Diverse Prompt Mixer, a method to assign different reasoning strategies to different voters in majority voting to reduce correlated errors.
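
A minimal sketch of this mechanism as described; the strategy labels follow the appendix prompt names that surface later in the page (E2, E3, E12), but the prompt bodies and the generate() call are illustrative stand-ins, not the paper's templates:

```python
from collections import Counter

# Illustrative stand-ins for the paper's Appendix B system prompts.
STRATEGIES = {
    "E1": "Enumerate small cases, find a pattern, conjecture, then prove.",
    "E2": "Work backwards from the constraints the answer must satisfy.",
    "E3": "Classify the problem type first, then apply canonical techniques.",
    "E12": "Start with brute-force code on small cases, then generalize.",
}

def diverse_prompt_mixer(problem: str, mix: dict[str, int], generate) -> str:
    """Assign different strategy prompts to different voters, then majority-vote
    over final answers. `mix` maps strategy id -> voter count (cf. the ablation's
    'conservative' 5+1+1+1 vs 'aggressive' splits); `generate(system, user)` is a
    hypothetical LLM call returning a boxed final-answer string."""
    answers = []
    for strategy_id, n_voters in mix.items():
        system = STRATEGIES[strategy_id]
        answers.extend(generate(system, problem) for _ in range(n_voters))
    return Counter(answers).most_common(1)[0][0]
```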

If this is right

  • High-temperature sampling suffices to decorrelate errors in LLM reasoning attempts.
  • Weaker prompt strategies reduce accuracy more than they reduce error correlation.
  • The performance gap between models cannot be bridged by prompt engineering alone.
  • A learned verifier could select the best answer from multiple attempts to close the gap to pass@20 performance (a minimal selection sketch follows this list).
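
The last bullet is a direction the paper points to rather than something it implements. A minimal best-of-N selection sketch, assuming a hypothetical verifier_score model that rates candidate solutions:

```python
def verifier_select(attempts: list[str], verifier_score) -> str:
    """Best-of-N selection with a learned verifier, the route suggested for
    closing the gap from the best majority vote (42/50) to pass@20 (~45.5).
    `verifier_score(solution) -> float` is a hypothetical scoring model; note
    that selection can never exceed pass@N, since the verifier only chooses
    among answers already present in the sample."""
    return max(attempts, key=verifier_score)
```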

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar capability dominance may apply to other complex reasoning tasks beyond math competitions.
  • Resources are better allocated to improving base model training than to developing new inference-time prompt strategies.
  • Future work could test whether combining model scaling with verifier selection achieves near-upper-bound performance.

Load-bearing premise

The 23+ experiments, spanning three models and specific Diverse Prompt Mixer variants, sufficiently cover the space of possible inference-time interventions.

What would settle it

Demonstrating an inference-time optimization that allows a weaker model to match the performance of a stronger model on the same set of 50 problems at equal sample size.

Figures

Figures reproduced from arXiv: 2603.27844 by Natapong Nitarach.

Figure 1. Model capability dominates. Per-attempt accuracy p̂ vs. expected majority-vote score under binomial voting at N=3, 8, 16, 32. Seven data points across four model families. At equal N=8, the 8-point gap between gpt-oss-120b (p̂=0.69, score 39.3) and gpt-oss-20b (p̂=0.61, score 31.0) dwarfs every prompt optimization (±2 points). Scaling N beyond compute budget backfires: gpt-oss-20b drops from 31.0 (N=8) to…
Figure 2. Prompt diversity vs. score on gpt-oss-120b. Blue circles: individual baseline runs (N=21). Black diamonds: configuration means. Shaded band: baseline ±1σ. More diversity monotonically degrades performance.

    Temperature           LB score   Δ from baseline
    T=0.5                 38         −1.3
    T=0.8                 40         +0.7
    T=1.0                 39.3       — (baseline, 21-run mean)
    T=1.2, min_p=0.03     37         −2.3

Figure 3. Per-problem ρ̂ vs. p̂ across four models. Circles: Qwen (N=16); squares: gpt-oss-120b (N=8); filled diamonds: gpt-oss-20b (N=8); hollow triangles/diamonds: Nemotron-Super/gpt-oss-20b (N=3, forced ρ̂=−0.500). All 19 computable points show ρ̂ < 0. Mean ρ̂ = −0.122 for N≥7. No correlation headroom for diversity strategies.
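
One caption detail worth unpacking: the "forced ρ̂ = −0.500" at N=3 is an algebraic floor, not a measurement. Assuming ρ̂ denotes a common pairwise correlation of per-attempt correctness among N exchangeable attempts, nonnegativity of the vote-total variance bounds it below:

```latex
% Variance of the vote total for N exchangeable Bernoulli(p) attempts
% with common pairwise correlation \rho must be nonnegative:
\[
  \operatorname{Var}\!\Big(\sum_{i=1}^{N} X_i\Big)
    = N\,p(1-p)\,\bigl(1 + (N-1)\rho\bigr) \;\ge\; 0
  \quad\Longrightarrow\quad
  \rho \;\ge\; -\tfrac{1}{N-1},
\]
% which is -1/2 at N=3, matching the caption's "forced" points.
```
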
Figure 4. Qwen3.5-35B-A3B ablation on 10 local problems. Blue: baseline (8/10). Orange: underperform (7/10). Red: crashed. Nothing improves beyond baseline.
Figure 5. Complete ablation across all experiments. Blue: baseline (39.3). Orange: interventions.
Figure 6. AIMO competition progression. Orange bars: winner/top LB scores (29…
Figure 7. Left: score distributions. Baseline (µ=39.3, σ=1.7, blue) vs. Mixer (µ≈39.0, σ≈2.0, orange). Red dashed line: target score 42. Right: cumulative probability of max ≥ 42 over K submissions. Baseline: p≈0.056 per run. Mixer: p≈0.037. Dotted line at 42 submissions used. Adjacent body text (Section 9, Selection Loss): a host-posted analysis reports gpt-oss-120b at pass@20 ≈ 45.5 on the AIMO 3 private set (95% bootstrap CI [43, 48])…
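
The right panel's curves are just the complement of repeated failure. A minimal check, assuming submissions are independent draws with the per-run hit probabilities in the caption:

```python
def p_hit_target(p_per_run: float, k_submissions: int) -> float:
    """Probability that at least one of K i.i.d. runs reaches the target score:
    1 - (1 - p)^K (the quantity plotted in Figure 7, right)."""
    return 1.0 - (1.0 - p_per_run) ** k_submissions

# Caption's per-run probabilities of scoring >= 42, over the 42 submissions used:
print(round(p_hit_target(0.056, 42), 2))  # ~0.91 baseline
print(round(p_hit_target(0.037, 42), 2))  # ~0.79 Mixer
```
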
Original abstract

Majority voting over multiple LLM attempts improves mathematical reasoning, but correlated errors limit the effective sample size. A natural fix is to assign different reasoning strategies to different voters. The approach, Diverse Prompt Mixer, is tested on the AIMO 3 competition: 3 models, 23+ experiments, 50 IMO-level problems, one H100 80 GB, 5-hour limit. Every prompt-level intervention fails. High-temperature sampling already decorrelates errors; weaker strategies reduce accuracy more than they reduce correlation. Across an 8-point capability gap at equal N=8 and every optimization tested, model capability dominates. The gap between the best majority-vote score (42/50) and pass@20 (~45.5) is selection loss, not prompt loss. A verifier-based selector could close it. Prompt engineering cannot.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that model capability dominates inference-time optimization for mathematical reasoning. On the AIMO 3 benchmark with 50 IMO-level problems, three models, and 23+ experiments, every tested prompt-level intervention (Diverse Prompt Mixer variants) fails to outperform high-temperature majority voting. The gap between the best majority-vote accuracy (42/50) and pass@20 (~45.5) is attributed to selection loss rather than prompt loss, leading to the conclusion that prompt engineering cannot close capability gaps and that a verifier-based selector would be needed.

Significance. If the results hold, the work provides concrete evidence that scaling model capability yields larger gains than prompt diversification strategies under fixed inference budgets. This has practical implications for compute allocation in reasoning systems and motivates investment in post-generation selection mechanisms over further prompt engineering.

major comments (2)
  1. [Abstract] The strong claim that 'prompt engineering cannot' close the gap rests on the assertion that the 23+ Diverse Prompt Mixer variants are representative of feasible inference-time interventions. The manuscript provides no argument or coverage analysis showing that untested strategies (e.g., adaptive strategy switching or verifier-guided selection) could not simultaneously improve accuracy and reduce error correlation beyond temperature sampling alone.
  2. [Experimental results] The reported scores (42/50 majority vote vs. ~45.5 pass@20) lack error bars, statistical significance tests, or details on problem difficulty distribution and model-specific variance. Without these, it is difficult to assess whether the 8-point capability-gap dominance is robust or sensitive to the particular 50-problem sample.
minor comments (1)
  1. [Abstract] The parenthetical 'one H100 80 GB, 5-hour limit' is useful but should be expanded with exact token budgets, temperature schedules, and how pass@20 is computed under the same N=8 constraint used for majority voting.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and will incorporate revisions to improve the manuscript's precision and statistical rigor.

Point-by-point responses
  1. Referee: [Abstract] The strong claim that 'prompt engineering cannot' close the gap rests on the assertion that the 23+ Diverse Prompt Mixer variants are representative of feasible inference-time interventions. The manuscript provides no argument or coverage analysis showing that untested strategies (e.g., adaptive strategy switching or verifier-guided selection) could not simultaneously improve accuracy and reduce error correlation beyond temperature sampling alone.

    Authors: We appreciate this observation. The 23+ Diverse Prompt Mixer variants were explicitly designed to span a broad range of prompt diversification techniques, including combinations of distinct reasoning strategies, template variations, and diversity injections. These represent the core of feasible prompt-level interventions under fixed inference budgets. We distinguish pure prompt engineering from mechanisms that incorporate external verifiers or dynamic selection, which the manuscript already identifies as a separate direction for closing the remaining gap. We will revise the abstract to qualify the claim as applying to the tested class of prompt diversification strategies, thereby avoiding overgeneralization while preserving the core empirical finding that such interventions did not outperform high-temperature majority voting. revision: yes

  2. Referee: [Experimental results] The reported scores (42/50 majority vote vs. ~45.5 pass@20) lack error bars, statistical significance tests, or details on problem difficulty distribution and model-specific variance. Without these, it is difficult to assess whether the 8-point capability-gap dominance is robust or sensitive to the particular 50-problem sample.

    Authors: We agree that the current presentation would benefit from additional statistical support. In the revised manuscript we will add bootstrap-derived error bars and 95% confidence intervals for all reported accuracies, along with paired significance tests comparing the capability gap against prompt interventions. We will also include a supplementary breakdown of problem difficulty (based on topic coverage and estimated hardness) and model-specific variance across the 50 problems to demonstrate that the observed 8-point dominance holds consistently rather than being driven by a small subset of instances. revision: yes
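
A minimal sketch of the percentile bootstrap the response promises, assuming per-problem 0/1 correctness records; the resampling unit (problems) and the 10,000-replicate count are our assumptions:

```python
import random

def bootstrap_ci(correct: list[int], n_boot: int = 10_000,
                 alpha: float = 0.05, seed: int = 0) -> tuple[int, int]:
    """Percentile-bootstrap (1 - alpha) CI for a score out of len(correct)
    problems, resampling problems with replacement."""
    rng = random.Random(seed)
    n = len(correct)
    scores = sorted(sum(rng.choices(correct, k=n)) for _ in range(n_boot))
    lo = scores[int(alpha / 2 * n_boot)]
    hi = scores[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# e.g., a 42/50 majority-vote run:
print(bootstrap_ci([1] * 42 + [0] * 8))  # roughly (37, 47)
```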

Circularity Check

0 steps flagged

No significant circularity; purely empirical measurements

Full rationale

The paper reports results from 23+ experiments on three models and 50 AIMO-3 problems, directly measuring accuracies for majority voting, high-temperature sampling, and Diverse Prompt Mixer variants. The central claim that model capability dominates follows from observed score gaps (e.g., 42/50 vs ~45.5) without any equations, fitted parameters, self-referential definitions, or load-bearing self-citations that reduce the conclusion to its inputs by construction. All inferences are data-driven comparisons of measured performance under different conditions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are introduced; the work is an empirical comparison of existing techniques on a fixed benchmark.

pith-pipeline@v0.9.0 · 5434 in / 970 out tokens · 45429 ms · 2026-05-14T21:36:41.570677+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: the paper's claim is directly supported by a theorem in the formal canon.
  • supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: the paper appears to rely on the theorem as machinery.
  • contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

28 extracted references · 28 canonical work pages · 2 internal anchors

  1. [1]

    Large Language Monkeys: Scaling Inference Compute with Repeated Sampling

    Bradley Brown, Jordan Juravsky, Ryan Ehrlich, Ronald Clark, Quoc V. Le, Christopher Ré, and Azalia Mirhoseini. arXiv preprint arXiv:2407.21787, 2024. https://arxiv.org/abs/2407.21787

  2. [2]

    Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters

    Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. arXiv preprint arXiv:2408.03314, 2024. https://arxiv.org/abs/2408.03314

  3–28. Appendix spill, not citations

    The remaining extracted entries are not references: extraction spilled the paper's Appendix B prompt templates (the Diverse Prompt Mixer's reasoning strategies) into this list. The recoverable strategy names are a five-phase UNDERSTAND / EXPLORE / PLAN / EXECUTE / VERIFY prompt, an ENUMERATE / PATTERN / CONJECTURE / PROVE / VERIFY prompt, Work Backwards (E2), Classify Then Solve (E3), Code-First (E12), and Formalize-First (EF1), each instructing the model to place its final answer inside \boxed{}. Fragments of further genuine bibliography entries also survive in the spill around entries [1] and [2]: an Anthropic engineering post on infrastructure noise (infrastructure config shifts scores by ~6pp on Terminal-Bench 2.0, https://www.anthropic.com/engineering/infrastructure-noise); Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou, "Self-consistency improves chain of thought reasoning in language models," ICLR 2023, https://arxiv.org/abs/2203.11171; and a truncated XTX Investments entry.