TMAS: Scaling Test-Time Compute via Multi-Agent Synergy

Bryan Dai; Chuan Hao; Feng Chang; George Wu; Jian Yang; Ming Yang; Nan Jing; Qing Yi; Ran Tao; Yuan Wei

arxiv: 2605.10344 · v2 · pith:GBW53235new · submitted 2026-05-11 · 💻 cs.AI

Pith reviewed 2026-05-20 22:29 UTC · model grok-4.3

classification 💻 cs.AI

keywords test-time scaling · multi-agent systems · large language models · reasoning benchmarks · hierarchical memories · hybrid reward training · reinforcement learning

The pith

TMAS organizes multi-agent inference with hierarchical memories to scale test-time compute more effectively than prior methods.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces TMAS to overcome weak coordination among parallel reasoning trajectories and reliance on noisy history in existing test-time scaling for large language models. It structures inference as collaboration among specialized agents that share information across trajectories and refinement rounds. Hierarchical memories are central: an experience bank reuses reliable low-level conclusions while a guideline bank records high-level strategies to steer away from redundant patterns. A hybrid reward reinforcement learning scheme is added to maintain core reasoning skills, improve memory use, and promote new exploration. Experiments on reasoning benchmarks demonstrate stronger iterative scaling and greater stability across rounds than baseline approaches.

Core claim

TMAS achieves stronger iterative scaling than existing test-time scaling baselines through three mechanisms: (1) it organizes inference as a collaborative process among specialized agents, with structured information flow across agents, trajectories, and refinement iterations; (2) it introduces hierarchical memories, in which an experience bank reuses reliable low-level intermediate conclusions and a guideline bank records high-level strategies; and (3) it applies a hybrid reward reinforcement learning scheme that preserves basic reasoning capability while enhancing experience utilization and exploration.

What carries the argument

Hierarchical memories consisting of an experience bank for low-level reliable intermediate conclusions and a guideline bank for high-level strategies, together with structured cross-agent information flow and a hybrid reward reinforcement learning scheme.
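A minimal sketch of how the two banks might be represented, assuming plain in-memory lists. The class names `ExperienceBank` and `GuidelineBank`, the `verified` flag, and the exact-match novelty check are hypothetical and chosen only for illustration; in the paper the banks are maintained by LLM agents rather than by string comparison.

```python
from dataclasses import dataclass, field


@dataclass
class ExperienceBank:
    """Low-level memory: intermediate conclusions that passed verification."""
    entries: list[str] = field(default_factory=list)

    def add(self, conclusion: str, verified: bool) -> None:
        # Assumed policy: keep only verifier-confirmed, non-duplicate entries.
        if verified and conclusion not in self.entries:
            self.entries.append(conclusion)

    def as_context(self) -> str:
        # Render the bank as a bullet list to prepend to a solver prompt.
        return "\n".join(f"- {e}" for e in self.entries)


@dataclass
class GuidelineBank:
    """High-level memory: strategies that have already been attempted."""
    strategies: list[str] = field(default_factory=list)

    def is_novel(self, strategy: str) -> bool:
        # Stand-in for an LLM-based comparison: case-insensitive exact match.
        key = strategy.strip().lower()
        return all(key != s.strip().lower() for s in self.strategies)

    def record(self, strategy: str) -> bool:
        """Record a strategy; return True only if it was new."""
        if self.is_novel(strategy):
            self.strategies.append(strategy)
            return True
        return False


exp = ExperienceBank()
exp.add("Lemma A holds for all n >= 2", verified=True)
exp.add("Lemma B holds for all n", verified=False)  # dropped: not verified

guide = GuidelineBank()
first = guide.record("Induction on n")
repeat = guide.record("induction on n ")  # same strategy, so not recorded
```

The split mirrors the two roles described above: the experience bank feeds reusable facts forward, while the guideline bank exists mainly to say "this direction was already tried".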

Load-bearing premise

Hierarchical memories and structured cross-agent information flow can reliably balance exploration and exploitation without introducing coordination overhead or propagating noise.

What would settle it

Running the same reasoning benchmarks and finding that TMAS produces no stronger iterative scaling or that hybrid reward training adds no stability improvement would falsify the central claim.

Figures

Figures reproduced from arXiv: 2605.10344 by Bryan Dai, Chuan Hao, Feng Chang, George Wu, Jian Yang, Ming Yang, Nan Jing, Qing Yi, Ran Tao, Yuan Wei.

**Figure 1.** Small backbones approach frontier models on IMO-AnswerBench. We compare …

**Figure 2.** Effect of RL training on iterative test-time scaling. TMAS+Vanilla-RL means training …

**Figure 2.** Overview of the TMAS framework. For each problem, TMAS generates multiple …

**Figure 3.** Sensitivity analysis of TMAS. Panels (a–b), (c–d), and (e–f) show the impacts of …

**Figure 3.** Effect of RL training on iterative test-time scaling. TMAS+Vanilla-RL means training …

**Figure 4.** Evaluation results on AIME26 and HMMT-25-Nov over 12 iterations.

**Figure 4.** Sensitivity analysis of TMAS. Panels (a–b), (c–d), and (e–f) show the impacts of …

**Figure 5.** Relationship between the exploration coefficient and the total count of unique solution …

**Figure 5.** Evaluation results on AIME26 and HMMT-25-Nov over 12 iterations.

**Figure 6.** Verification score dynamics across TTS iterations on IMO-AnswerBench-50. Problems …

**Figure 7.** Distribution of per-problem average verification scores on IMO-AnswerBench-50. Each …

**Figure 8.** Comparison of wrong solution pattern and correct solution pattern.

**Figure 9.** Comparison of wrong solution pattern and correct solution pattern.

Original abstract

Test-time scaling has become an effective paradigm for improving the reasoning ability of large language models by allocating additional computation during inference. Recent structured approaches have further advanced this paradigm by organizing inference across multiple trajectories, refinement rounds, and verification-based feedback. However, existing structured test-time scaling methods either weakly coordinate parallel reasoning trajectories or rely on noisy historical information without explicitly deciding what should be retained and reused, limiting their ability to balance exploration and exploitation. In this work, we propose TMAS, a framework for scaling test-time compute via multi-agent synergy. TMAS organizes inference as a collaborative process among specialized agents, enabling structured information flow across agents, trajectories, and refinement iterations. To support effective cross-trajectory collaboration, TMAS introduces hierarchical memories: the experience bank reuses low-level reliable intermediate conclusions and local feedback, while the guideline bank records previously explored high-level strategies to steer subsequent rollouts away from redundant reasoning patterns. Furthermore, we design a hybrid reward reinforcement learning scheme tailored to TMAS, which jointly preserves basic reasoning capability, enhances experience utilization, and encourages exploration beyond previously attempted solution strategies. Extensive experiments on challenging reasoning benchmarks show that TMAS achieves stronger iterative scaling than existing test-time scaling baselines, with hybrid reward training further improving scaling effectiveness and stability across iterations. Code and data are available at https://github.com/IQuestLab/tmas.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

TMAS adds hierarchical experience and guideline banks plus hybrid RL to multi-agent test-time scaling, but any scaling edge needs verification at equal total token cost.

The main thing to know is that TMAS organizes test-time compute as a multi-agent process with two memory structures: an experience bank that reuses low-level conclusions and a guideline bank that tracks high-level strategies to avoid repetition. It pairs this with a hybrid reward RL scheme meant to keep core reasoning intact while pushing for new exploration paths. The abstract positions this as fixing weak coordination and noisy reuse in prior structured methods, and the experiments claim better iterative scaling on reasoning benchmarks with added stability from the training approach. Code release helps here too for anyone who wants to inspect or extend the setup.

What works is the explicit design for cross-agent and cross-trajectory information flow. The banks give a concrete way to decide what to retain rather than dumping all history, and the hybrid reward tries to balance multiple objectives without collapsing to one. That feels like a practical engineering step for people already experimenting with multi-agent inference.

The soft spot is the compute accounting. The framework adds bank operations and message passing at every step, which costs tokens and latency that simpler baselines avoid. If the scaling curves compare at equal iteration counts instead of equal cumulative cost, the reported gains could trace to higher per-step budget rather than the claimed synergy. The abstract does not spell out normalized comparisons, so that detail matters for the central claim. The hybrid reward coefficients are also free parameters that could influence how general the results turn out.

This is the sort of paper that researchers working on test-time scaling or multi-agent LLM systems would want to read for the architecture ideas. It has enough of a distinct mechanism and benchmark results to merit a serious referee rather than a desk reject. Reviewers can press on the cost controls and ablations. I would send it to peer review.
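The compute-accounting concern can be made concrete with a small sketch that reads each method's accuracy at a shared cumulative token budget instead of a shared iteration index. The function name `accuracy_at_budget` is hypothetical, and every number below is invented for illustration; none of it is a result from the paper.

```python
from bisect import bisect_right


def accuracy_at_budget(cum_tokens, accuracy, budget):
    """Accuracy of the last completed iteration whose cumulative token
    cost does not exceed `budget`; None if even the first is too costly.

    `cum_tokens` must be sorted ascending, which holds for any
    cumulative count.
    """
    if len(cum_tokens) != len(accuracy):
        raise ValueError("curves must have equal length")
    idx = bisect_right(cum_tokens, budget)  # number of affordable iterations
    return accuracy[idx - 1] if idx > 0 else None


# Illustrative numbers only (NOT from the paper).
baseline_tokens = [10_000, 20_000, 30_000, 40_000]
baseline_acc = [0.40, 0.45, 0.48, 0.50]
multi_agent_tokens = [20_000, 40_000, 60_000, 80_000]  # costlier per round
multi_agent_acc = [0.46, 0.53, 0.57, 0.60]

budget = 40_000
# Comparing at iteration 2 ignores that one method spent twice the tokens.
at_equal_iteration = (baseline_acc[1], multi_agent_acc[1])
# Comparing at 40,000 tokens lets the cheaper method run more rounds.
at_equal_budget = (
    accuracy_at_budget(baseline_tokens, baseline_acc, budget),
    accuracy_at_budget(multi_agent_tokens, multi_agent_acc, budget),
)
```

In this made-up example the gap at iteration 2 is about 0.08, but it shrinks to about 0.03 once both methods are held to 40,000 tokens, which is the kind of difference a normalized comparison is meant to expose.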

Referee Report

2 major / 2 minor

Summary. The manuscript proposes TMAS, a multi-agent framework for scaling test-time compute in LLMs for reasoning tasks. Inference is organized as collaboration among specialized agents with structured cross-agent, cross-trajectory, and cross-iteration information flow. Key innovations are hierarchical memories—an experience bank that reuses low-level reliable conclusions and local feedback, and a guideline bank that records high-level strategies to avoid redundant patterns—plus a hybrid reward RL scheme that preserves base reasoning, improves experience utilization, and encourages exploration. Experiments on challenging reasoning benchmarks are reported to show stronger iterative scaling than existing test-time scaling baselines, with the hybrid reward further improving scaling effectiveness and stability.

Significance. If the empirical results survive proper controls for added coordination costs, the work would offer a concrete engineering advance in structured test-time scaling by explicitly managing what information is retained and reused across agents. The open release of code and data supports reproducibility and follow-on work.

major comments (2)

[Experiments] Experiments section: the claim of stronger iterative scaling than baselines (self-consistency, tree search, etc.) is load-bearing for the central contribution, yet the manuscript does not state whether scaling curves are normalized to equal cumulative token usage or API calls. The added per-iteration costs of experience-bank lookups, guideline-bank updates, and multi-agent message passing are not present in simpler baselines; without explicit equalization, any observed lift could be an artifact of higher per-iteration budget rather than the claimed balance of exploration and exploitation.
[Methods] Methods / hybrid reward description: the hybrid reward introduces free coefficients whose values are not ablated in detail. Because these coefficients directly control the trade-off among preserving base capability, experience utilization, and exploration, the stability and scaling improvements attributed to the hybrid scheme cannot be fully assessed without sensitivity analysis or default-value justification.

minor comments (2)

[Abstract] Abstract and §1: the limitations of prior methods are summarized at a high level; a concise comparison table (or bullet list) of coordination weaknesses versus TMAS mechanisms would improve readability.
[Methods] Notation for the two banks is introduced without an explicit diagram or pseudocode snippet showing update and retrieval logic; adding one would clarify the structured information flow.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and describe the revisions incorporated into the updated manuscript.

Referee: [Experiments] Experiments section: the claim of stronger iterative scaling than baselines (self-consistency, tree search, etc.) is load-bearing for the central contribution, yet the manuscript does not state whether scaling curves are normalized to equal cumulative token usage or API calls. The added per-iteration costs of experience-bank lookups, guideline-bank updates, and multi-agent message passing are not present in simpler baselines; without explicit equalization, any observed lift could be an artifact of higher per-iteration budget rather than the claimed balance of exploration and exploitation.

Authors: We agree that explicit normalization to equal cumulative compute is necessary for a rigorous comparison. The original manuscript reported scaling primarily against iteration count. In the revised version we have added a dedicated subsection and new figures that replot all methods against total token consumption and API calls, thereby equalizing the overall budget. Under these controls TMAS continues to exhibit stronger iterative gains, which we attribute to the hierarchical memories reducing redundant exploration rather than simply spending more tokens per step.

Revision: yes

Referee: [Methods] Methods / hybrid reward description: the hybrid reward introduces free coefficients whose values are not ablated in detail. Because these coefficients directly control the trade-off among preserving base capability, experience utilization, and exploration, the stability and scaling improvements attributed to the hybrid scheme cannot be fully assessed without sensitivity analysis or default-value justification.

Authors: We acknowledge that a more detailed sensitivity study would strengthen the hybrid-reward claims. The revised manuscript now contains an ablation subsection that sweeps the coefficients over a representative range and reports the resulting effects on final accuracy, scaling slope, and iteration-to-iteration stability. The default values were selected via a small validation sweep to balance the three reward terms; the new results indicate that the reported improvements remain consistent within a broad neighborhood of those defaults.

Revision: yes
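The rebuttal names scaling slope and iteration-to-iteration stability without defining them. The sketch below assumes one plausible reading: slope as the least-squares slope of accuracy against iteration index, and stability as the spread of round-to-round changes. Both definitions, the function names, and the sample curves are assumptions for illustration, not the paper's metrics or data.

```python
from statistics import pstdev


def scaling_slope(acc):
    """Least-squares slope of accuracy against iteration index 0..n-1."""
    n = len(acc)
    if n < 2:
        raise ValueError("need at least two iterations")
    mean_x = (n - 1) / 2
    mean_y = sum(acc) / n
    cov = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(acc))
    var = sum((i - mean_x) ** 2 for i in range(n))
    return cov / var


def step_volatility(acc):
    """Population std. dev. of round-to-round changes (lower is steadier)."""
    if len(acc) < 2:
        raise ValueError("need at least two iterations")
    deltas = [b - a for a, b in zip(acc, acc[1:])]
    return pstdev(deltas)


# Illustrative curves only (NOT from the paper).
steady = [0.40, 0.45, 0.50, 0.55]
jumpy = [0.40, 0.55, 0.45, 0.55]

steady_slope = scaling_slope(steady)
steady_vol = step_volatility(steady)
jumpy_vol = step_volatility(jumpy)
```

Reporting the two numbers together separates a curve that climbs smoothly from one that reaches the same endpoint through large swings.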

Circularity Check

0 steps flagged

No significant circularity in TMAS framework derivation

full rationale

The paper proposes a new multi-agent test-time scaling framework with explicitly introduced components (hierarchical experience bank, guideline bank, and hybrid reward RL scheme) that are not defined in terms of the claimed performance outcomes. Claims of stronger iterative scaling rest on experimental comparisons rather than any reduction to fitted parameters, self-citations, or ansatzes from prior author work. No equations or derivation steps are presented that equate outputs to inputs by construction; the contribution is an independent engineering design whose validity is assessed externally via benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 2 invented entities

The paper introduces two new memory structures and a hybrid reward scheme whose effectiveness depends on domain assumptions about agent collaboration and information quality.

free parameters (1)

hybrid reward coefficients
Weights balancing basic reasoning preservation, experience utilization, and exploration are chosen or tuned during the RL stage.
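The ledger entry says only that weights balance three reward terms. As a minimal sketch, assume the hybrid reward is a weighted sum of three binary signals. The functional form, the signal names `correct`, `used_experience`, and `novel_strategy`, and the default weights are all hypothetical; the paper's actual reward is not reproduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardWeights:
    # Hypothetical defaults; the paper's tuned values are not given here.
    correctness: float = 1.0
    experience: float = 0.3
    exploration: float = 0.3


def hybrid_reward(correct: bool, used_experience: bool,
                  novel_strategy: bool,
                  w: RewardWeights = RewardWeights()) -> float:
    """Weighted sum of three binary signals (assumed functional form)."""
    return (w.correctness * float(correct)
            + w.experience * float(used_experience)
            + w.exploration * float(novel_strategy))


def sweep(values, **signals):
    """Reward as the exploration weight varies, other weights at defaults."""
    return {v: hybrid_reward(w=RewardWeights(exploration=v), **signals)
            for v in values}


# A correct rollout that ignored the experience bank but tried a new strategy.
table = sweep([0.0, 0.5, 1.0], correct=True,
              used_experience=False, novel_strategy=True)
```

A sweep of this shape is the simplest form of the sensitivity analysis the referee asks for: it shows how strongly a single coefficient moves the reward for a fixed rollout.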

axioms (1)

domain assumption: Specialized agents can maintain structured, low-noise information flow across trajectories and refinement iterations.
Invoked when the abstract claims that hierarchical memories enable effective cross-trajectory collaboration.

invented entities (2)

experience bank (no independent evidence)
purpose: Stores and reuses low-level reliable intermediate conclusions and local feedback.
New component introduced to support cross-trajectory collaboration.
guideline bank (no independent evidence)
purpose: Records high-level strategies to steer future rollouts away from redundant patterns.
New component introduced to improve exploration.

pith-pipeline@v0.9.0 · 5786 in / 1434 out tokens · 46721 ms · 2026-05-20T22:29:14.235590+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "TMAS introduces hierarchical memories: the experience bank reuses low-level reliable intermediate conclusions... while the guideline bank records previously explored high-level strategies"

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "hybrid reward reinforcement learning scheme... preserves basic reasoning capability, enhances experience utilization, and encourages exploration"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

42 extracted references · 42 canonical work pages

[1]

1000 is equivalent to 1,000

Numerical: 0.5 is equivalent to 1/2. 1000 is equivalent to 1,000

work page
[2]

\frac{1}{\sqrt{2}} is equivalent to \frac{\sqrt {2}}{2}

Algebraic: x+1 is equivalent to 1+x. \frac{1}{\sqrt{2}} is equivalent to \frac{\sqrt {2}}{2}

work page
[3]

Formatting: Ignore Markdown formatting (bold, italic), latex styling (\text{}, \ mathrm{}), or whitespace differences

work page
[4]

Ignore the student’s reasoning steps unless the result is embedded within them

Content: Focus ONLY on the final result value. Ignore the student’s reasoning steps unless the result is embedded within them

work page
[5]

Output Format: Respond strictly in JSON format

Units: If the reference implies units and the student omits them (or vice versa) but the number is correct, count it as correct unless the problem explicitly demands unit conversion. Output Format: Respond strictly in JSON format. Do not output markdown code blocks. LLM-as-Judge User Prompt <problem> {problem} </problem> <reference> {reference} </referenc...

work page
[6]

Analyze the mathematical value of both answers

work page
[7]

Determine if they represent the same solution (equivalent). 18

work page
[8]

reasoning

If the student answer contains a derivation, look for the final result. Respond in JSON: { "reasoning": "Brief explanation...", "equivalent": true/false } A.3. Implementation Details of Baseline Methods For Self-Refine, we generate 8 solutions in parallel, and each solution is refined independently in subsequent rounds without any interaction across diffe...

work page
[9]

A 2 × 1 tile is treated as covering exactly one full column

work page
[10]

A 2 × 2 block is therefore assumed to have only two tilings: 𝑇(2)=2

work page
[11]

The model derives 𝑇(𝑛)=𝑇(𝑛−1) +𝑇(𝑛−2) +𝑇(𝑛−4)

work page
[12]

Diagnostic error.The solution explicitly or im- plicitly rules out horizontal placements of the 2 × 1 tile

Therefore, 𝑇(4)=𝑇(3) +𝑇(2) +𝑇(0)=3+2+1=6. Diagnostic error.The solution explicitly or im- plicitly rules out horizontal placements of the 2 × 1 tile. By iteration 19, the no-experience baseline even states that horizontal placement is invalid, thereby reinforcing rather than cor- recting the original mistake. 6 wrong Correct solution pattern: rotation-awa...

work page
[13]

A 2 × 1 rectangular tile can be placed either vertically or horizontally

work page
[14]

Hence a 2×2 block has three tilings: 𝑇(2)=3, namely two vertical 2 × 1 tiles, one 2 × 2 square tile, or two horizontal 2×1 tiles

work page
[15]

The correct recurrence is 𝑇(𝑛)=𝑇(𝑛−1) +2𝑇(𝑛−2) +𝑇(𝑛−4)

work page
[16]

Key correction.The model explicitly identi- fies the prior error: the wrong solutions under- count because they assume 𝑇( 2)= 2 and ignore the horizontal-pair tiling

Therefore, 𝑇(4)=𝑇(3) +2𝑇(2) +𝑇(0)=5+6+1=12. Key correction.The model explicitly identi- fies the prior error: the wrong solutions under- count because they assume 𝑇( 2)= 2 and ignore the horizontal-pair tiling. 12 correct Figure 9. Comparison of wrong solution pattern and correct solution pattern. against ground-truth correctness, or employing stronger an...

work page
[19]

Merely citing a result without showing why it applies or how it works is considered a failure

**Self-Containment:** Referencing external papers/theorems is allowed **IF AND ONLY IF** you also present a valid proof or clear derivation of the referenced argument . Merely citing a result without showing why it applies or how it works is considered a failure. **Process:**

work page
[20]

Reason carefully about how to solve the problem

work page
[21]

Draft your solution mentally or in your scratchpad

work page
[22]

hand- waving

**Refine your solution** by fixing any potential logical gaps, ambiguity, or "hand- waving" arguments until it meets the highest standard of mathematical proof

work page
[23]

**Output Format:** Your response should follow this exact markdown format: ## Solution

Present *only* your best, finalized version. **Output Format:** Your response should follow this exact markdown format: ## Solution ... // Your final, rigorous solution to the problem here. Ensure all steps are explicitly shown and justified. --- Here is your task input: ## Problem {question} Verification Prompt ## Instruction 26 Your task is to evaluate ...

work page
[24]

Do NOT repeat logic that has already been identified as incorrect

**Error Correction:** You must explicitly address the flaws pointed out in the verification summaries. Do NOT repeat logic that has already been identified as incorrect

work page
[25]

If minor details are omitted, it is considered imperfect

**Completeness:** The solution must cover all cases and steps. If minor details are omitted, it is considered imperfect

work page
[26]

**Rigour:** Fatal errors or severe omissions are unacceptable

work page
[27]

**Process:**

**Self-Containment:** Referencing external papers/theorems is allowed **IF AND ONLY IF** you also present a valid proof or clear derivation of the referenced argument . **Process:**

work page
[28]

Read the **Problem** carefully

work page
[29]

Identify exactly what went wrong, what was incomplete, and what (if anything) was correct

Study each **Previous Attempt** and its **Verification Summary**. Identify exactly what went wrong, what was incomplete, and what (if anything) was correct

work page
[30]

Reason about how to fix the specific issues while retaining any correct sub-results from previous attempts

work page
[31]

Draft your refined solution, ensuring it does not repeat the confirmed errors

work page
[32]

**Output Format:** Your response should follow this exact markdown format: ## Solution

Present *only* your best, finalized, and fully corrected version. **Output Format:** Your response should follow this exact markdown format: ## Solution ... // Your final, rigorous, and corrected solution to the problem here. Ensure all steps are explicitly shown and justified. --- Here is your task input: ## Problem {question} Experience Context Appended...

work page
[33]

2025 == 2 mod 7

**Non-trivial**: It must involve meaningful mathematical work -- a derivation, a transformation, a non-obvious equivalence, or a structural observation. Trivial arithmetic evaluations (e.g., "2025 == 2 mod 7") do NOT qualify unless the congruence itself is the key insight that unlocks a deeper argument. 29

work page 2025
[34]

Prefer results that establish structure over results that are dead ends

**Reusable**: It must be a stepping stone -- something a future solver can directly cite and proceed from, without needing to redo the work. Prefer results that establish structure over results that are dead ends

work page
[35]

The substitution $u = x - 1/x$ reduces the integral to $\int \\frac{{du }}{{u^2+2}}$, which is a standard form

**Verifier-backed**: It must be explicitly confirmed correct by the verification summary. If verifiers are split on a step, do not add it as an Anchor. Verified Anchors fall into the following sub-types (use these to guide what you extract ): - **Structural Reduction**: A transformation that rewrites the problem or a sub-problem into a simpler or more tra...

work page 2025
[36]

Prioritize insights confirmed consistently across multiple rollouts

**ADD**: Extract new Verified Anchors or Error Avoidance Heuristics that are not already covered by the existing bank. Prioritize insights confirmed consistently across multiple rollouts

work page
[37]

**KEEP**: Retain all existing entries that remain valid and are not contradicted by the new rollouts

work page
[38]

Only merge entries that say the exact same thing about the exact same step

**REFINE**: If a new rollout provides a more precise version of an existing entry, rewrite it to be clearer. Only merge entries that say the exact same thing about the exact same step

work page
[39]

verified_anchors

**DELETE**: Remove entries explicitly revealed as incorrect by the verification summary. Remove entries that become fully subsumed after refinement. ## Quantity Guideline Aim for **20-35 entries** in total across both categories. Do NOT aggressively compress -- fine-grained, specific entries are more useful than over-generalized ones. Only merge entries t...

work page
[40]

**Memory of exploration**: It records which broad strategic directions have already been tried, so the solver does not waste computation repeating the same approach

work page
[41]

Guideline

**Diversity enforcement**: When the solver is about to generate a new solution, it will be shown this bank and instructed to pursue a direction that is ** fundamentally different** from everything listed here. The bank therefore acts as the primary mechanism for controlling exploration -- the richer and more precise this log is, the better the solver can ...

work page
[42]

**Identify** the high-level strategy used in the student’s solution (mathematical framework, key structural insight, angle of attack)

work page
[43]

**Compare** it against each entry in <already_attempted_strategies>

work page
[44]

identified_strategy

**Classify**: - **1**: The student’s strategy is genuinely different from ALL listed strategies. - **0**: The student’s strategy is essentially the same as at least one listed strategy. - **-1**: Cannot determine the student’s strategy (solution too vague/incomplete). Respond in a JSON code block: ‘‘‘json {{ "identified_strategy": "Brief description of th...

work page

[1] [1]

1000 is equivalent to 1,000

Numerical: 0.5 is equivalent to 1/2. 1000 is equivalent to 1,000

work page

[2] [2]

\frac{1}{\sqrt{2}} is equivalent to \frac{\sqrt {2}}{2}

Algebraic: x+1 is equivalent to 1+x. \frac{1}{\sqrt{2}} is equivalent to \frac{\sqrt {2}}{2}

work page

[3] [3]

Formatting: Ignore Markdown formatting (bold, italic), latex styling (\text{}, \ mathrm{}), or whitespace differences

work page

[4] [4]

Ignore the student’s reasoning steps unless the result is embedded within them

Content: Focus ONLY on the final result value. Ignore the student’s reasoning steps unless the result is embedded within them

work page

[5] [5]

Output Format: Respond strictly in JSON format

Units: If the reference implies units and the student omits them (or vice versa) but the number is correct, count it as correct unless the problem explicitly demands unit conversion. Output Format: Respond strictly in JSON format. Do not output markdown code blocks. LLM-as-Judge User Prompt <problem> {problem} </problem> <reference> {reference} </referenc...

work page

[6] [6]

Analyze the mathematical value of both answers

work page

[7] [7]

Determine if they represent the same solution (equivalent). 18

work page

[8] [8]

reasoning

If the student answer contains a derivation, look for the final result. Respond in JSON: { "reasoning": "Brief explanation...", "equivalent": true/false } A.3. Implementation Details of Baseline Methods For Self-Refine, we generate 8 solutions in parallel, and each solution is refined independently in subsequent rounds without any interaction across diffe...

work page

[9] [9]

A 2 × 1 tile is treated as covering exactly one full column

work page

[10] [10]

A 2 × 2 block is therefore assumed to have only two tilings: 𝑇(2)=2

work page

[11] [11]

The model derives 𝑇(𝑛)=𝑇(𝑛−1) +𝑇(𝑛−2) +𝑇(𝑛−4)

work page

[12] [12]

Diagnostic error.The solution explicitly or im- plicitly rules out horizontal placements of the 2 × 1 tile

Therefore, 𝑇(4)=𝑇(3) +𝑇(2) +𝑇(0)=3+2+1=6. Diagnostic error.The solution explicitly or im- plicitly rules out horizontal placements of the 2 × 1 tile. By iteration 19, the no-experience baseline even states that horizontal placement is invalid, thereby reinforcing rather than cor- recting the original mistake. 6 wrong Correct solution pattern: rotation-awa...

work page

[13] [13]

A 2 × 1 rectangular tile can be placed either vertically or horizontally

work page

[14] [14]

Hence a 2×2 block has three tilings: 𝑇(2)=3, namely two vertical 2 × 1 tiles, one 2 × 2 square tile, or two horizontal 2×1 tiles

work page

[15] [15]

The correct recurrence is 𝑇(𝑛)=𝑇(𝑛−1) +2𝑇(𝑛−2) +𝑇(𝑛−4)

work page

[16] [16]

Key correction.The model explicitly identi- fies the prior error: the wrong solutions under- count because they assume 𝑇( 2)= 2 and ignore the horizontal-pair tiling

Therefore, 𝑇(4)=𝑇(3) +2𝑇(2) +𝑇(0)=5+6+1=12. Key correction.The model explicitly identi- fies the prior error: the wrong solutions under- count because they assume 𝑇( 2)= 2 and ignore the horizontal-pair tiling. 12 correct Figure 9. Comparison of wrong solution pattern and correct solution pattern. against ground-truth correctness, or employing stronger an...

work page

[17] [19]

Merely citing a result without showing why it applies or how it works is considered a failure

**Self-Containment:** Referencing external papers/theorems is allowed **IF AND ONLY IF** you also present a valid proof or clear derivation of the referenced argument . Merely citing a result without showing why it applies or how it works is considered a failure. **Process:**

work page

[18] [20]

Reason carefully about how to solve the problem

work page

[19] [21]

Draft your solution mentally or in your scratchpad

work page

[20] [22]

hand- waving

**Refine your solution** by fixing any potential logical gaps, ambiguity, or "hand- waving" arguments until it meets the highest standard of mathematical proof

work page

[21] [23]

**Output Format:** Your response should follow this exact markdown format: ## Solution

Present *only* your best, finalized version. **Output Format:** Your response should follow this exact markdown format: ## Solution ... // Your final, rigorous solution to the problem here. Ensure all steps are explicitly shown and justified. --- Here is your task input: ## Problem {question} Verification Prompt ## Instruction 26 Your task is to evaluate ...


[22] [24]


**Error Correction:** You must explicitly address the flaws pointed out in the verification summaries. Do NOT repeat logic that has already been identified as incorrect


[23] [25]


**Completeness:** The solution must cover all cases and steps. If minor details are omitted, it is considered imperfect


[24] [26]

**Rigour:** Fatal errors or severe omissions are unacceptable


[25] [27]

**Self-Containment:** Referencing external papers/theorems is allowed **IF AND ONLY IF** you also present a valid proof or clear derivation of the referenced argument. **Process:**


[26] [28]

Read the **Problem** carefully


[27] [29]


Study each **Previous Attempt** and its **Verification Summary**. Identify exactly what went wrong, what was incomplete, and what (if anything) was correct


[28] [30]

Reason about how to fix the specific issues while retaining any correct sub-results from previous attempts


[29] [31]

Draft your refined solution, ensuring it does not repeat the confirmed errors


[30] [32]


Present *only* your best, finalized, and fully corrected version. **Output Format:** Your response should follow this exact markdown format: ## Solution ... // Your final, rigorous, and corrected solution to the problem here. Ensure all steps are explicitly shown and justified. --- Here is your task input: ## Problem {question} Experience Context Appended...


[31] [33]

**Non-trivial**: It must involve meaningful mathematical work -- a derivation, a transformation, a non-obvious equivalence, or a structural observation. Trivial arithmetic evaluations (e.g., "2025 == 2 mod 7") do NOT qualify unless the congruence itself is the key insight that unlocks a deeper argument.


[32] [34]


**Reusable**: It must be a stepping stone -- something a future solver can directly cite and proceed from, without needing to redo the work. Prefer results that establish structure over results that are dead ends


[33] [35]

The substitution $u = x - 1/x$ reduces the integral to $\int \\frac{{du }}{{u^2+2}}$, which is a standard form

**Verifier-backed**: It must be explicitly confirmed correct by the verification summary. If verifiers are split on a step, do not add it as an Anchor. Verified Anchors fall into the following sub-types (use these to guide what you extract): - **Structural Reduction**: A transformation that rewrites the problem or a sub-problem into a simpler or more tra...


[34] [36]


**ADD**: Extract new Verified Anchors or Error Avoidance Heuristics that are not already covered by the existing bank. Prioritize insights confirmed consistently across multiple rollouts


[35] [37]

**KEEP**: Retain all existing entries that remain valid and are not contradicted by the new rollouts


[36] [38]


**REFINE**: If a new rollout provides a more precise version of an existing entry, rewrite it to be clearer. Only merge entries that say the exact same thing about the exact same step


[37] [39]

verified_anchors

**DELETE**: Remove entries explicitly revealed as incorrect by the verification summary. Remove entries that become fully subsumed after refinement. ## Quantity Guideline Aim for **20-35 entries** in total across both categories. Do NOT aggressively compress -- fine-grained, specific entries are more useful than over-generalized ones. Only merge entries t...
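The four bank operations (ADD, KEEP, REFINE, DELETE) can be pictured as updates to a keyed store. The sketch below is a hypothetical illustration only: in the prompt these decisions are made by the model from verification summaries, and the class name `ExperienceBank`, the entry ids, and the dictionary layout are all assumptions rather than anything specified in the source.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, Optional


@dataclass
class ExperienceBank:
    # Hypothetical layout: entry id -> entry text.
    entries: Dict[str, str] = field(default_factory=dict)

    def apply(
        self,
        add: Optional[Dict[str, str]] = None,
        refine: Optional[Dict[str, str]] = None,
        delete: Optional[Iterable[str]] = None,
    ) -> "ExperienceBank":
        """Apply one update round; entries not mentioned are kept (KEEP)."""
        for key in delete or []:                    # DELETE: drop refuted entries
            self.entries.pop(key, None)
        for key, text in (refine or {}).items():    # REFINE: rewrite existing entries only
            if key in self.entries:
                self.entries[key] = text
        for key, text in (add or {}).items():       # ADD: only ids not already in the bank
            self.entries.setdefault(key, text)
        return self


bank = ExperienceBank({"a1": "T(2) = 3", "a2": "T(2) = 2"})
bank.apply(
    add={"a3": "T(3) = 5"},
    refine={"a1": "T(2) = 3 (two vertical, one square, two horizontal)"},
    delete=["a2"],
)
print(sorted(bank.entries))  # ['a1', 'a3']
```

Keying entries by id keeps REFINE from silently merging entries that concern different steps, which is in the spirit of the instruction to merge only entries that say the same thing about the same step.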


[38] [40]

**Memory of exploration**: It records which broad strategic directions have already been tried, so the solver does not waste computation repeating the same approach


[39] [41]

Guideline

**Diversity enforcement**: When the solver is about to generate a new solution, it will be shown this bank and instructed to pursue a direction that is **fundamentally different** from everything listed here. The bank therefore acts as the primary mechanism for controlling exploration -- the richer and more precise this log is, the better the solver can ...


[40] [42]

**Identify** the high-level strategy used in the student’s solution (mathematical framework, key structural insight, angle of attack)


[41] [43]

**Compare** it against each entry in <already_attempted_strategies>


[42] [44]

**Classify**:
- **1**: The student’s strategy is genuinely different from ALL listed strategies.
- **0**: The student’s strategy is essentially the same as at least one listed strategy.
- **-1**: Cannot determine the student’s strategy (solution too vague/incomplete).

Respond in a JSON code block: ‘‘‘json {{ "identified_strategy": "Brief description of th...
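The doubled braces in this template, together with the single-brace `{question}` placeholder in the earlier prompts, are consistent with Python `str.format` escaping, although the source does not say how the templates are filled. The sketch below assumes that mechanism and uses a hypothetical, shortened template rather than the paper's full prompt.

```python
# Hypothetical, shortened template; not the paper's full prompt.
TEMPLATE = (
    "## Problem\n{question}\n\n"
    "Respond in a JSON code block:\n"
    '{{ "identified_strategy": "<brief description>" }}\n'
)

# str.format substitutes {question} and turns {{ / }} into literal braces.
prompt = TEMPLATE.format(question="What is 2025 mod 7?")
print(prompt)
```

With this escaping, the JSON skeleton reaches the model with single braces while the problem statement is substituted in place.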
