Correct Prediction, Wrong Steps? Consensus Reasoning Knowledge Graph for Robust Chain-of-Thought Synthesis

Seonil Son; Shenghong Fu; Shuliang Liu; Xuming Hu; Yao Wan; Yuehao Tang; Zipeng Ling

arxiv: 2604.14121 · v1 · submitted 2026-04-15 · 💻 cs.CL

Pith reviewed 2026-05-10 12:40 UTC · model grok-4.3

classification 💻 cs.CL

keywords chain-of-thought · reasoning knowledge graph · LLM reasoning flaws · consensus synthesis · logical reasoning · mathematical reasoning · topological generation

The pith

A consensus graph built from overlapping segments of multiple LLM reasoning traces yields more accurate and higher-quality chain-of-thought outputs than providing ground-truth labels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models generate reasoning traces that contain both internal logical errors and step-wise problems such as overthinking or underthinking. Providing the correct final answer as a label surprisingly fails to improve the quality of the intermediate steps. The paper shows that extracting the parts that appear in common across several independent traces and organizing them into a graph allows the model to synthesize a new trace through topological ordering. This process reduces both kinds of flaws at once. Experiments on logical and mathematical benchmarks confirm gains in final-answer accuracy and in the internal structure of the reasoning paths.

Core claim

The central claim is that a Reasoning Knowledge Graph, built from the consensus segments shared across multiple candidate traces and then traversed in topological order to generate a single coherent path, mitigates two kinds of flaw at once: step-internal flaws (logical errors, hallucinations) and step-wise flaws (over- or underthinking). The resulting traces are claimed to have more accurate final predictions and more reliable intermediate steps than traces obtained from single generations or from ground-truth labels alone.

What carries the argument

The Reasoning Knowledge Graph (RKG) assembled from overlapping reliable segments of multiple candidate traces, which supports topological ordering to synthesize one clean reasoning path.
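
To make the mechanism concrete, here is a minimal Python sketch of consensus filtering followed by topological ordering. It is not the paper's implementation. It assumes each trace is already split into normalized step strings that can be matched exactly, that an edge means "appeared immediately before" in some trace, and that a hypothetical `min_support` vote threshold defines consensus. The function names, the threshold, and the toy arithmetic traces are all invented for illustration.

```python
from collections import Counter
from graphlib import TopologicalSorter


def build_consensus_graph(traces, min_support=2):
    """Keep steps and consecutive-step edges seen in at least `min_support` traces."""
    step_votes = Counter()
    edge_votes = Counter()
    for trace in traces:
        step_votes.update(set(trace))                  # one vote per trace per step
        edge_votes.update(set(zip(trace, trace[1:])))  # one vote per trace per edge
    nodes = {s for s, v in step_votes.items() if v >= min_support}
    edges = {
        (a, b)
        for (a, b), v in edge_votes.items()
        if v >= min_support and a in nodes and b in nodes
    }
    return nodes, edges


def synthesize(nodes, edges):
    """Order consensus steps so that every step comes after its predecessors."""
    predecessors = {n: set() for n in nodes}
    for a, b in edges:
        predecessors[b].add(a)
    return list(TopologicalSorter(predecessors).static_order())


# Toy, made-up traces (assumption: steps are already normalized strings).
traces = [
    ["parse prices", "apples: 4*2=8", "oranges: 5*3=15", "total: 8+15=23"],
    ["parse prices", "apples: 4*2=8", "oranges: 5*3=15", "recheck units", "total: 8+15=23"],  # overthinking
    ["parse prices", "apples: 4*2=8", "total: 8+15=23"],                                      # underthinking
    ["parse prices", "apples: 4*2=8", "oranges: 5*3=15", "total: 8+15=23"],
]

nodes, edges = build_consensus_graph(traces, min_support=2)
consensus_trace = synthesize(nodes, edges)
print(consensus_trace)
```

In this toy run the once-only "recheck units" step and the one-off shortcut edge from the underthinking trace both fall below the threshold, so the synthesized order is the four-step chain. `TopologicalSorter` raises `graphlib.CycleError` if the surviving edges form a cycle, a case a real system would need to handle.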

If this is right

Label-prediction accuracy rises by more than 10 percent on average compared with strong baselines.
The method outperforms all tested baselines on both logical-reasoning and mathematical-reasoning benchmarks.
Reasoning-trace quality improves across several measured dimensions even when final-answer accuracy is held constant.
Direct supervision by ground-truth labels alone produces no measurable gain in reasoning ability.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The approach may lower the cost of creating high-quality reasoning data by replacing some human step-by-step annotation with automated consensus extraction.
If the same consensus technique is applied at inference time, it could serve as a lightweight verifier that does not require an external model or additional training.
The method might be combined with existing verification modules that check individual edges in the graph rather than the whole trace.

Load-bearing premise

Shared segments across independently generated traces are more likely to reflect correct reasoning than to reflect the same error repeated by the model.
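
A back-of-the-envelope calculation shows why this premise carries so much weight. The sketch below assumes, purely for illustration, that each of `k` traces includes a given segment independently with probability `p`, and that a segment enters the graph when at least `m` traces contain it. Independence is exactly what a shared model bias would violate, and `p`, `k`, and `m` are hypothetical values, not settings from the paper.

```python
from math import comb


def support_probability(p, k, m):
    """P(at least m of k independent traces contain a segment each includes with prob. p)."""
    return sum(comb(k, i) * p**i * (1 - p) ** (k - i) for i in range(m, k + 1))


k, m = 5, 3  # hypothetical: 5 sampled traces, majority support required

correct_step = support_probability(0.7, k, m)      # a step most traces get right, ~0.837
idiosyncratic_err = support_probability(0.2, k, m)  # an error that occurs sporadically, ~0.058
systematic_err = support_probability(0.9, k, m)     # an error the model makes almost always, ~0.991

print(round(correct_step, 3), round(idiosyncratic_err, 3), round(systematic_err, 3))
```

Under these assumptions, consensus filters out sporadic errors well but passes systematic ones almost as readily as correct steps, which is the scenario the premise has to rule out.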

What would settle it

A controlled test in which every sampled trace contains the same incorrect intermediate step due to a shared model bias, yet the consensus graph still reproduces that step in the final synthesized trace.
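
The shape of such a test can be sketched in a few lines. This is a toy under stated assumptions, not the paper's pipeline: steps are exact-match strings, consensus is a simple vote count, and the wrong step (`SHARED_ERROR`) and the traces are made up to stand in for a shared model bias.

```python
from collections import Counter


def consensus_steps(traces, min_support):
    """Return the steps that appear in at least `min_support` traces."""
    votes = Counter(step for trace in traces for step in set(trace))
    return {step for step, n in votes.items() if n >= min_support}


SHARED_ERROR = "17 is divisible by 3"  # hypothetical wrong step present in every trace

traces = [
    ["read problem", SHARED_ERROR, "conclude 17 is composite"],
    ["read problem", "try small divisors", SHARED_ERROR, "conclude 17 is composite"],
    ["read problem", SHARED_ERROR, "conclude 17 is composite"],
]

# Strictest possible threshold: a step must appear in every trace.
kept = consensus_steps(traces, min_support=len(traces))
error_survives = SHARED_ERROR in kept
print(error_survives)
```

Even unanimous support cannot remove the error here, because vote counting has no access to correctness. A real version of the test would need to check whether the paper's graph construction and generation steps behave differently from this plain vote.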

Figures

Figures reproduced from arXiv: 2604.14121 by Seonil Son, Shenghong Fu, Shuliang Liu, Xuming Hu, Yao Wan, Yuehao Tang, Zipeng Ling.

**Figure 2.** Overview of the CRAFT framework and evaluation.

**Figure 3.** Significance testing (paired Wilcoxon, w/ Answer …

**Figure 4.** K-sensitivity: accuracy changes according to …

Original abstract

LLM reasoning traces suffer from complex flaws -- *Step Internal Flaws* (logical errors, hallucinations, etc.) and *Step-wise Flaws* (overthinking, underthinking), which vary by sample. A natural approach would be to provide ground-truth labels to guide LLMs' reasoning. Contrary to intuition, we show that this yields no improvement in reasoning ability. We then propose CRAFT, a unified framework that mitigates both types of Step flaws, which builds a Reasoning Knowledge Graph (RKG) based on the consensus parts of multiple candidate traces, and synthesizes a high-quality trace through topological generation. Our approach improves label-prediction accuracy by 10+% on average, and consistently outperforms all baselines across both logical and mathematical reasoning benchmarks. Further, detailed benchmark evaluation proves that our method also improves the quality of LLMs' reasoning traces in multiple dimensions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

Ground-truth labels don't improve CoT reasoning, but consensus graphs from multiple traces offer a workable alternative with reported gains.

The main point is that providing the ground-truth final answer as a label produces no gain in LLM reasoning ability, which undercuts the usual intuition about supervision. The authors respond with CRAFT, which pulls consensus segments from several candidate traces, builds a Reasoning Knowledge Graph, and generates a new trace via topological ordering. They report average improvements above 10% in label-prediction accuracy and consistent outperformance of baselines on logical and mathematical benchmarks, along with better trace quality on several dimensions.

Referee Report

3 major / 2 minor

Summary. The paper introduces CRAFT, a framework for synthesizing robust Chain-of-Thought (CoT) reasoning traces in LLMs. It shows that simply providing ground-truth labels does not improve reasoning performance. Instead, it constructs a Reasoning Knowledge Graph (RKG) from the consensus segments of multiple candidate traces generated by the LLM and uses topological generation to produce a high-quality trace. The method is reported to achieve over 10% average improvement in label-prediction accuracy and to outperform baselines on logical and mathematical reasoning tasks, while also enhancing the quality of reasoning traces in several dimensions.

Significance. If the results hold under proper controls, this work contributes by highlighting the ineffectiveness of direct ground-truth supervision for LLM reasoning and proposing a consensus-graph method to address step-internal and step-wise flaws. The empirical outperformance across benchmarks and the graph-based synthesis approach are strengths that could inform more reliable reasoning systems if the consensus reliably isolates correct steps rather than shared errors.

major comments (3)

[§4] §4 (Experiments) and associated result tables: The central claims of >10% average accuracy gains and consistent outperformance lack reported details on statistical significance testing, number of random seeds, variance across runs, or controls for prompt variation. This information is load-bearing for validating the robustness of the improvements over baselines.
[§3] §3 (RKG construction): The core assumption that overlapping consensus segments across candidate traces correspond to correct reasoning steps (rather than common LLM biases or errors) receives no targeted validation, such as error analysis on the extracted segments or comparison against ground-truth step correctness. This directly threatens the claim that topological generation from the RKG mitigates flaws and produces higher-quality traces.
[§5] §5 (Trace quality evaluation): The assertion of improved reasoning trace quality 'in multiple dimensions' is not supported by explicit metrics, quantification methods, or inter-annotator agreement scores (if human evaluation was involved), making it impossible to assess whether the gains are substantive or merely presentational.

minor comments (2)

[Abstract] Abstract: The specific logical and mathematical benchmarks used are not named, which reduces immediate clarity for readers scanning the contribution.
[§2] Notation for RKG nodes/edges and topological generation could be introduced with a small illustrative example in the early sections to aid comprehension.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments, which highlight important areas for strengthening the empirical rigor and clarity of our work. We address each major comment below and will revise the manuscript to incorporate additional details and analyses where needed.

Referee: [§4] §4 (Experiments) and associated result tables: The central claims of >10% average accuracy gains and consistent outperformance lack reported details on statistical significance testing, number of random seeds, variance across runs, or controls for prompt variation. This information is load-bearing for validating the robustness of the improvements over baselines.

Authors: We agree that these statistical details are essential for demonstrating robustness. In the revised manuscript, we will report the number of random seeds (specifically 5 seeds), include means and standard deviations across runs in the result tables, perform and report statistical significance tests (e.g., paired t-tests with p-values) comparing CRAFT to baselines, and explicitly describe prompt standardization controls used across all methods to ensure fair comparison.
revision: yes

Referee: [§3] §3 (RKG construction): The core assumption that overlapping consensus segments across candidate traces correspond to correct reasoning steps (rather than common LLM biases or errors) receives no targeted validation, such as error analysis on the extracted segments or comparison against ground-truth step correctness. This directly threatens the claim that topological generation from the RKG mitigates flaws and produces higher-quality traces.

Authors: We acknowledge that direct validation of the consensus assumption would strengthen the paper. While the consistent outperformance on benchmarks provides supporting evidence, we will add a new analysis subsection in the revised version. This will include error analysis on a sample of consensus segments, comparing them against ground-truth step correctness on the logic and math benchmarks, and discussing any observed cases of shared biases or errors.
revision: yes

Referee: [§5] §5 (Trace quality evaluation): The assertion of improved reasoning trace quality 'in multiple dimensions' is not supported by explicit metrics, quantification methods, or inter-annotator agreement scores (if human evaluation was involved), making it impossible to assess whether the gains are substantive or merely presentational.

Authors: We agree that §5 would benefit from greater explicitness. The original evaluation used a combination of automated metrics (e.g., for logical consistency and completeness) and qualitative review across dimensions. We will revise §5 to define each dimension explicitly, detail the quantification methods, report specific scores, and include inter-annotator agreement if human evaluation was part of the process.
revision: yes
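
As a rough illustration of the per-seed reporting promised in the first response, the sketch below computes the mean and standard deviation of paired differences and an exact paired sign-flip permutation p-value. The sign-flip test is a stand-in chosen because it needs only the standard library. It is neither the paired t-test named in the rebuttal nor the paired Wilcoxon test mentioned in the Figure 3 caption. The accuracy values are invented placeholders, not results from the paper.

```python
from itertools import product
from statistics import mean, stdev


def sign_flip_p_value(diffs):
    """Exact two-sided paired sign-flip permutation test on per-seed differences."""
    observed = abs(mean(diffs))
    hits = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = [s * d for s, d in zip(signs, diffs)]
        if abs(mean(flipped)) >= observed - 1e-12:  # small tolerance for float noise
            hits += 1
    return hits / total


# Made-up per-seed accuracies for 5 seeds (placeholders only).
method_acc = [0.71, 0.69, 0.73, 0.70, 0.72]
baseline_acc = [0.60, 0.61, 0.62, 0.58, 0.63]

diffs = [m - b for m, b in zip(method_acc, baseline_acc)]
mean_gain = mean(diffs)
gain_sd = stdev(diffs)
p_value = sign_flip_p_value(diffs)
print(f"mean gain {mean_gain:.3f} ± {gain_sd:.3f}, p = {p_value}")
```

With five seeds the smallest two-sided p-value an exact sign-flip test can return is 2/2^5 = 0.0625, reached when every seed favors the same method, and the exact Wilcoxon signed-rank test has the same floor. That may be worth keeping in mind when choosing the number of seeds or the test.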

Circularity Check

0 steps flagged

No circularity: empirical construction with external validation

The paper describes an empirical method (CRAFT) that extracts consensus segments from multiple candidate traces to build an RKG and then performs topological generation. All reported gains (10+% accuracy, improved trace quality) are presented as outcomes of benchmark experiments rather than closed-form predictions or derivations. No equations, fitted parameters, or self-citations are shown to reduce the central result to its own inputs by construction. The assumption that consensus segments are reliable is an external modeling choice, not a self-referential loop. The work is therefore self-contained against external benchmarks and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract supplies no explicit free parameters, axioms, or invented entities; the method implicitly assumes that consensus segments are meaningful and that topological ordering yields coherent traces, but these are not formalized.
