pith. machine review for the scientific record.

arxiv: 2604.02967 · v1 · submitted 2026-04-03 · 💻 cs.AI · cs.CL

Recognition: 2 theorem links

· Lean Theorem

FoE: Forest of Errors Makes the First Solution the Best in Large Reasoning Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 19:35 UTC · model grok-4.3

classification 💻 cs.AI cs.CL
keywords Large Reasoning Models · Forest of Errors · Test-time scaling · Reasoning efficiency · Error propagation · Self-guided inference
0 comments

The pith

In large reasoning models, the first solution is the best because errors form a growing forest that harms later paths.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large reasoning models explore multiple solutions at test time, yet the first one consistently outperforms the rest. The paper characterizes the errors that appear across these paths as a forest-structured pattern, the Forest of Errors, which grows as more computation is spent. This pattern directly challenges the assumption that generating additional solutions improves results. The authors therefore introduce a framework that strengthens the initial solution and discards the subsequent ones.

Core claim

The central claim is that FoE makes the First the Best: errors within reasoning paths form a forest structure and scale concurrently with test time, so alternative solutions become not merely suboptimal but actively detrimental to final performance. This observation is supported by empirical characterization across models and by theoretical analysis, and it motivates a self-guided method that refines the first path while discarding later ones.

What carries the argument

Forest of Errors (FoE), the forest-structured pattern of errors that accumulates in reasoning paths as test-time compute increases.

Load-bearing premise

Errors within the reasoning path scale concurrently with test time.
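A toy simulation, ours rather than the paper's, makes the premise concrete: if each step's error probability grows with the number of errors already accumulated in the shared context, later solutions inherit a larger forest and are correct less often. The functional form and all parameters below are illustrative assumptions.

```python
import random

def simulate(n_paths=4, steps=20, base_err=0.02, carry=0.5, trials=20000):
    """Toy model (not the paper's): each of n_paths sequential reasoning
    paths makes `steps` steps. A step is wrong with probability base_err
    scaled up by the errors already in the shared context (the growing
    'forest'); `carry` controls how strongly earlier errors contaminate
    later paths. Returns per-path accuracy estimates."""
    correct = [0] * n_paths
    for _ in range(trials):
        forest = 0  # errors accumulated in the context so far
        for p in range(n_paths):
            wrong = False
            for _ in range(steps):
                if random.random() < min(1.0, base_err * (1 + carry * forest)):
                    wrong = True
                    forest += 1
            if not wrong:
                correct[p] += 1
    return [c / trials for c in correct]

random.seed(0)
accs = simulate()
# Under this toy model, accuracy falls with path index:
# the first solution is (stochastically) the best.
```

Setting `carry=0` decouples the paths and the per-path accuracies flatten out, which is one way to see that the "first is best" pattern here is driven entirely by the contamination term.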

What would settle it

An experiment in which generating more alternative solutions reduces overall error rates and improves final accuracy would falsify the claim.
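One hedged way to operationalize that test: compare first-solution accuracy against majority voting over several solutions, under two illustrative regimes. The numbers below are ours, and the simplifying assumption that all wrong paths agree on a single wrong answer (the worst case for voting) is not from the paper.

```python
from itertools import product

def majority_acc(path_accs):
    """Probability that majority voting returns the correct answer, assuming
    independent paths where path i is correct with probability path_accs[i]
    and all wrong paths agree on one wrong answer; ties go to the first
    path's answer. An illustrative model, not the paper's analysis."""
    total = 0.0
    n = len(path_accs)
    for outcome in product([0, 1], repeat=n):  # 1 = path is correct
        p = 1.0
        for acc, ok in zip(path_accs, outcome):
            p *= acc if ok else (1 - acc)
        votes = sum(outcome)
        if votes * 2 > n or (votes * 2 == n and outcome[0] == 1):
            total += p
    return total

flat = [0.70, 0.70, 0.70]       # later solutions as good as the first
declining = [0.70, 0.50, 0.45]  # later solutions degraded, as FoE predicts
# With flat accuracies, voting beats the first solution (0.784 > 0.70);
# with declining accuracies, voting falls below it (0.575 < 0.70).
```

The toy calculation shows where the claim lives: whether extra samples help or hurt hinges entirely on whether later paths keep the first path's accuracy, which is exactly what the falsifying experiment would have to measure.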

Figures

Figures reproduced from arXiv: 2604.02967 by Guojie Song, Haonan Dong, Kehan Jiang, Zhaolu Kang, Zhengzhou Zhu.

Figure 1. (Upper) The First is The Best. (Lower) Forest of Errors.
Figure 2. (Left) Key observations within the FoE. (Right) Our proposed RED framework.
Figure 3. Manual correction on distinct error node types.
Figure 4. Distribution of various node types with respect …
Figure 5. Average distribution of correction types (True, …
Figure 6. Evaluation of intra-solution reflection metrics.
Figure 7. Test-time scalability under self-consistency.
Figure 8. Average sampling error rate (%, lower is better).
Figure 9. Distribution of absolute PCS disagreement …
Figure 10. Convergence (A, success). For a fixed probe prompt, the modal answer ratio (mode count / N) increases over checkpoints and exceeds the threshold P before the First answer becomes explicit; bars are green since the mode is correct.
Figure 12. Robustness (B, success). With M diverse probe prompts and one sample per prompt (N=1), the number of unique induced answers decreases over time and eventually reaches 1 (green), indicating cross-prompt agreement.
Figure 14. Robustness+Parallelism (B+C, success). Adding per-prompt parallel samples stabilizes prompt-wise modes; once modes align, the trigger is correct (green), and the minimum internal mode ratio rises.
Figure 15. Robustness+Parallelism (B+C, failure). Without enforcing an internal threshold, prompt-wise modes may align too early when internal mode ratios are still low, producing a wrong trigger (red).
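The convergence criterion described in Figure 10's caption can be sketched directly: at each checkpoint we have N probe answers, and the trigger fires once the modal answer ratio (mode count / N) exceeds the threshold P. Function and variable names here are ours, not the paper's.

```python
from collections import Counter

def modal_ratio_trigger(answers_per_checkpoint, P=0.60):
    """Sketch of the Figure 10 convergence check: scan checkpoints in order
    and fire at the first one whose modal answer ratio exceeds P.
    Returns (checkpoint_index, modal_answer), or (None, None) if the
    threshold is never crossed."""
    for t, answers in enumerate(answers_per_checkpoint):
        mode_ans, mode_count = Counter(answers).most_common(1)[0]
        if mode_count / len(answers) > P:
            return t, mode_ans
    return None, None

# Toy probe answers over three checkpoints: the mode ratio rises
# 0.25 -> 0.50 -> 0.75, so the trigger fires at the third checkpoint.
checkpoints = [["a", "b", "c", "d"], ["a", "a", "b", "c"], ["a", "a", "a", "b"]]
```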
Original abstract

Recent Large Reasoning Models (LRMs) like DeepSeek-R1 have demonstrated remarkable success in complex reasoning tasks, exhibiting human-like patterns in exploring multiple alternative solutions. Upon closer inspection, however, we uncover a surprising phenomenon: The First is The Best, where alternative solutions are not merely suboptimal but potentially detrimental. This observation challenges widely accepted test-time scaling laws, leading us to hypothesize that errors within the reasoning path scale concurrently with test time. Through comprehensive empirical analysis, we characterize errors as a forest-structured Forest of Errors (FoE) and conclude that FoE makes the First the Best, which is underpinned by rigorous theoretical analysis. Leveraging these insights, we propose RED, a self-guided efficient reasoning framework comprising two components: I) Refining First, which suppresses FoE growth in the first solution; and II) Discarding Subs, which prunes subsequent FoE via dual-consistency. Extensive experiments across five benchmarks and six backbone models demonstrate that RED outperforms eight competitive baselines, achieving performance gains of up to 19.0% while reducing token consumption by 37.7% ~ 70.4%. Moreover, comparative experiments on FoE metrics shed light on how RED achieves effectiveness.
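The abstract's "Discarding Subs" component can be sketched as a pruning rule; everything below (the step representation, the overlap metric, and the threshold `tau`) is our assumption, since the paper's dual-consistency definition is not given here.

```python
def discard_subs(solutions, answer_of, steps_of, tau=0.8):
    """Hedged sketch of the 'Discarding Subs' idea: keep the first solution,
    and retain a subsequent one only if it is dual-consistent with the
    first, i.e. it agrees on the final answer AND a fraction >= tau of its
    steps overlap with the first solution's steps. The real criterion may
    differ; this only illustrates an answer-level + step-level check."""
    kept = [solutions[0]]
    first_ans = answer_of(solutions[0])
    first_steps = set(steps_of(solutions[0]))
    for sol in solutions[1:]:
        answer_consistent = answer_of(sol) == first_ans
        steps = set(steps_of(sol))
        step_overlap = len(steps & first_steps) / max(1, len(steps))
        if answer_consistent and step_overlap >= tau:
            kept.append(sol)
    return kept

# Toy solutions as (answer, steps) pairs: the second agrees with the first
# on both levels, the third disagrees on the answer, the fourth shares no steps.
sols = [("42", ("a", "b", "c")), ("42", ("a", "b", "c")),
        ("41", ("a", "b", "c")), ("42", ("x", "y", "z"))]
kept = discard_subs(sols, answer_of=lambda s: s[0], steps_of=lambda s: s[1])
```

Requiring both levels is what distinguishes this from plain self-consistency voting: an alternative path that reaches the same answer through disjoint (possibly error-laden) steps is still pruned.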

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that large reasoning models exhibit a 'The First is The Best' phenomenon because errors form a Forest of Errors (FoE) that scales concurrently with test-time compute and additional reasoning paths, rendering later solutions detrimental rather than merely suboptimal. It supports this via empirical analysis across five benchmarks and six models, proposes the RED framework (Refining First to suppress FoE growth, plus Discarding Subs via dual-consistency pruning), and reports gains of up to 19.0% with 37.7–70.4% token reduction over eight baselines.

Significance. If the error-scaling mechanism can be isolated from artifacts and the theoretical account formalized, the work would meaningfully challenge test-time scaling assumptions in LRMs and supply a practical efficiency recipe. The breadth of experiments (multiple backbones and benchmarks) plus the introduction of FoE metrics for comparative analysis are concrete strengths that could influence follow-on work on reasoning efficiency.

major comments (3)
  1. [Abstract / §3] Abstract and §3 (theoretical analysis): the central claim that 'errors within the reasoning path scale concurrently with test time' is asserted as underpinned by rigorous theory, yet no equations, formal model of per-step error accumulation, or derivation showing why later paths become net detrimental appear in the text; without this the hypothesis remains observational.
  2. [§4] §4 (empirical characterization of FoE): no explicit statistics (per-step error probability, inter-path divergence, or forest-depth histograms) are reported that isolate concurrent error growth from prompt bias, early stopping, or benchmark saturation; the superiority of the first solution could therefore be explained by alternative mechanisms.
  3. [§5.2] §5.2 (Discarding Subs): the dual-consistency criterion is described at a high level but lacks a precise definition of consistency metric and an ablation confirming it does not presuppose the first solution is correct, which is load-bearing for the claim that subsequent solutions are actively detrimental.
minor comments (2)
  1. [Figures / §4.3] Figure captions and §4.3 should explicitly define the FoE metrics used in the comparative experiments so readers can reproduce the reported gains.
  2. [Abstract] The abstract states 'comprehensive empirical analysis' but provides no table of per-benchmark error rates or controls; adding these would strengthen verifiability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and have revised the manuscript to strengthen the theoretical formalization, empirical isolation of mechanisms, and precise definitions with supporting ablations.

Point-by-point responses
  1. Referee: [Abstract / §3] Abstract and §3 (theoretical analysis): the central claim that 'errors within the reasoning path scale concurrently with test time' is asserted as underpinned by rigorous theory, yet no equations, formal model of per-step error accumulation, or derivation showing why later paths become net detrimental appear in the text; without this the hypothesis remains observational.

    Authors: We appreciate this observation. While §3 presented a conceptual framework and empirical characterization of the Forest of Errors, we acknowledge that it lacked formal equations or derivations. In the revised manuscript, we have added a mathematical model in §3, including equations for per-step error probability accumulation and a derivation showing why additional paths become net detrimental under concurrent FoE scaling. revision: yes

  2. Referee: [§4] §4 (empirical characterization of FoE): no explicit statistics (per-step error probability, inter-path divergence, or forest-depth histograms) are reported that isolate concurrent error growth from prompt bias, early stopping, or benchmark saturation; the superiority of the first solution could therefore be explained by alternative mechanisms.

    Authors: We thank the referee for highlighting the need for stronger isolation. The original §4 reported aggregate FoE metrics, but the revised version now includes explicit per-step error probabilities, inter-path divergence measures, and forest-depth histograms. We have also added controlled experiments and ablations addressing prompt bias, early stopping, and benchmark saturation to rule out alternative explanations. revision: yes

  3. Referee: [§5.2] §5.2 (Discarding Subs): the dual-consistency criterion is described at a high level but lacks a precise definition of consistency metric and an ablation confirming it does not presuppose the first solution is correct, which is load-bearing for the claim that subsequent solutions are actively detrimental.

    Authors: We agree that greater precision is required. The revised §5.2 now provides the exact mathematical definition of the dual-consistency metric (joint answer-level and step-level consistency). We have added an ablation study evaluating pruning performance on instances where the first solution is incorrect, confirming that the method does not presuppose its correctness. revision: yes

Circularity Check

0 steps flagged

No circularity: observation-to-characterization chain remains independent of fitted inputs

full rationale

The paper begins from an empirical observation ('The First is The Best') on LRMs, forms a hypothesis about concurrent error scaling with test-time compute, introduces FoE as a descriptive characterization of that scaling, and then validates a downstream framework (RED) through benchmark experiments. No equation or derivation reduces the central claim to a parameter fit by construction, nor does any load-bearing step rely on a self-citation whose content is itself unverified within the paper. The theoretical analysis is presented as supporting the characterization rather than being definitionally equivalent to the input observation. External benchmarks and comparative FoE metrics provide falsifiable content outside the fitted values.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the unverified hypothesis that errors scale with test time and form a forest structure; the abstract details no free parameters, and no invented entity is given independent evidence.

axioms (1)
  • domain assumption errors within the reasoning path scale concurrently with test time
    Explicitly stated as the basis for hypothesizing FoE structure.
invented entities (1)
  • Forest of Errors (FoE) no independent evidence
    purpose: characterize the structure of errors that makes the first solution best
    New term introduced to explain the empirical pattern; no independent falsifiable handle provided in abstract.

pith-pipeline@v0.9.0 · 5521 in / 1293 out tokens · 41026 ms · 2026-05-13T19:35:20.304403+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Agent-ValueBench: A Comprehensive Benchmark for Evaluating Agent Values

    cs.AI 2026-05 unverdicted novelty 8.0

    Agent-ValueBench is the first dedicated benchmark for agent values, showing they diverge from LLM values, form a homogeneous 'Value Tide' across models, and bend under harnesses and skill steering.

  2. TEMA: Anchor the Image, Follow the Text for Multi-Modification Composed Image Retrieval

    cs.CV 2026-04 unverdicted novelty 7.0

    TEMA is the first framework for multi-modification composed image retrieval, using entity mapping to improve accuracy on both new complex datasets and existing benchmarks while balancing efficiency.

  3. Beyond Semantic Relevance: Counterfactual Risk Minimization for Robust Retrieval-Augmented Generation

    cs.CL 2026-05 unverdicted novelty 6.0

    CoRM-RAG uses a cognitive perturbation protocol to simulate biases and trains an Evidence Critic to retrieve documents that support correct decisions even under adversarial query changes.

  4. HABIT: Chrono-Synergia Robust Progressive Learning Framework for Composed Image Retrieval

    cs.CV 2026-04 unverdicted novelty 6.0

    HABIT improves robustness in composed image retrieval under noisy triplets by quantifying sample cleanliness via mutual information transition rates and applying dual-consistency progressive learning to retain good pa...

  5. Think in Latent Thoughts: A New Paradigm for Gloss-Free Sign Language Translation

    cs.CV 2026-04 unverdicted novelty 6.0

    A new SLT framework uses latent thoughts as a middle reasoning layer and plan-then-ground decoding to improve coherence and faithfulness in gloss-free sign language translation.

Reference graph

Works this paper leans on

76 extracted references · 76 canonical work pages · cited by 5 Pith papers · 4 internal anchors

  1. [1] Sketch-of-Thought: Efficient LLM Reasoning with Adaptive Cognitive-Inspired Sketching. CoRR, abs/2503.05179.

  2. [2] Evaluating Large Language Models Trained on Code. CoRR, abs/2107.03374.

  3. [3] Training Verifiers to Solve Math Word Problems. arXiv preprint arXiv:2110.14168.

  4. [4] DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. Preprint, arXiv:2501.12948.

  5. [5] C3oT: Generating Shorter Chain-of-Thought Without Compromising Effectiveness. In AAAI-25, Philadelphia, PA, USA, pages 24312–24320. AAAI Press.

  6. [6] Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. OpenReview.net.

  7. [7] Let's Verify Step by Step. Preprint, arXiv:2305.20050.

  8. [8] O1-Pruner: Length-Harmonizing Fine-Tuning for O1-Like Reasoning Pruning. ArXiv, abs/2501.12570, 2025.
