pith. machine review for the scientific record.

arxiv: 2604.02967 · v1 · submitted 2026-04-03 · 💻 cs.AI · cs.CL

Recognition: 2 theorem links

· Lean Theorem

FoE: Forest of Errors Makes the First Solution the Best in Large Reasoning Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 19:35 UTC · model grok-4.3

classification 💻 cs.AI cs.CL
keywords Large Reasoning Models · Forest of Errors · Test-time scaling · Reasoning efficiency · Error propagation · Self-guided inference
0 comments

The pith

In large reasoning models, the first solution is the best because errors form a growing forest that harms later paths.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large reasoning models explore multiple solutions at test time, yet the first one consistently outperforms the rest. The paper characterizes the errors that appear across these paths as a forest-structured pattern, the Forest of Errors, which grows as more computation is spent. This pattern directly challenges the assumption that generating additional solutions improves results. The authors therefore introduce a framework that strengthens the initial solution and discards the subsequent ones.

Core claim

The central claim is that FoE makes the First the Best: errors within reasoning paths form a forest structure and scale concurrently with test time, so alternative solutions become not merely suboptimal but actively detrimental to final performance. This observation is supported by empirical characterization across models and by theoretical analysis, and it motivates a self-guided method that refines the first path while discarding later ones.

What carries the argument

Forest of Errors (FoE), the forest-structured pattern of errors that accumulates in reasoning paths as test-time compute increases.

Load-bearing premise

Errors within the reasoning path scale concurrently with test time.
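A toy simulation, ours rather than the paper's, makes the premise concrete: if each step's error probability grows with the number of errors already accumulated in the shared context, later solutions inherit a larger forest and are correct less often. The functional form and all parameters below are illustrative assumptions.

```python
import random

def simulate(n_paths=4, steps=20, base_err=0.02, carry=0.5, trials=20000):
    """Toy model (not the paper's): each of n_paths sequential reasoning
    paths makes `steps` steps. A step is wrong with probability base_err
    scaled up by the errors already in the shared context (the growing
    'forest'); `carry` controls how strongly earlier errors contaminate
    later paths. Returns per-path accuracy estimates."""
    correct = [0] * n_paths
    for _ in range(trials):
        forest = 0  # errors accumulated in the context so far
        for p in range(n_paths):
            wrong = False
            for _ in range(steps):
                if random.random() < min(1.0, base_err * (1 + carry * forest)):
                    wrong = True
                    forest += 1
            if not wrong:
                correct[p] += 1
    return [c / trials for c in correct]

random.seed(0)
accs = simulate()
# Under this toy model, accuracy falls with path index:
# the first solution is (stochastically) the best.
```

Setting `carry=0` decouples the paths and the per-path accuracies flatten out, which is one way to see that the "first is best" pattern here is driven entirely by the contamination term.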

What would settle it

An experiment in which generating more alternative solutions reduces overall error rates and improves final accuracy would falsify the claim.
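One hedged way to operationalize that test: compare first-solution accuracy against majority voting over several solutions, under two illustrative regimes. The numbers below are ours, and the simplifying assumption that all wrong paths agree on a single wrong answer (the worst case for voting) is not from the paper.

```python
from itertools import product

def majority_acc(path_accs):
    """Probability that majority voting returns the correct answer, assuming
    independent paths where path i is correct with probability path_accs[i]
    and all wrong paths agree on one wrong answer; ties go to the first
    path's answer. An illustrative model, not the paper's analysis."""
    total = 0.0
    n = len(path_accs)
    for outcome in product([0, 1], repeat=n):  # 1 = path is correct
        p = 1.0
        for acc, ok in zip(path_accs, outcome):
            p *= acc if ok else (1 - acc)
        votes = sum(outcome)
        if votes * 2 > n or (votes * 2 == n and outcome[0] == 1):
            total += p
    return total

flat = [0.70, 0.70, 0.70]       # later solutions as good as the first
declining = [0.70, 0.50, 0.45]  # later solutions degraded, as FoE predicts
# With flat accuracies, voting beats the first solution (0.784 > 0.70);
# with declining accuracies, voting falls below it (0.575 < 0.70).
```

The toy calculation shows where the claim lives: whether extra samples help or hurt hinges entirely on whether later paths keep the first path's accuracy, which is exactly what the falsifying experiment would have to measure.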

Figures

Figures reproduced from arXiv: 2604.02967 by Guojie Song, Haonan Dong, Kehan Jiang, Zhaolu Kang, Zhengzhou Zhu.

Figure 1. (Upper) The First is The Best. (Lower) Forest of Errors.
Figure 2. (Left) Key observations within the FoE. (Right) Our proposed RED framework.
Figure 3. Manual correction on distinct error node types.
Figure 4. Distribution of various node types with respect …
Figure 5. Average distribution of correction types (True, …
Figure 6. Evaluation of intra-solution reflection metrics.
Figure 7. Test-time scalability under self-consistency.
Figure 8. Average sampling error rate (%, lower is better).
Figure 9. Distribution of absolute PCS disagreement …
Figure 10. Convergence (A, success). For a fixed probe prompt, the modal answer ratio (mode count / N) increases over checkpoints and exceeds the threshold P before the First answer becomes explicit; bars are green since the mode is correct.
Figure 12. Robustness (B, success). With M diverse probe prompts and one sample per prompt (N=1), the number of unique induced answers decreases over time and eventually reaches 1 (green), indicating cross-prompt agreement.
Figure 14. Robustness+Parallelism (B+C, success). Adding per-prompt parallel samples stabilizes prompt-wise modes; once modes align, the trigger is correct (green), and the minimum internal mode ratio rises.
Figure 15. Robustness+Parallelism (B+C, failure). Without enforcing an internal threshold, prompt-wise modes may align too early when internal mode ratios are still low, producing a wrong trigger (red).
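The convergence criterion described in Figure 10's caption can be sketched directly: at each checkpoint we have N probe answers, and the trigger fires once the modal answer ratio (mode count / N) exceeds the threshold P. Function and variable names here are ours, not the paper's.

```python
from collections import Counter

def modal_ratio_trigger(answers_per_checkpoint, P=0.60):
    """Sketch of the Figure 10 convergence check: scan checkpoints in order
    and fire at the first one whose modal answer ratio exceeds P.
    Returns (checkpoint_index, modal_answer), or (None, None) if the
    threshold is never crossed."""
    for t, answers in enumerate(answers_per_checkpoint):
        mode_ans, mode_count = Counter(answers).most_common(1)[0]
        if mode_count / len(answers) > P:
            return t, mode_ans
    return None, None

# Toy probe answers over three checkpoints: the mode ratio rises
# 0.25 -> 0.50 -> 0.75, so the trigger fires at the third checkpoint.
checkpoints = [["a", "b", "c", "d"], ["a", "a", "b", "c"], ["a", "a", "a", "b"]]
```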
Original abstract

Recent Large Reasoning Models (LRMs) like DeepSeek-R1 have demonstrated remarkable success in complex reasoning tasks, exhibiting human-like patterns in exploring multiple alternative solutions. Upon closer inspection, however, we uncover a surprising phenomenon: The First is The Best, where alternative solutions are not merely suboptimal but potentially detrimental. This observation challenges widely accepted test-time scaling laws, leading us to hypothesize that errors within the reasoning path scale concurrently with test time. Through comprehensive empirical analysis, we characterize errors as a forest-structured Forest of Errors (FoE) and conclude that FoE makes the First the Best, which is underpinned by rigorous theoretical analysis. Leveraging these insights, we propose RED, a self-guided efficient reasoning framework comprising two components: I) Refining First, which suppresses FoE growth in the first solution; and II) Discarding Subs, which prunes subsequent FoE via dual-consistency. Extensive experiments across five benchmarks and six backbone models demonstrate that RED outperforms eight competitive baselines, achieving performance gains of up to 19.0% while reducing token consumption by 37.7% ~ 70.4%. Moreover, comparative experiments on FoE metrics shed light on how RED achieves effectiveness.
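The abstract's "Discarding Subs" component can be sketched as a pruning rule; everything below (the step representation, the overlap metric, and the threshold `tau`) is our assumption, since the paper's dual-consistency definition is not given here.

```python
def discard_subs(solutions, answer_of, steps_of, tau=0.8):
    """Hedged sketch of the 'Discarding Subs' idea: keep the first solution,
    and retain a subsequent one only if it is dual-consistent with the
    first, i.e. it agrees on the final answer AND a fraction >= tau of its
    steps overlap with the first solution's steps. The real criterion may
    differ; this only illustrates an answer-level + step-level check."""
    kept = [solutions[0]]
    first_ans = answer_of(solutions[0])
    first_steps = set(steps_of(solutions[0]))
    for sol in solutions[1:]:
        answer_consistent = answer_of(sol) == first_ans
        steps = set(steps_of(sol))
        step_overlap = len(steps & first_steps) / max(1, len(steps))
        if answer_consistent and step_overlap >= tau:
            kept.append(sol)
    return kept

# Toy solutions as (answer, steps) pairs: the second agrees with the first
# on both levels, the third disagrees on the answer, the fourth shares no steps.
sols = [("42", ("a", "b", "c")), ("42", ("a", "b", "c")),
        ("41", ("a", "b", "c")), ("42", ("x", "y", "z"))]
kept = discard_subs(sols, answer_of=lambda s: s[0], steps_of=lambda s: s[1])
```

Requiring both levels is what distinguishes this from plain self-consistency voting: an alternative path that reaches the same answer through disjoint (possibly error-laden) steps is still pruned.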

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that large reasoning models exhibit a 'The First is The Best' phenomenon because errors form a Forest of Errors (FoE) that scales concurrently with test-time compute and additional reasoning paths, rendering later solutions detrimental rather than merely suboptimal. It supports this via empirical analysis across five benchmarks and six models, proposes the RED framework (Refining First to suppress FoE growth, plus Discarding Subs via dual-consistency pruning), and reports gains of up to 19.0% with 37.7–70.4% token reduction over eight baselines.

Significance. If the error-scaling mechanism can be isolated from artifacts and the theoretical account formalized, the work would meaningfully challenge test-time scaling assumptions in LRMs and supply a practical efficiency recipe. The breadth of experiments (multiple backbones and benchmarks) plus the introduction of FoE metrics for comparative analysis are concrete strengths that could influence follow-on work on reasoning efficiency.

major comments (3)
  1. [Abstract / §3] Abstract and §3 (theoretical analysis): the central claim that 'errors within the reasoning path scale concurrently with test time' is asserted as underpinned by rigorous theory, yet no equations, formal model of per-step error accumulation, or derivation showing why later paths become net detrimental appear in the text; without this the hypothesis remains observational.
  2. [§4] §4 (empirical characterization of FoE): no explicit statistics (per-step error probability, inter-path divergence, or forest-depth histograms) are reported that isolate concurrent error growth from prompt bias, early stopping, or benchmark saturation; the superiority of the first solution could therefore be explained by alternative mechanisms.
  3. [§5.2] §5.2 (Discarding Subs): the dual-consistency criterion is described at a high level but lacks a precise definition of consistency metric and an ablation confirming it does not presuppose the first solution is correct, which is load-bearing for the claim that subsequent solutions are actively detrimental.
minor comments (2)
  1. [Figures / §4.3] Figure captions and §4.3 should explicitly define the FoE metrics used in the comparative experiments so readers can reproduce the reported gains.
  2. [Abstract] The abstract states 'comprehensive empirical analysis' but provides no table of per-benchmark error rates or controls; adding these would strengthen verifiability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and have revised the manuscript to strengthen the theoretical formalization, empirical isolation of mechanisms, and precise definitions with supporting ablations.

Point-by-point responses
  1. Referee: [Abstract / §3] Abstract and §3 (theoretical analysis): the central claim that 'errors within the reasoning path scale concurrently with test time' is asserted as underpinned by rigorous theory, yet no equations, formal model of per-step error accumulation, or derivation showing why later paths become net detrimental appear in the text; without this the hypothesis remains observational.

    Authors: We appreciate this observation. While §3 presented a conceptual framework and empirical characterization of the Forest of Errors, we acknowledge that it lacked formal equations or derivations. In the revised manuscript, we have added a mathematical model in §3, including equations for per-step error probability accumulation and a derivation showing why additional paths become net detrimental under concurrent FoE scaling. revision: yes

  2. Referee: [§4] §4 (empirical characterization of FoE): no explicit statistics (per-step error probability, inter-path divergence, or forest-depth histograms) are reported that isolate concurrent error growth from prompt bias, early stopping, or benchmark saturation; the superiority of the first solution could therefore be explained by alternative mechanisms.

    Authors: We thank the referee for highlighting the need for stronger isolation. The original §4 reported aggregate FoE metrics, but the revised version now includes explicit per-step error probabilities, inter-path divergence measures, and forest-depth histograms. We have also added controlled experiments and ablations addressing prompt bias, early stopping, and benchmark saturation to rule out alternative explanations. revision: yes

  3. Referee: [§5.2] §5.2 (Discarding Subs): the dual-consistency criterion is described at a high level but lacks a precise definition of consistency metric and an ablation confirming it does not presuppose the first solution is correct, which is load-bearing for the claim that subsequent solutions are actively detrimental.

    Authors: We agree that greater precision is required. The revised §5.2 now provides the exact mathematical definition of the dual-consistency metric (joint answer-level and step-level consistency). We have added an ablation study evaluating pruning performance on instances where the first solution is incorrect, confirming that the method does not presuppose its correctness. revision: yes

Circularity Check

0 steps flagged

No circularity: observation-to-characterization chain remains independent of fitted inputs

full rationale

The paper begins from an empirical observation ('The First is The Best') on LRMs, forms a hypothesis about concurrent error scaling with test-time compute, introduces FoE as a descriptive characterization of that scaling, and then validates a downstream framework (RED) through benchmark experiments. No equation or derivation reduces the central claim to a parameter fit by construction, nor does any load-bearing step rely on a self-citation whose content is itself unverified within the paper. The theoretical analysis is presented as supporting the characterization rather than being definitionally equivalent to the input observation. External benchmarks and comparative FoE metrics provide falsifiable content outside the fitted values.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the unverified hypothesis that errors scale with test time and form a forest structure; the abstract details no free parameters, and no invented entity is given independent evidence.

axioms (1)
  • domain assumption errors within the reasoning path scale concurrently with test time
    Explicitly stated as the basis for hypothesizing FoE structure.
invented entities (1)
  • Forest of Errors (FoE) no independent evidence
    purpose: characterize the structure of errors that makes the first solution best
    New term introduced to explain the empirical pattern; no independent falsifiable handle provided in abstract.

pith-pipeline@v0.9.0 · 5521 in / 1293 out tokens · 41026 ms · 2026-05-13T19:35:20.304403+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Agent-ValueBench: A Comprehensive Benchmark for Evaluating Agent Values

    cs.AI 2026-05 unverdicted novelty 8.0

    Agent-ValueBench is the first dedicated benchmark for agent values, showing they diverge from LLM values, form a homogeneous 'Value Tide' across models, and bend under harnesses and skill steering.

  2. TEMA: Anchor the Image, Follow the Text for Multi-Modification Composed Image Retrieval

    cs.CV 2026-04 unverdicted novelty 7.0

    TEMA is the first framework for multi-modification composed image retrieval, using entity mapping to improve accuracy on both new complex datasets and existing benchmarks while balancing efficiency.

  3. Beyond Semantic Relevance: Counterfactual Risk Minimization for Robust Retrieval-Augmented Generation

    cs.CL 2026-05 unverdicted novelty 6.0

    CoRM-RAG uses a cognitive perturbation protocol to simulate biases and trains an Evidence Critic to retrieve documents that support correct decisions even under adversarial query changes.

  4. HABIT: Chrono-Synergia Robust Progressive Learning Framework for Composed Image Retrieval

    cs.CV 2026-04 unverdicted novelty 6.0

    HABIT improves robustness in composed image retrieval under noisy triplets by quantifying sample cleanliness via mutual information transition rates and applying dual-consistency progressive learning to retain good pa...

  5. Think in Latent Thoughts: A New Paradigm for Gloss-Free Sign Language Translation

    cs.CV 2026-04 unverdicted novelty 6.0

    A new SLT framework uses latent thoughts as a middle reasoning layer and plan-then-ground decoding to improve coherence and faithfulness in gloss-free sign language translation.

Reference graph

Works this paper leans on

76 extracted references · 76 canonical work pages · cited by 5 Pith papers · 4 internal anchors

  1. [1] Sketch-of-Thought: Efficient LLM Reasoning with Adaptive Cognitive-Inspired Sketching. CoRR, abs/2503.05179.

  2. [2] Evaluating Large Language Models Trained on Code. CoRR, abs/2107.03374.

  3. [3] Training Verifiers to Solve Math Word Problems. arXiv preprint arXiv:2110.14168.

  4. [4] DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. Preprint, arXiv:2501.12948.

  5. [5] C3oT: Generating Shorter Chain-of-Thought Without Compromising Effectiveness. In AAAI-25, Philadelphia, PA, USA, pages 24312–24320. AAAI Press.

  6. [6] Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. OpenReview.net.

  7. [7] Let's Verify Step by Step. Preprint, arXiv:2305.20050.

  8. [8] O1-Pruner: Length-Harmonizing Fine-Tuning for O1-Like Reasoning Pruning. ArXiv, abs/2501.12570, 2025.
