arxiv: 2308.04371 · v11 · pith:PYNFIMCHnew · submitted 2023-08-08 · 💻 cs.AI

Cumulative Reasoning with Large Language Models

Pith reviewed 2026-05-24 07:49 UTC · model grok-4.3

classification 💻 cs.AI
keywords cumulative reasoning · large language models · directed acyclic graph · logical inference · mathematical problem solving · step verification · proposer verifier reporter

The pith

Cumulative Reasoning improves LLM problem solving by building a graph of verified intermediate propositions through proposer, verifier, and reporter roles.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Cumulative Reasoning as a framework that assigns LLMs to proposer, verifier, and reporter roles to decompose problems, validate each step, and accumulate them into a dynamic directed acyclic graph. This structure is tested on logical inference, puzzle solving, and mathematics benchmarks, where it reports higher final accuracy than prior prompting methods. A sympathetic reader would care because current LLMs often fail on multi-step tasks due to unchecked errors accumulating across reasoning chains.

Core claim

By orchestrating LLMs to propose propositions, verify them for correctness and consistency, and report a solution composed from the verified set, Cumulative Reasoning constructs a dynamic DAG that substantially raises accuracy on complex reasoning tasks.

What carries the argument

Dynamic Directed Acyclic Graph (DAG) of verified propositions, assembled by proposers generating candidate steps, verifiers checking each one, and reporters selecting a consistent solution path.
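The propose, verify, report loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: `PropositionDAG`, `cumulative_reasoning`, and the three toy callables are hypothetical names, and the LLM calls are replaced by deterministic stand-ins so the control flow can be run as-is.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, Optional, Tuple

# (statement, premises it was derived from); a hypothetical candidate format.
Candidate = Tuple[str, Tuple[str, ...]]


@dataclass
class PropositionDAG:
    """Maps each accepted proposition to the earlier nodes it depends on."""
    parents: Dict[str, Tuple[str, ...]] = field(default_factory=dict)

    def add(self, statement: str, premises: Iterable[str] = ()) -> None:
        premises = tuple(premises)
        # Edges may only point at nodes that already exist, so no cycle can form.
        missing = [p for p in premises if p not in self.parents]
        if missing:
            raise ValueError(f"unknown premises: {missing}")
        self.parents.setdefault(statement, premises)


def cumulative_reasoning(
    premises: Iterable[str],
    propose: Callable[[PropositionDAG], Optional[Candidate]],
    verify: Callable[[str, Tuple[str, ...], PropositionDAG], bool],
    report: Callable[[PropositionDAG], Optional[str]],
    max_steps: int = 8,
) -> Tuple[Optional[str], PropositionDAG]:
    dag = PropositionDAG()
    for p in premises:
        dag.add(p)
    for _ in range(max_steps):
        candidate = propose(dag)
        if candidate is None:
            break
        statement, used = candidate
        if verify(statement, used, dag):  # only verified steps enter the graph
            dag.add(statement, used)
        answer = report(dag)
        if answer is not None:
            return answer, dag
    return report(dag), dag


# Toy stand-ins for the three roles (assumptions, not LLM prompts).
RULES = {(("A", "B"), "C"), (("C",), "D")}
_queue = iter([("C", ("A", "B")), ("X", ("A",)), ("D", ("C",))])


def toy_propose(dag: PropositionDAG) -> Optional[Candidate]:
    return next(_queue, None)


def toy_verify(statement: str, used: Tuple[str, ...], dag: PropositionDAG) -> bool:
    return all(u in dag.parents for u in used) and (used, statement) in RULES


def toy_report(dag: PropositionDAG) -> Optional[str]:
    return "True" if "D" in dag.parents else None


answer, dag = cumulative_reasoning(["A", "B"], toy_propose, toy_verify, toy_report)
```

In this toy run the unsupported candidate `X` is rejected by the verifier and never enters the graph, while `C` and `D` are accepted with their derivation edges recorded. In a real system each callable would wrap a model call, and the verifier's reliability becomes the critical quantity.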

If this is right

  • On the curated FOLIO wiki dataset, accuracy reaches 98.04 percent; across logical inference tasks the gain over previous methods is up to 9.3 percent.
  • On the Game of 24 puzzle, accuracy reaches 98 percent, a 24 percent absolute gain over prior approaches.
  • On the MATH dataset, overall accuracy rises 4.2 percent, with a 43 percent relative gain on the hardest level-5 problems.
  • When a code interpreter is added, the method outperforms Program of Thought by 38.8 percent.
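For the Game of 24 result in the list above, the kind of check a verifier has to get right can be made concrete with a rule-based stand-in. The sketch below is an assumption for illustration only (the function name `is_valid_24` is hypothetical and this is not the paper's LLM verifier): it accepts a final expression only if it uses each given number exactly once with `+`, `-`, `*`, `/` and evaluates to exactly 24.

```python
import ast
import operator
from fractions import Fraction
from typing import List, Sequence

_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def _eval(node: ast.AST, used: List[int]) -> Fraction:
    """Evaluate a restricted arithmetic AST exactly, recording each literal."""
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        left = _eval(node.left, used)
        right = _eval(node.right, used)
        return _OPS[type(node.op)](left, right)
    if (isinstance(node, ast.Constant)
            and isinstance(node.value, int)
            and not isinstance(node.value, bool)):
        used.append(node.value)
        return Fraction(node.value)
    raise ValueError("unsupported expression")


def is_valid_24(expression: str, numbers: Sequence[int]) -> bool:
    used: List[int] = []
    try:
        value = _eval(ast.parse(expression, mode="eval").body, used)
    except (ValueError, SyntaxError, ZeroDivisionError):
        return False
    # Every given number must appear exactly once, and the result must be 24.
    return sorted(used) == sorted(numbers) and value == 24
```

Exact `Fraction` arithmetic avoids floating-point error in cases such as `8 / (3 - 8 / 3)`, which equals 24 only when evaluated exactly.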

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same verifier-proposer loop could be inserted into scientific reasoning pipelines to flag inconsistent hypotheses before they reach a final answer.
  • If verifiers are drawn from a different model family than proposers, the assumption of unbiased error detection might become easier to defend.
  • The DAG structure naturally supports backtracking over rejected branches, which could be exposed as an explicit search parameter in future extensions.

Load-bearing premise

Verifier instances can reliably detect errors or contradictions in propositions generated by the same model family without systematic bias.

What would settle it

A test set of problems where subtle logical contradictions are deliberately inserted into intermediate steps, then measuring whether verifier accuracy falls enough to erase the reported gains over baselines.
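One way to score such a test is to treat each intermediate step as a labeled example and compute standard detection metrics for the verifier. The following is a minimal sketch under assumptions: the function name and the example labels are hypothetical and are not data from the paper.

```python
from typing import Dict, Sequence


def verifier_detection_metrics(
    has_injected_error: Sequence[bool],
    verifier_rejected: Sequence[bool],
) -> Dict[str, float]:
    """Score a verifier on steps where some errors were deliberately inserted."""
    if len(has_injected_error) != len(verifier_rejected):
        raise ValueError("inputs must have the same length")
    pairs = list(zip(has_injected_error, verifier_rejected))
    tp = sum(1 for bad, rej in pairs if bad and rej)          # error caught
    fp = sum(1 for bad, rej in pairs if not bad and rej)      # valid step rejected
    fn = sum(1 for bad, rej in pairs if bad and not rej)      # error missed
    tn = sum(1 for bad, rej in pairs if not bad and not rej)  # valid step accepted
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "false_positive_rate": fp / (fp + tn) if fp + tn else 0.0,
    }


# Hypothetical labels for eight steps, three of which carry an injected error.
injected = [True, True, True, False, False, False, False, False]
rejected = [True, True, False, True, False, False, False, False]
metrics = verifier_detection_metrics(injected, rejected)
```

Low recall on injected contradictions would indicate that invalid propositions are entering the graph, which is the failure mode the load-bearing premise rules out.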

Original abstract

Recent advancements in large language models (LLMs) have shown remarkable progress, yet their ability to solve complex problems remains limited. In this work, we introduce Cumulative Reasoning (CR), a structured framework that enhances LLM problem-solving by emulating human-like iterative and cumulative thought processes. CR orchestrates LLMs in three distinct roles: Proposer, Verifier(s), and Reporter, to systematically decompose tasks, generate and validate intermediate reasoning steps, and compose them into a solution by building a dynamic Directed Acyclic Graph (DAG) of verified propositions. This approach substantially enhances problem-solving capabilities. We demonstrate CR's advantage through several complex reasoning tasks: it outperforms existing methods in logical inference tasks with up to a 9.3% improvement, achieving 98.04% accuracy on the curated FOLIO wiki dataset. In the Game of 24, it achieves 98% accuracy, marking a 24% improvement over previous methods. In solving MATH problems, CR achieves a 4.2% increase from previous methods and a 43% relative improvement in the most challenging level 5 problems. When incorporating a code environment with CR, we further harness LLMs' reasoning capabilities and outperform the Program of Thought (PoT) method by 38.8%.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript introduces Cumulative Reasoning (CR), a framework that assigns LLMs to Proposer, Verifier(s), and Reporter roles to generate, validate, and accumulate propositions into a dynamic DAG for complex reasoning. It reports large gains over baselines: 98.04% accuracy on FOLIO, 98% on Game of 24 (24% absolute improvement), 4.2% absolute (43% relative on level-5) on MATH, and 38.8% over Program of Thought when code is added.

Significance. If the reported gains are shown to be robust, CR supplies a concrete, role-separated mechanism for iterative verification that could be adopted in other multi-step reasoning pipelines. The explicit construction of a verified DAG is a clear methodological contribution that distinguishes it from single-pass or self-consistency baselines.

major comments (3)
  1. [Abstract] The headline accuracies (98.04% FOLIO, 98% Game of 24, 4.2% MATH) are stated without any information on number of runs, standard deviation, prompt templates, model versions, or data-leakage controls, rendering the numerical claims impossible to evaluate.
  2. [Framework description (roles and DAG construction)] The entire performance advantage rests on the Verifier(s) correctly rejecting invalid propositions produced by the Proposer (same model family), yet no precision, recall, false-positive rate, or human-agreement statistics for the verifier are supplied; without this measurement the DAG cannot be shown to improve soundness rather than merely propagate the proposer's blind spots.
  3. [Experiments (Game of 24 and MATH results)] The 24% absolute and 43% relative gains are presented as direct comparisons, but no ablation isolating the verifier component or measuring disagreement between proposer and verifier outputs is reported, so it is unclear whether the cumulative DAG, rather than prompt engineering or sampling, drives the improvement.
minor comments (1)
  1. [Abstract] The abstract mentions 'up to a 9.3% improvement' on logical inference without naming the exact baseline method or dataset split.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will incorporate revisions to improve clarity and rigor.

Point-by-point responses
  1. Referee: [Abstract] The headline accuracies (98.04% FOLIO, 98% Game of 24, 4.2% MATH) are stated without any information on number of runs, standard deviation, prompt templates, model versions, or data-leakage controls, rendering the numerical claims impossible to evaluate.

    Authors: We agree that the abstract would benefit from additional context on the experimental conditions. In the revised manuscript we will add a brief statement or footnote specifying the number of runs, standard deviations (where computed), prompt templates, model versions, and data-leakage controls.
    Revision: yes

  2. Referee: [Framework description (roles and DAG construction)] The entire performance advantage rests on the Verifier(s) correctly rejecting invalid propositions produced by the Proposer (same model family), yet no precision, recall, false-positive rate, or human-agreement statistics for the verifier are supplied; without this measurement the DAG cannot be shown to improve soundness rather than merely propagate the proposer's blind spots.

    Authors: The referee correctly notes that direct evaluation of the verifier is important for establishing that the DAG improves soundness. While end-to-end gains provide indirect evidence, we did not report verifier-specific metrics. We will add a dedicated analysis subsection that measures verifier precision, recall, and human agreement on sampled propositions from each benchmark.
    Revision: yes

  3. Referee: [Experiments (Game of 24 and MATH results)] The 24% absolute and 43% relative gains are presented as direct comparisons, but no ablation isolating the verifier component or measuring disagreement between proposer and verifier outputs is reported, so it is unclear whether the cumulative DAG, rather than prompt engineering or sampling, drives the improvement.

    Authors: We recognize that isolating the verifier's contribution would strengthen the experimental claims. We will include new ablation studies that remove or weaken the verifier and report disagreement rates between proposer and verifier outputs to demonstrate that the cumulative DAG structure, rather than sampling alone, accounts for the observed gains.
    Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical framework evaluated on external benchmarks

The paper introduces Cumulative Reasoning as an empirical orchestration of LLM roles (Proposer, Verifier, Reporter) to build a DAG of propositions, with performance measured directly on public benchmarks (FOLIO, Game of 24, MATH). No equations, fitted parameters, self-citations as load-bearing premises, or ansatzes are described that would reduce any claimed result to its own inputs by construction. The method's soundness assumptions (e.g., verifier reliability) are testable against external ground truth and do not create definitional equivalence between inputs and outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 2 invented entities

The framework rests on the unstated premise that separate LLM calls can be treated as independent verifiers and that the DAG construction process itself does not introduce selection bias.

invented entities (2)
  • Proposer, Verifier(s), Reporter roles (no independent evidence)
    purpose: Decompose task, validate steps, and compose final solution
    These are functional assignments given to LLM instances; no independent evidence is supplied that they behave as distinct agents.
  • Dynamic DAG of verified propositions (no independent evidence)
    purpose: Accumulate only correct intermediate results
    The DAG is a new data structure introduced by the method.

pith-pipeline@v0.9.0 · 5752 in / 1116 out tokens · 40033 ms · 2026-05-24T07:49:22.238883+00:00 · methodology

Forward citations

Cited by 8 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. From Static Analysis to Audience Dissemination: A Training-Free Multimodal Controversy Detection Multi-Agent Framework

    cs.LG 2026-05 unverdicted novelty 7.0

    AuDisAgent reformulates multimodal controversy detection as a dynamic audience dissemination process using screening, panel discussion, and arbitration agents, plus comment bootstrapping, and reports outperforming pri...

  2. Understanding and Mitigating Spurious Signal Amplification in Test-Time Reinforcement Learning for Math Reasoning

    cs.LG 2026-04 unverdicted novelty 6.0

    DDRL reduces spurious reward noise in test-time RL for math by excluding ambiguous samples, using fixed advantages, and adding consensus-based updates, outperforming prior TTRL methods on math benchmarks.

  3. Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations

    cs.AI 2023-12 conditional novelty 6.0

    Math-Shepherd is an automatically trained process reward model that scores solution steps to verify and reinforce LLMs, lifting Mistral-7B from 77.9% to 89.1% on GSM8K and 28.6% to 43.5% on MATH.

  4. Adapt to Thrive! Adaptive Power-Mean Policy Optimization for Improved LLM Reasoning

    cs.CL 2026-04 unverdicted novelty 5.0

    APMPO boosts average Pass@1 scores on math reasoning benchmarks by 3 points over GRPO by using an adaptive power-mean policy objective and feedback-driven clipping bounds in RLVR training.

  5. Free Energy-Driven Reinforcement Learning with Adaptive Advantage Shaping for Unsupervised Reasoning in LLMs

    cs.CL 2026-04 unverdicted novelty 5.0

    FREIA applies free energy principles and adaptive advantage shaping to unsupervised RL, outperforming baselines by 0.5-3.5 Pass@1 points on math reasoning with a 1.5B model.

  6. On the Diagram of Thought

    cs.CL 2024-09 unverdicted novelty 5.0

    Diagram of Thought (DoT) is a controller-light framework in which an LLM builds typed reasoning diagrams validated online and interpreted as diagrams in a slice topos whose synthesis is a finite limit.

  7. R1-Onevision: Advancing Generalized Multimodal Reasoning through Cross-Modal Formalization

    cs.CV 2025-03 unverdicted novelty 4.0

    R1-Onevision turns images into structured text for multimodal reasoning, trains on a custom dataset with RL, and claims SOTA results on an educational benchmark.

  8. Multi-Agent Collaboration Mechanisms: A Survey of LLMs

    cs.AI 2025-01 unverdicted novelty 4.0

    The survey organizes LLM-based multi-agent collaboration mechanisms into a framework with dimensions of actors, types, structures, strategies, and coordination protocols, reviews applications across domains, and ident...
