Abductive Reasoning with Probabilistic Commonsense
Pith reviewed 2026-05-11 03:00 UTC · model grok-4.3
The pith
By sampling multiple possible commonsense proofs from a language model and aggregating their conclusions, a new algorithm determines what most people would likely judge as true or false more accurately than methods assuming fixed facts.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
PACS samples multiple proofs by prompting an LLM to supply commonsense assumptions and using a formal solver to validate each one, treats every valid sample as an observation of one possible individual's distinct belief set, and aggregates the conclusions across samples to estimate whether most people would accept a given statement as true or false.
What carries the argument
PACS, the algorithm that samples LLM-generated proofs as observations of varied commonsense beliefs and aggregates their conclusions to approximate majority human judgment.
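The sample-then-aggregate loop described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: `propose` stands in for the LLM prompt that supplies commonsense assumptions, `check` stands in for the formal solver, and both names, their signatures, the tie handling, and the toy stubs at the bottom are assumptions made for this example.

```python
import random
from typing import Callable, List, Optional, Sequence, Tuple

# Hypothetical interfaces (not from the paper):
#   propose(statement, rng) -> commonsense assumptions for one sampled "individual"
#   check(premises, assumptions, statement) -> True (proved), False (refuted),
#                                              None (no valid proof for this sample)
Propose = Callable[[str, random.Random], Sequence[str]]
Check = Callable[[Sequence[str], Sequence[str], str], Optional[bool]]


def pacs_style_vote(
    premises: Sequence[str],
    statement: str,
    propose: Propose,
    check: Check,
    num_samples: int = 20,
    seed: int = 0,
) -> Tuple[Optional[bool], float, int]:
    """Return (majority verdict, fraction judged true, number of valid samples)."""
    rng = random.Random(seed)
    verdicts: List[bool] = []
    for _ in range(num_samples):
        assumptions = propose(statement, rng)
        verdict = check(premises, assumptions, statement)
        if verdict is not None:  # keep only samples the solver validates
            verdicts.append(verdict)
    if not verdicts:
        return None, 0.0, 0
    fraction_true = sum(verdicts) / len(verdicts)
    if fraction_true == 0.5:  # assumption: report ties as undecided
        return None, fraction_true, len(verdicts)
    return fraction_true > 0.5, fraction_true, len(verdicts)


if __name__ == "__main__":
    # Toy stubs that simulate belief variation; real use would call an LLM and a solver.
    def toy_propose(statement: str, rng: random.Random) -> Sequence[str]:
        return ["birds can fly"] if rng.random() < 0.7 else ["this bird cannot fly"]

    def toy_check(premises, assumptions, statement) -> Optional[bool]:
        return "birds can fly" in assumptions

    print(pacs_style_vote(["Tweety is a bird"], "Tweety can fly", toy_propose, toy_check))
```

Discarding samples for which the solver finds no valid proof mirrors the description that only valid samples count as observations; how the paper treats ties or an empty set of valid samples is not stated here, so both choices above are placeholders.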
If this is right
- PACS achieves higher performance than chain-of-thought reasoning on the tested benchmarks.
- It outperforms prior neurosymbolic methods that supply fixed commonsense assumptions.
- It also beats search-based approaches by explicitly modeling variation rather than seeking a single solution.
- The method can be applied across multiple benchmarks without requiring new human annotations for each commonsense fact.
Where Pith is reading between the lines
- The sampling approach could be adjusted to target specific demographic groups instead of a generic majority if human data from those groups were used to guide prompt variation.
- Similar aggregation over multiple LLM outputs might apply to other subjective tasks such as preference modeling or ethical judgment where single answers are unreliable.
- The framework suggests that the cost of additional samples trades off against accuracy in approximating human belief distributions, opening a path to efficiency studies.
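The cost-versus-accuracy trade-off in the last point can be made concrete with standard sampling bounds. The sketch below assumes the sampled verdicts are independent and identically distributed, which LLM samples may not be, and it bounds only the error in estimating the model's own acceptance rate, not the gap between that rate and human belief. The function names are illustrative.

```python
import math


def hoeffding_samples(epsilon: float, delta: float) -> int:
    """Smallest K such that 2 * exp(-2 * K * epsilon**2) <= delta.

    By Hoeffding's inequality, K i.i.d. true/false verdicts then give an
    estimated acceptance rate within epsilon of the true rate with
    probability at least 1 - delta.
    """
    return math.ceil(math.log(2.0 / delta) / (2.0 * epsilon ** 2))


def standard_error(p: float, k: int) -> float:
    """Standard error of a proportion p estimated from k i.i.d. samples."""
    return math.sqrt(p * (1.0 - p) / k)


if __name__ == "__main__":
    for eps in (0.2, 0.1, 0.05):
        print(eps, hoeffding_samples(eps, delta=0.05))
```

Halving the tolerance `epsilon` roughly quadruples the number of samples, which is the shape of the trade-off an efficiency study would need to examine.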
Load-bearing premise
Repeated sampling from the language model produces a distribution of proofs that approximates how human commonsense beliefs actually differ, so that the aggregated outcome matches what most people would judge true or false.
What would settle it
A large-scale human survey in which people rate the same reasoning conclusions as true or false would settle it. If PACS majority votes matched the human majorities no better than chain-of-thought or fixed-assumption neurosymbolic baselines, the central claim would be falsified.
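The comparison this test calls for reduces to an agreement rate against human majority labels. The sketch below is illustrative only: the function names are hypothetical, and the labels in the usage lines are toy values invented for the example, not survey data or results from the paper.

```python
from typing import Sequence


def agreement_rate(predictions: Sequence[bool], human_majority: Sequence[bool]) -> float:
    """Fraction of items where a method's verdict equals the human majority verdict."""
    if len(predictions) != len(human_majority):
        raise ValueError("predictions and human labels must have the same length")
    if not predictions:
        raise ValueError("need at least one item")
    matches = sum(p == h for p, h in zip(predictions, human_majority))
    return matches / len(predictions)


def agreement_gap(
    method: Sequence[bool], baseline: Sequence[bool], human_majority: Sequence[bool]
) -> float:
    """Positive when the method agrees with human majorities more often than the baseline."""
    return agreement_rate(method, human_majority) - agreement_rate(baseline, human_majority)


if __name__ == "__main__":
    # Toy labels for illustration only.
    human = [True, False, True, True]
    method_votes = [True, False, True, False]
    baseline_votes = [True, True, False, False]
    print(agreement_gap(method_votes, baseline_votes, human))
```

A real study would also need a significance test on the gap, for example a paired bootstrap over items, before drawing a conclusion in either direction.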
Original abstract
Recent efforts to improve the reasoning abilities of Large Language Models (LLMs) have focused on integrating formal logic solvers within neurosymbolic frameworks. A key challenge is that formal solvers lack commonsense world knowledge, preventing them from making reasoning steps that humans find obvious. Prior methods address this by using LLMs to supply missing commonsense assumptions, but these approaches implicitly assume universal agreement on such commonsense facts. In reality, commonsense beliefs vary across individuals. We propose a probabilistic framework for abductive commonsense reasoning that explicitly models this variation, aiming to determine whether most people would judge a statement as true or false. We introduce Probabilistic Abductive CommonSense (PACS), a novel algorithm that uses an LLM and a formal solver to sample proofs as observations of individuals' distinct commonsense beliefs, and aggregates conclusions across these samples. Empirically, PACS outperforms chain-of-thought reasoning, prior neurosymbolic methods, and search-based approaches across multiple benchmarks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Probabilistic Abductive CommonSense (PACS), a neurosymbolic algorithm for abductive reasoning that models variation in commonsense beliefs. It uses an LLM to sample multiple proofs (treated as observations from distinct individuals' belief distributions), applies a formal solver to derive conclusions from each, and aggregates via majority vote to determine whether most people would judge a statement true or false. The paper claims this outperforms chain-of-thought reasoning, prior neurosymbolic methods, and search-based approaches across multiple benchmarks.
Significance. If the empirical results hold under proper controls and the sampling procedure can be shown to approximate human belief variation, PACS would address a genuine limitation in existing neurosymbolic systems that assume universal commonsense agreement. The probabilistic aggregation idea is a clear conceptual advance over deterministic assumption-injection methods. However, the absence of human calibration data means the practical significance remains provisional; the work is more a promising algorithmic proposal than a fully validated framework.
major comments (2)
- Abstract: The claim of empirical outperformance over CoT, neurosymbolic, and search-based methods is stated without any quantitative results, error bars, benchmark names, dataset sizes, or ablation details. This makes it impossible to assess whether gains survive controls for prompt engineering, solver choice, or sampling temperature; the central empirical claim therefore cannot be evaluated from the provided information.
- Method section (description of PACS algorithm): The framework treats repeated LLM-generated proof samples as draws from a distribution of human commonsense beliefs and uses majority vote to recover the modal judgment. No human calibration experiments, correlation with psychometric data on commonsense variation, or ablation comparing majority vote to single-sample or temperature-0 baselines are reported. This assumption is load-bearing for the probabilistic interpretation and the novelty claim relative to prior work that assumes universal agreement.
minor comments (1)
- The paper would benefit from an explicit formal definition (e.g., as a probability distribution over possible worlds or belief sets) early in the method section to clarify how the LLM samples are aggregated.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback. We address each major comment point by point below, indicating where revisions will be made to the manuscript.
Point-by-point responses
- Referee: Abstract: The claim of empirical outperformance over CoT, neurosymbolic, and search-based methods is stated without any quantitative results, error bars, benchmark names, dataset sizes, or ablation details. This makes it impossible to assess whether gains survive controls for prompt engineering, solver choice, or sampling temperature; the central empirical claim therefore cannot be evaluated from the provided information.
  Authors: We agree that the abstract would benefit from greater specificity. In the revised version, we will include concrete quantitative results (accuracy figures with error bars on the primary benchmarks), dataset sizes, and references to the key ablations (including controls for sampling temperature and solver variants). This will make the empirical claims directly evaluable while preserving the abstract's brevity.
  Revision: yes
- Referee: Method section (description of PACS algorithm): The framework treats repeated LLM-generated proof samples as draws from a distribution of human commonsense beliefs and uses majority vote to recover the modal judgment. No human calibration experiments, correlation with psychometric data on commonsense variation, or ablation comparing majority vote to single-sample or temperature-0 baselines are reported. This assumption is load-bearing for the probabilistic interpretation and the novelty claim relative to prior work that assumes universal agreement.
  Authors: We acknowledge that the manuscript does not contain human calibration experiments or psychometric correlations validating that LLM samples approximate human belief distributions; this remains an assumption underlying the probabilistic framing. We do, however, include ablations of majority aggregation versus single-sample inference. We will revise the method and discussion sections to state the modeling assumption more explicitly, add the requested temperature-0 baseline comparison, and insert a limitations paragraph highlighting the need for future human validation studies. These changes will clarify the distinction from deterministic neurosymbolic baselines.
  Revision: partial
- Absence of human calibration experiments or psychometric data to empirically support the assumption that LLM-generated proof samples approximate variation in human commonsense beliefs.
Circularity Check
No circularity: algorithmic sampling and aggregation method is self-contained
Full rationale
The paper presents PACS as an algorithmic procedure that invokes an external LLM to generate proof samples (treated as observations of individual belief distributions) and a formal solver to evaluate them, followed by majority-vote aggregation. No equations, parameters, or derivations are defined in terms of the target output; the method does not fit any quantity to a subset of its own results and then relabel that quantity as a prediction. No load-bearing self-citations or uniqueness theorems imported from the authors' prior work appear in the provided text. The central claim therefore rests on the external behavior of the LLM and solver rather than on any internal reduction to the inputs by construction.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: LLM-generated proofs constitute valid observations of distinct individual commonsense belief sets
- domain assumption: Majority vote across samples approximates what most people would judge true or false
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · unclear
  unclear: Relation between the paper passage and the cited Recognition theorem.
  Passage: "We propose a probabilistic framework... sample proofs as observations of individuals’ distinct commonsense beliefs, and aggregates conclusions across these samples."
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  unclear: Relation between the paper passage and the cited Recognition theorem.
  Passage: $\widehat{AP}(S, c) = \frac{1}{K} \sum_{k=1}^{K} \mathbb{1}[S \land L_k \vdash c]$
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.