QMFOL: Benchmarking Large Language Model Reasoning via Quantifiable Monadic First-Order Logic Test Case Generation

Kailong Wang; Ling Shi; Lorenz Goette; Tianlong Yu; Xinyi Zheng; Yongxin Zhao

arxiv: 2606.20227 · v1 · pith:DJKUWPK3new · submitted 2026-06-18 · 💻 cs.AI · cs.SE

QMFOL: Benchmarking Large Language Model Reasoning via Quantifiable Monadic First-Order Logic Test Case Generation

Xinyi Zheng , Ling Shi , Tianlong Yu , Yongxin Zhao , Lorenz Goette , Kailong Wang This is my paper

Pith reviewed 2026-06-26 17:22 UTC · model grok-4.3

classification 💻 cs.AI cs.SE

keywords deductive reasoningLLM evaluationfirst-order logicbenchmark generationlogical complexitymonadic FOLtest case generationreasoning benchmarks

0 comments

The pith

QMFOL generates monadic first-order logic benchmarks with precise control over reasoning depth, width, and label types to measure LLM deductive performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces an automated way to build deductive reasoning test cases from formal monadic first-order logic structures that use conjunction and disjunction patterns to set exact levels of complexity. These structures are turned into natural-language problems by language models and then checked for consistency by sending the output back through an external prover. The resulting benchmark contains 2880 instances spanning 960 configurations, and tests on several large models show accuracy falling and compute rising as logical depth or width increases. Models handle true-labeled cases more reliably than false or unknown ones and remain sensitive to how the same logic is worded. This setup lets evaluators dial complexity without losing logical soundness, addressing gaps in earlier benchmarks that lacked such control.

Core claim

QMFOL constructs formal logical structures in monadic first-order logic using controlled conjunction and disjunction patterns, translates them into natural language via LLMs, and applies round-trip verification with an external prover to guarantee consistency; the resulting QMFOLBench of 2880 instances demonstrates that large reasoning models exhibit declining performance and rising computational cost as logical complexity rises, with better results on true labels than false or unknown and clear sensitivity to semantic wording.

What carries the argument

Formal logical structures built from conjunction and disjunction patterns in monadic first-order logic, translated to natural language and verified by round-trip prover checks, that enforce controllable reasoning depth, width, label types, and distractors.

If this is right

LLM accuracy on deductive tasks decreases steadily as logical depth and width increase.
Models achieve higher performance on tasks whose correct label is True than on those labeled False or Unknown.
Computational cost for solving the tasks grows with added logical complexity.
Changing the natural-language wording of the same logical structure alters model success rates.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Benchmark designers could apply the same pattern-based construction and verification loop to other fragments of logic beyond monadic first-order logic.
Training pipelines that sample tasks at targeted complexity levels might produce models whose reasoning scales more evenly.
The verification step points toward hybrid systems that combine language models with symbolic provers for generating reliable evaluation data at scale.

Load-bearing premise

Round-trip verification by an external prover is enough to ensure that the natural-language versions keep exactly the same logical structure and label as the original formal statements.

What would settle it

An independent prover or human logician finding a natural-language instance whose truth value differs from the formal source after the round-trip check passed would show the verification step does not preserve semantics.

Figures

Figures reproduced from arXiv: 2606.20227 by Kailong Wang, Ling Shi, Lorenz Goette, Tianlong Yu, Xinyi Zheng, Yongxin Zhao.

**Figure 2.** Figure 2: Overview of QMFOL. Blue boxes represent inputs, while pink represent outputs. 𝑁𝑖 is a predicate node, either an atomic predicate 𝑃𝑖 or its negation ¬𝑃𝑖 . For simplicity, 𝑁𝑖 = 𝑃𝑖 in the shown task instance. ❶ Basic Rules Construction. As shown in Algorithm 1, the construction starts from the initial predicate node 𝑁0. • Depth Expansion: (line 4-6) We first introduce 𝐷 new predicate nodes and construct the… view at source ↗

**Figure 3.** Figure 3: Prompt Template for FOL2NL conversion. transformation is illustrated in the middle part of [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗

**Figure 4.** Figure 4: Average answer accuracy (%), time overhead (s), and token overhead ( [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗

**Figure 5.** Figure 5: Model answer changes from QMFOLBench subsets without distractors to those with varying numbers of distractors, [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗

read the original abstract

Large Language Models (LLMs) have made significant progress in reasoning, particularly in deductive reasoning, which is crucial for high-stakes decision-making. As models improve, evaluation benchmarks should evolve to keep pace. However, existing benchmarks lack fine-grained control over logical complexity and struggle to balance semantic diversity with logical consistency. To address these issues, we propose QMFOL, an automated framework for generating monadic first-order logic reasoning tasks with quantifiable and controllable complexity. It constructs formal logical structures using conjunction and disjunction patterns, enabling precise control over reasoning depth, width, label types, and distractors. These structures are then translated into natural language via LLMs, with logical consistency ensured through round-trip verification using an external prover. Based on our framework, we build QMFOLBench, a benchmark comprising 2880 instances with 960 configurations across diverse logical and semantic dimensions. Evaluations on six large reasoning models (LRMs) and two LLMs show that performance degrades and computational overhead increases with rising logical complexity. Models perform better on True-labeled tasks than on False or Unknown ones, and exhibit sensitivity to semantic variation. Overall, QMFOL offers a scalable and reliable approach for constructing deductive reasoning benchmarks with controllable complexity, enabling more precise evaluation of reasoning capabilities in modern language models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

QMFOL adds controllable monadic FOL generation with prover checks, but round-trip verification leaves semantic drift risk unaddressed.

read the letter

The core offering is a generator that builds monadic FOL structures from conjunction/disjunction patterns, lets you set depth, width, label, and distractors, translates them to natural language with an LLM, and runs round-trip prover verification to keep the logic intact. They produce QMFOLBench with 2880 instances over 960 configurations and run it on six LRMs plus two LLMs, reporting that accuracy drops and compute rises with complexity while models favor True labels over False or Unknown.

The synthesis of monadic restriction, explicit control knobs, and automated verification is new relative to the benchmarks cited in the abstract. The evaluation shows clear trends on sensitivity to semantic variation, which is useful incremental work for anyone building deductive reasoning tests.

The main soft spot is exactly the one in the stress-test note. Round-trip verification assumes that reconstructing a formal sentence from the LLM-generated English will catch any drift in meaning, scope, or implied restrictions. Monadic FOL in natural language can still introduce ambiguities that a prover on the back-translation would miss, especially once distractors or width changes alter phrasing. The abstract gives no quantitative checks on how often this happens or how they validated the translations beyond the prover step, so the controllability claim rests on an assumption that may not hold.

The paper is aimed at researchers who need tunable logic benchmarks rather than fixed suites. It is coherent enough on its own terms to warrant a serious referee who can examine the verification procedure and the actual result tables in detail.

Referee Report

1 major / 1 minor

Summary. The paper introduces QMFOL, a framework that generates monadic first-order logic structures via conjunction/disjunction patterns with explicit control over depth, width, label type, and distractors; translates them to natural language using LLMs; and applies round-trip verification with an external prover to ensure consistency. From this it constructs QMFOLBench (2880 instances, 960 configurations) and reports that six LRMs and two LLMs exhibit degrading performance and rising compute cost with increasing logical complexity, a preference for True labels, and sensitivity to semantic variation.

Significance. If the verification step reliably prevents semantic drift, the approach supplies a scalable route to deductive-reasoning benchmarks whose complexity parameters are inherited from the formal side rather than fitted post hoc. The independent-prover check and the explicit enumeration of 960 configurations constitute concrete strengths that would allow more precise isolation of reasoning failures than existing datasets.

major comments (1)

[Abstract] Abstract (and the verification procedure described therein): the central reliability claim—that round-trip verification with an external prover guarantees that LLM natural-language translations preserve the original logical structure, label, and complexity—rests on an assumption whose sufficiency is not demonstrated. Monadic FOL rendered in English can introduce scope ambiguities, implicit domain restrictions, or pragmatic implications that a back-translated formal sentence would not necessarily expose, especially once distractors or width/depth parameters alter surface phrasing. Because benchmark labels and complexity metrics are taken directly from the formal side, any undetected mismatch directly undermines the controllability guarantee that underpins the paper’s contribution.

minor comments (1)

The abstract states performance trends but supplies no quantitative tables, error bars, or exclusion criteria; the full methods and results sections will need to include these to allow readers to judge robustness.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the careful reading and for identifying a key point about the strength of the verification claim. We respond to the major comment below.

read point-by-point responses

Referee: [Abstract] Abstract (and the verification procedure described therein): the central reliability claim—that round-trip verification with an external prover guarantees that LLM natural-language translations preserve the original logical structure, label, and complexity—rests on an assumption whose sufficiency is not demonstrated. Monadic FOL rendered in English can introduce scope ambiguities, implicit domain restrictions, or pragmatic implications that a back-translated formal sentence would not necessarily expose, especially once distractors or width/depth parameters alter surface phrasing. Because benchmark labels and complexity metrics are taken directly from the formal side, any undetected mismatch directly undermines the controllability guarantee that underpins the paper’s contribution.

Authors: We agree that the round-trip verification establishes structural equivalence (the back-translated formula must match the original for the prover to succeed) but does not constitute a complete guarantee against all possible scope ambiguities, domain restrictions, or pragmatic implications that may arise in the natural-language rendering. The check is therefore necessary for preserving the formal label and complexity parameters yet insufficient to rule out every semantic nuance. In the revised manuscript we will (1) qualify the claim in the abstract and methods section to state precisely what the verification guarantees, (2) add an explicit limitations paragraph discussing the referee’s concerns, and (3) report a small-scale human fidelity study on a subset of instances to provide additional evidence. These changes will be reflected in the next version. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained via external verification

full rationale

The paper's central contribution is a constructive generation procedure that builds formal monadic FOL structures with explicit parameters for depth/width/label/distractors, translates them via LLM, and applies round-trip verification against an independent external prover. This verification step is external to the generation process and does not reduce any claimed benchmark property back to quantities fitted from the same outputs. No equations, self-citations, or uniqueness theorems are invoked to force the result; the controllability and consistency guarantees are inherited directly from the formal side plus the prover check, making the method self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Only the abstract is available; no explicit free parameters, new axioms, or invented entities are stated. The framework relies on standard first-order logic semantics and the correctness of an unnamed external prover.

axioms (2)

domain assumption Monadic first-order logic formulas can be constructed from conjunction and disjunction patterns while preserving decidability and allowing exact control of reasoning depth and width.
Invoked when the paper states that formal logical structures are built using conjunction and disjunction patterns to control complexity.
domain assumption Round-trip translation through an LLM followed by prover verification guarantees semantic equivalence between the formal formula and the generated natural-language sentence.
Central to the claim that logical consistency is ensured.

pith-pipeline@v0.9.1-grok · 5778 in / 1428 out tokens · 30212 ms · 2026-06-26T17:22:30.664525+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

39 extracted references · 14 canonical work pages

[1]

Anthropic. 2026. Introducing Claude Sonnet 4.6. https://www.anthropic.com/ news/claude-sonnet-4-6. Accessed: 2026-3-19

2026
[2]

Auer, Christian Bizer, Georgi Kobilarov, Jens Lehmann, Richard Cyganiak, and Zachary G

S. Auer, Christian Bizer, Georgi Kobilarov, Jens Lehmann, Richard Cyganiak, and Zachary G. Ives. 2007. DBpedia: A Nucleus for a Web of Open Data. In ISWC/ASWC. https://api.semanticscholar.org/CorpusID:7278297

2007
[3]

Hyeong Kyu Choi, Maxim Khanov, Hongxin Wei, and Yixuan Li. 2025. How Con- taminated Is Your Benchmark? Measuring Dataset Leakage in Large Language Models with Kernel Divergence. InInternational Conference on Machine Learning. https://icml.cc/media/icml-2025/Slides/43619.pdf

2025
[4]

Zheng Chu, Jingchang Chen, Qianglong Chen, Weijiang Yu, Haotian Wang, Ming Liu, and Bing Qin. 2024. TimeBench: A Comprehensive Evaluation of Temporal Reasoning Abilities in Large Language Models. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Lun-Wei Ku, Andre Martins, and Vivek Srikumar ...

work page doi:10.18653/v1/ 2024
[5]

Peter Clark, Oyvind Tafjord, and Kyle Richardson. 2021. Transformers as soft reasoners over language. InProceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence(Yokohama, Yokohama, Japan)(IJCAI’20). Article 537, 9 pages. https://arxiv.org/abs/2002.05867

arXiv 2021
[6]

Kolbinger, Hannah Sophie Muti, Zunamys I

Jan Clusmann, Fiona R. Kolbinger, Hannah Sophie Muti, Zunamys I. Carrero, Jan-Niklas Eckardt, Narmin Ghaffari Laleh, Chiara Maria Lavinia Löffler, Sophie- Caroline Schwarzkopf, Michaela Unger, Gregory P. Veldhuizen, Sophia J. Wagner, and Jakob Nikolas Kather. 2023. The future landscape of large language models in medicine.Communications Medicine3, 1 (2023...

work page doi:10.1038/s43856-023- 2023
[7]

DeepSeek-AI, Aixin Liu, Aoxue Mei, Bangcai Lin, Bing Xue, Bingxuan Wang, Bingzheng Xu, Bochao Wu, Bowei Zhang, Chaofan Lin, Chen Dong, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenhao Xu, Chong Ruan, Damai Dai, Daya Guo, Dejian Yang, Deli Chen, Erhang Li, Fangqi Zhou, Fangyun Lin, Fucong Dai, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Hanwei Xu, Ha...

Pith/arXiv arXiv 2025
[8]

Eva Eigner and Thorsten Händler. 2024. Determinants of LLM-assisted Decision- Making. arXiv:2402.17385 [cs.AI] https://arxiv.org/abs/2402.17385

arXiv 2024
[9]

Google. 2026. Gemini 3.1 Pro. https://ai.google.dev/gemini-api/docs/models/ gemini-3.1-pro-preview. Accessed: 2026-3-19

2026
[10]

Simeng Han, Hailey Schoelkopf, Yilun Zhao, Zhenting Qi, Martin Riddell, Wenfei Zhou, James Coady, David Peng, Yujie Qiao, Luke Benson, Lucy Sun, Alexander Wardle-Solano, Hannah Szabó, Ekaterina Zubova, Matthew Burtell, Jonathan Fan, Yixin Liu, Brian Wong, Malcolm Sailor, Ansong Ni, Linyong Nan, Jungo Kasai, Tao Yu, Rui Zhang, Alexander Fabbri, Wojciech Ma...

work page doi:10.18653/v1/2024.emnlp-main.1229 2024
[11]

Kaixuan Huang, Jiacheng Guo, Zihao Li, Xiang Ji, Jiawei Ge, Wenzhe Li, Yingqing Guo, Tianle Cai, Hui Yuan, Runzhe Wang, Yue Wu, Ming Yin, Shange Tang, Yangsibo Huang, Chi Jin, Xinyun Chen, Chiyuan Zhang, and Mengdi Wang. 2025. MATH-Perturb: Benchmarking LLMs’ Math Reasoning Abilities against Hard Perturbations.arXiv preprint arXiv:2502.06453(2025). https:...

arXiv 2025
[12]

Duygu Sezen Islakoglu and Jan-Christoph Kalo. 2025. ChronoSense: Exploring Temporal Understanding in Large Language Models with Time Intervals of Events. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar (Eds.). Ass...

work page doi:10.18653/v1/2025.acl-short.46 2025
[13]

Laura Kovács and Andrei Voronkov. 2013. First-Order Theorem Proving and Vampire. InComputer Aided Verification, Natasha Sharygina and Helmut Veith (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg, 1–35

2013
[14]

Jinqi Lai, Wensheng Gan, Jiayang Wu, Zhenlian Qi, and Philip S. Yu. 2024. Large language models in law: A survey.AI Open5 (2024), 181–196. doi:10.1016/j. aiopen.2024.09.002

work page doi:10.1016/j 2024
[15]

Ningke Li, Yuekang Li, Yi Liu, Ling Shi, Kailong Wang, and Haoyu Wang. 2024. Drowzee: Metamorphic Testing for Fact-Conflicting Hallucination Detection in Large Language Models.Proc. ACM Program. Lang.8, OOPSLA2, Article 336 (Oct. 2024), 30 pages. doi:10.1145/3689776

work page doi:10.1145/3689776 2024
[16]

Iman Mirzadeh, Keivan Alizadeh, Hooman Shahrokhi, Oncel Tuzel, Samy Bengio, and Mehrdad Farajtabar. 2024. Gsm-symbolic: Understanding the limitations of mathematical reasoning in large language models.arXiv preprint arXiv:2410.05229 (2024). https://arxiv.org/abs/2410.05229

Pith/arXiv arXiv 2024
[17]

Mahmoud Mohammadi, Yipeng Li, Jane Lo, and Wendy Yip. 2025. Evaluation and Benchmarking of LLM Agents: A Survey. InProceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2 (KDD ’25). ACM, 6129–6139. doi:10.1145/3711896.3736570

work page doi:10.1145/3711896.3736570 2025
[18]

Phuong Minh Nguyen, Tien Huu Dang, and Naoya Inoue. 2025. Non- Interactive Symbolic-Aided Chain-of-Thought for Logical Reasoning. arXiv:2508.12425 [cs.AI] https://arxiv.org/abs/2508.12425

arXiv 2025
[19]

OpenAI. 2024. Hello GPT-4o. https://openai.com/index/hello-gpt-4o/. Accessed: 2026-1-3

2024
[20]

OpenAI. 2026. Introducing GPT -5.4. https://openai.com/index/introducing-gpt- 5-4/. Accessed: 2026-3-19

2026
[21]

Liangming Pan, Alon Albalak, Xinyi Wang, and William Wang. 2023. Logic- LM: Empowering Large Language Models with Symbolic Solvers for Faithful Logical Reasoning. InFindings of the Association for Computational Linguistics: EMNLP 2023, Houda Bouamor, Juan Pino, and Kalika Bali (Eds.). Association for Computational Linguistics, Singapore, 3806–3824. doi:10...

work page doi:10.18653/v1/2023.findings- 2023
[22]

Chengwen Qi, Ren Ma, Bowen Li, He Du, Binyuan Hui, Jinwang Wu, Yuanjun Laili, and Conghui He. 2025. Large Language Models Meet Symbolic Provers for Logical Reasoning Evaluation. InThe Thirteenth International Conference on Learning Representations. https://openreview.net/forum?id=C25SgeXWjE

2025
[23]

QMFOL. 2026. https://doi.org/10.5281/zenodo.19348694

work page doi:10.5281/zenodo.19348694 2026
[24]

Qwen. 2026. Qwen3.5: Towards Native Multimodal Agents. https://qwen.ai/blog? id=qwen3.5. Accessed: 2026-3-19

2026
[25]

Zhyar Rzgar K Rostam, Márta Takács, and Gábor Kertész. 2025. Evaluating Large Language Models: A Review of Metrics and Benchmarks. In2025 IEEE 23rd Jubilee International Symposium on Intelligent Systems and Informatics (SISY). 000233–000240. doi:10.1109/SISY67000.2025.11205366

work page doi:10.1109/sisy67000.2025.11205366 2025
[26]

Soumya Sanyal, Zeyi Liao, and Xiang Ren. 2022. RobustLR: A Diagnostic Benchmark for Evaluating Logical Robustness of Deductive Reasoners. InPro- ceedings of the 2022 Conference on Empirical Methods in Natural Language Pro- cessing, Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang (Eds.). Association for Computational Linguistics, Abu Dhabi, United Arab Emi...

work page doi:10.18653/v1/2022.emnlp-main.653 2022
[27]

Abulhair Saparov and He He. 2023. Language Models Are Greedy Reasoners: A Systematic Formal Analysis of Chain-of-Thought. InThe Eleventh Interna- tional Conference on Learning Representations. https://openreview.net/forum?id= qFVVBzXxR2V

2023
[28]

Abulhair Saparov, Richard Yuanzhe Pang, Vishakh Padmakumar, Nitish Joshi, Seyed Mehran Kazemi, Najoung Kim, and He He. 2023. Testing Conference acronym ’XX, June 03–05, 2018, Woodstock, NY Zheng et al. the General Deductive Reasoning Capacity of Large Language Models Us- ing OOD Examples. InAdvances in Neural Information Processing Systems, Vol. 36. 3083–...

2023
[29]

Oyvind Tafjord, Bhavana Dalvi, and Peter Clark. 2021. ProofWriter: Generating Implications, Proofs, and Abductive Statements over Natural Language. InFind- ings of the Association for Computational Linguistics: ACL-IJCNLP 2021, Chengqing Zong, Fei Xia, Wenjie Li, and Roberto Navigli (Eds.). Association for Computa- tional Linguistics, Online, 3621–3634. d...

work page doi:10.18653/v1/2021.findings-acl.317 2021
[30]

Jidong Tian, Yitian Li, Wenqing Chen, Liqiang Xiao, Hao He, and Yaohui Jin
[31]

Diagnosing the First-Order Logical Reasoning Ability Through L ogic NLI

Diagnosing the First-Order Logical Reasoning Ability Through LogicNLI. InProceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen- tau Yih (Eds.). Association for Computational Linguistics, Online and Punta Cana, Dominican Republic, 3738–3747. doi:10.18653/v1/...

work page doi:10.18653/v1/2021.emnlp-main.303 2021
[32]

Jan Wielemaker, Tom Schrijvers, Markus Triska, and TorbjÖrn Lager. 2012. Swi- prolog.Theory Pract. Log. Program.12, 1–2 (Jan. 2012), 67–96. doi:10.1017/ S1471068411000494

2012
[33]

Wikipedia. 2026. Automated theorem proving. https://en.wikipedia.org/wiki/ Automated_theorem_proving. Accessed: 2026-3-19

2026
[34]

Pattaraphon Kenny Wongchamcharoen and Paul Glasserman. 2026. Do Large Language Models Understand Chronology?. InProceedings of the AAAI Conference on Artificial Intelligence. https://arxiv.org/abs/2511.14214 Student Abstract

arXiv 2026
[35]

Zihao Xu, Junchen Ding, Yiling Lou, Kun Zhang, Dong Gong, and Yuekang Li
[36]

https://arxiv.org/abs/2504.12312

Socrates or Smartypants: Testing Logic Reasoning Capabilities of Large Language Models with Logic Programming-based Test Oracles.arXiv preprint arXiv:2504.12312(2025). https://arxiv.org/abs/2504.12312

arXiv 2025
[37]

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ...

Pith/arXiv arXiv 2025
[38]

Sherry Yang, Ofir Nachum, Yilun Du, Jason Wei, Pieter Abbeel, and Dale Schuur- mans. 2023. Foundation Models for Decision Making: Problems, Methods, and Opportunities. arXiv:2303.04129 [cs.AI] https://arxiv.org/abs/2303.04129

arXiv 2023
[39]

Jingqian Zhao, Bingbing Wang, Geng Tu, Yice Zhang, Qianlong Wang, Bin Liang, Jing Li, and Ruifeng Xu. 2025. CoreEval: Automatically Building Contamination- Resilient Datasets with Real-World Knowledge toward Reliable LLM Evaluation. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Wanxiang ...

work page doi:10.18653/v1/2025.acl-long.1085 2025

[1] [1]

Anthropic. 2026. Introducing Claude Sonnet 4.6. https://www.anthropic.com/ news/claude-sonnet-4-6. Accessed: 2026-3-19

2026

[2] [2]

Auer, Christian Bizer, Georgi Kobilarov, Jens Lehmann, Richard Cyganiak, and Zachary G

S. Auer, Christian Bizer, Georgi Kobilarov, Jens Lehmann, Richard Cyganiak, and Zachary G. Ives. 2007. DBpedia: A Nucleus for a Web of Open Data. In ISWC/ASWC. https://api.semanticscholar.org/CorpusID:7278297

2007

[3] [3]

Hyeong Kyu Choi, Maxim Khanov, Hongxin Wei, and Yixuan Li. 2025. How Con- taminated Is Your Benchmark? Measuring Dataset Leakage in Large Language Models with Kernel Divergence. InInternational Conference on Machine Learning. https://icml.cc/media/icml-2025/Slides/43619.pdf

2025

[4] [4]

Zheng Chu, Jingchang Chen, Qianglong Chen, Weijiang Yu, Haotian Wang, Ming Liu, and Bing Qin. 2024. TimeBench: A Comprehensive Evaluation of Temporal Reasoning Abilities in Large Language Models. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Lun-Wei Ku, Andre Martins, and Vivek Srikumar ...

work page doi:10.18653/v1/ 2024

[5] [5]

Peter Clark, Oyvind Tafjord, and Kyle Richardson. 2021. Transformers as soft reasoners over language. InProceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence(Yokohama, Yokohama, Japan)(IJCAI’20). Article 537, 9 pages. https://arxiv.org/abs/2002.05867

arXiv 2021

[6] [6]

Kolbinger, Hannah Sophie Muti, Zunamys I

Jan Clusmann, Fiona R. Kolbinger, Hannah Sophie Muti, Zunamys I. Carrero, Jan-Niklas Eckardt, Narmin Ghaffari Laleh, Chiara Maria Lavinia Löffler, Sophie- Caroline Schwarzkopf, Michaela Unger, Gregory P. Veldhuizen, Sophia J. Wagner, and Jakob Nikolas Kather. 2023. The future landscape of large language models in medicine.Communications Medicine3, 1 (2023...

work page doi:10.1038/s43856-023- 2023

[7] [7]

DeepSeek-AI, Aixin Liu, Aoxue Mei, Bangcai Lin, Bing Xue, Bingxuan Wang, Bingzheng Xu, Bochao Wu, Bowei Zhang, Chaofan Lin, Chen Dong, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenhao Xu, Chong Ruan, Damai Dai, Daya Guo, Dejian Yang, Deli Chen, Erhang Li, Fangqi Zhou, Fangyun Lin, Fucong Dai, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Hanwei Xu, Ha...

Pith/arXiv arXiv 2025

[8] [8]

Eva Eigner and Thorsten Händler. 2024. Determinants of LLM-assisted Decision- Making. arXiv:2402.17385 [cs.AI] https://arxiv.org/abs/2402.17385

arXiv 2024

[9] [9]

Google. 2026. Gemini 3.1 Pro. https://ai.google.dev/gemini-api/docs/models/ gemini-3.1-pro-preview. Accessed: 2026-3-19

2026

[10] [10]

Simeng Han, Hailey Schoelkopf, Yilun Zhao, Zhenting Qi, Martin Riddell, Wenfei Zhou, James Coady, David Peng, Yujie Qiao, Luke Benson, Lucy Sun, Alexander Wardle-Solano, Hannah Szabó, Ekaterina Zubova, Matthew Burtell, Jonathan Fan, Yixin Liu, Brian Wong, Malcolm Sailor, Ansong Ni, Linyong Nan, Jungo Kasai, Tao Yu, Rui Zhang, Alexander Fabbri, Wojciech Ma...

work page doi:10.18653/v1/2024.emnlp-main.1229 2024

[11] [11]

Kaixuan Huang, Jiacheng Guo, Zihao Li, Xiang Ji, Jiawei Ge, Wenzhe Li, Yingqing Guo, Tianle Cai, Hui Yuan, Runzhe Wang, Yue Wu, Ming Yin, Shange Tang, Yangsibo Huang, Chi Jin, Xinyun Chen, Chiyuan Zhang, and Mengdi Wang. 2025. MATH-Perturb: Benchmarking LLMs’ Math Reasoning Abilities against Hard Perturbations.arXiv preprint arXiv:2502.06453(2025). https:...

arXiv 2025

[12] [12]

Duygu Sezen Islakoglu and Jan-Christoph Kalo. 2025. ChronoSense: Exploring Temporal Understanding in Large Language Models with Time Intervals of Events. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar (Eds.). Ass...

work page doi:10.18653/v1/2025.acl-short.46 2025

[13] [13]

Laura Kovács and Andrei Voronkov. 2013. First-Order Theorem Proving and Vampire. InComputer Aided Verification, Natasha Sharygina and Helmut Veith (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg, 1–35

2013

[14] [14]

Jinqi Lai, Wensheng Gan, Jiayang Wu, Zhenlian Qi, and Philip S. Yu. 2024. Large language models in law: A survey.AI Open5 (2024), 181–196. doi:10.1016/j. aiopen.2024.09.002

work page doi:10.1016/j 2024

[15] [15]

Ningke Li, Yuekang Li, Yi Liu, Ling Shi, Kailong Wang, and Haoyu Wang. 2024. Drowzee: Metamorphic Testing for Fact-Conflicting Hallucination Detection in Large Language Models.Proc. ACM Program. Lang.8, OOPSLA2, Article 336 (Oct. 2024), 30 pages. doi:10.1145/3689776

work page doi:10.1145/3689776 2024

[16] [16]

Iman Mirzadeh, Keivan Alizadeh, Hooman Shahrokhi, Oncel Tuzel, Samy Bengio, and Mehrdad Farajtabar. 2024. Gsm-symbolic: Understanding the limitations of mathematical reasoning in large language models.arXiv preprint arXiv:2410.05229 (2024). https://arxiv.org/abs/2410.05229

Pith/arXiv arXiv 2024

[17] [17]

Mahmoud Mohammadi, Yipeng Li, Jane Lo, and Wendy Yip. 2025. Evaluation and Benchmarking of LLM Agents: A Survey. InProceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2 (KDD ’25). ACM, 6129–6139. doi:10.1145/3711896.3736570

work page doi:10.1145/3711896.3736570 2025

[18] [18]

Phuong Minh Nguyen, Tien Huu Dang, and Naoya Inoue. 2025. Non- Interactive Symbolic-Aided Chain-of-Thought for Logical Reasoning. arXiv:2508.12425 [cs.AI] https://arxiv.org/abs/2508.12425

arXiv 2025

[19] [19]

OpenAI. 2024. Hello GPT-4o. https://openai.com/index/hello-gpt-4o/. Accessed: 2026-1-3

2024

[20] [20]

OpenAI. 2026. Introducing GPT -5.4. https://openai.com/index/introducing-gpt- 5-4/. Accessed: 2026-3-19

2026

[21] [21]

Liangming Pan, Alon Albalak, Xinyi Wang, and William Wang. 2023. Logic- LM: Empowering Large Language Models with Symbolic Solvers for Faithful Logical Reasoning. InFindings of the Association for Computational Linguistics: EMNLP 2023, Houda Bouamor, Juan Pino, and Kalika Bali (Eds.). Association for Computational Linguistics, Singapore, 3806–3824. doi:10...

work page doi:10.18653/v1/2023.findings- 2023

[22] [22]

Chengwen Qi, Ren Ma, Bowen Li, He Du, Binyuan Hui, Jinwang Wu, Yuanjun Laili, and Conghui He. 2025. Large Language Models Meet Symbolic Provers for Logical Reasoning Evaluation. InThe Thirteenth International Conference on Learning Representations. https://openreview.net/forum?id=C25SgeXWjE

2025

[23] [23]

QMFOL. 2026. https://doi.org/10.5281/zenodo.19348694

work page doi:10.5281/zenodo.19348694 2026

[24] [24]

Qwen. 2026. Qwen3.5: Towards Native Multimodal Agents. https://qwen.ai/blog? id=qwen3.5. Accessed: 2026-3-19

2026

[25] [25]

Zhyar Rzgar K Rostam, Márta Takács, and Gábor Kertész. 2025. Evaluating Large Language Models: A Review of Metrics and Benchmarks. In2025 IEEE 23rd Jubilee International Symposium on Intelligent Systems and Informatics (SISY). 000233–000240. doi:10.1109/SISY67000.2025.11205366

work page doi:10.1109/sisy67000.2025.11205366 2025

[26] [26]

Soumya Sanyal, Zeyi Liao, and Xiang Ren. 2022. RobustLR: A Diagnostic Benchmark for Evaluating Logical Robustness of Deductive Reasoners. InPro- ceedings of the 2022 Conference on Empirical Methods in Natural Language Pro- cessing, Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang (Eds.). Association for Computational Linguistics, Abu Dhabi, United Arab Emi...

work page doi:10.18653/v1/2022.emnlp-main.653 2022

[27] [27]

Abulhair Saparov and He He. 2023. Language Models Are Greedy Reasoners: A Systematic Formal Analysis of Chain-of-Thought. InThe Eleventh Interna- tional Conference on Learning Representations. https://openreview.net/forum?id= qFVVBzXxR2V

2023

[28] [28]

Abulhair Saparov, Richard Yuanzhe Pang, Vishakh Padmakumar, Nitish Joshi, Seyed Mehran Kazemi, Najoung Kim, and He He. 2023. Testing Conference acronym ’XX, June 03–05, 2018, Woodstock, NY Zheng et al. the General Deductive Reasoning Capacity of Large Language Models Us- ing OOD Examples. InAdvances in Neural Information Processing Systems, Vol. 36. 3083–...

2023

[29] [29]

Oyvind Tafjord, Bhavana Dalvi, and Peter Clark. 2021. ProofWriter: Generating Implications, Proofs, and Abductive Statements over Natural Language. InFind- ings of the Association for Computational Linguistics: ACL-IJCNLP 2021, Chengqing Zong, Fei Xia, Wenjie Li, and Roberto Navigli (Eds.). Association for Computa- tional Linguistics, Online, 3621–3634. d...

work page doi:10.18653/v1/2021.findings-acl.317 2021

[30] [30]

Jidong Tian, Yitian Li, Wenqing Chen, Liqiang Xiao, Hao He, and Yaohui Jin

[31] [31]

Diagnosing the First-Order Logical Reasoning Ability Through L ogic NLI

Diagnosing the First-Order Logical Reasoning Ability Through LogicNLI. InProceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen- tau Yih (Eds.). Association for Computational Linguistics, Online and Punta Cana, Dominican Republic, 3738–3747. doi:10.18653/v1/...

work page doi:10.18653/v1/2021.emnlp-main.303 2021

[32] [32]

Jan Wielemaker, Tom Schrijvers, Markus Triska, and TorbjÖrn Lager. 2012. Swi- prolog.Theory Pract. Log. Program.12, 1–2 (Jan. 2012), 67–96. doi:10.1017/ S1471068411000494

2012

[33] [33]

Wikipedia. 2026. Automated theorem proving. https://en.wikipedia.org/wiki/ Automated_theorem_proving. Accessed: 2026-3-19

2026

[34] [34]

Pattaraphon Kenny Wongchamcharoen and Paul Glasserman. 2026. Do Large Language Models Understand Chronology?. InProceedings of the AAAI Conference on Artificial Intelligence. https://arxiv.org/abs/2511.14214 Student Abstract

arXiv 2026

[35] [35]

Zihao Xu, Junchen Ding, Yiling Lou, Kun Zhang, Dong Gong, and Yuekang Li

[36] [36]

https://arxiv.org/abs/2504.12312

Socrates or Smartypants: Testing Logic Reasoning Capabilities of Large Language Models with Logic Programming-based Test Oracles.arXiv preprint arXiv:2504.12312(2025). https://arxiv.org/abs/2504.12312

arXiv 2025

[37] [37]

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ...

Pith/arXiv arXiv 2025

[38] [38]

Sherry Yang, Ofir Nachum, Yilun Du, Jason Wei, Pieter Abbeel, and Dale Schuur- mans. 2023. Foundation Models for Decision Making: Problems, Methods, and Opportunities. arXiv:2303.04129 [cs.AI] https://arxiv.org/abs/2303.04129

arXiv 2023

[39] [39]

Jingqian Zhao, Bingbing Wang, Geng Tu, Yice Zhang, Qianlong Wang, Bin Liang, Jing Li, and Ruifeng Xu. 2025. CoreEval: Automatically Building Contamination- Resilient Datasets with Real-World Knowledge toward Reliable LLM Evaluation. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Wanxiang ...

work page doi:10.18653/v1/2025.acl-long.1085 2025