LLMEval-Logic: A Solver-Verified Chinese Benchmark for Logical Reasoning of LLMs with Adversarial Hardening

Junjie Ye; Kexin Tan; Maxm Pan; Ming Zhang; Qiyuan Peng; Qi Zhang; Ruizhi Yang; Shihan Dou; Tao Gui; Xuanjing Huang

arxiv: 2605.19597 · v1 · pith:I35OYP2Pnew · submitted 2026-05-19 · 💻 cs.CL

LLMEval-Logic: A Solver-Verified Chinese Benchmark for Logical Reasoning of LLMs with Adversarial Hardening

Ming Zhang , Qiyuan Peng , Yinxi Wei , Yujiong Shen , Kexin Tan , Yuhui Wang , Zhenghao Xiang , Junjie Ye

show 8 more authors

Zhangyue Yin Zhiheng Xi Shihan Dou Tao Gui Maxm Pan Ruizhi Yang Qi Zhang Xuanjing Huang

This is my paper

Pith reviewed 2026-05-20 05:44 UTC · model grok-4.3

classification 💻 cs.CL

keywords logical reasoning benchmarkLLM evaluationChinese languageZ3 solver verificationadversarial hardeningformalization scoringnatural language to logic

0 comments

The pith

A new Chinese benchmark with Z3-verified formalizations shows frontier LLMs reach only 37.5 percent accuracy on its hardest logical-reasoning items.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces LLMEval-Logic, a benchmark of realistic situational scenarios in Chinese whose natural-language statements are paired with expert-audited formalizations. Items are verified for correctness by the Z3 solver and further hardened through an adversarial loop that generates challenging variants while keeping the underlying logic closed. Evaluation of fourteen frontier models on the resulting base and hard subsets documents clear performance ceilings, with the strongest model attaining 37.5 percent hard-item accuracy and 60.16 percent on joint Z3-plus-rubric formalization scoring even when reference symbols are supplied. A sympathetic reader would conclude that current LLMs still lack reliable mechanisms for deriving conclusions that follow strictly from stated premises in open-ended natural language.

Core claim

LLMEval-Logic supplies 246 base items and 190 hard items, each accompanied by reference formalizations whose satisfiability is checked by Z3 and whose natural-to-formal mappings are scored against expert rubrics. The hard subset further decomposes each item into 938 multi-step sub-questions over closed model spaces. When fourteen frontier LLMs are tested, the highest hard-item accuracy observed is 37.5 percent and the highest joint formalization score is 60.16 percent, indicating that existing models cannot yet maintain strict logical entailment across realistic Chinese scenarios even under favorable prompting conditions.

What carries the argument

The closed-loop adversarial hardening pipeline that starts from expert-audited natural-language items and reference formalizations, verifies them with Z3, and iteratively generates harder variants while preserving logical closure.

If this is right

Models must improve their ability to map natural-language premises to precise logical forms without external symbol guidance.
Benchmark saturation will require new items that survive repeated adversarial refinement rather than templated generation.
Joint solver-plus-rubric evaluation becomes a necessary complement to surface-level accuracy when judging logical competence.
Chinese situational scenarios expose gaps that English-centric benchmarks may have masked.
Progress on logical reasoning will be measurable only after models exceed the 37.5 percent hard-item threshold observed here.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same pipeline could be applied to other languages or to domains such as legal or medical reasoning where strict entailment matters.
If the low formalization scores persist, training objectives that directly optimize for Z3-consistent derivations may be required rather than relying solely on next-token prediction.
The existence of a 190-item hard subset with closed model spaces suggests a practical route for constructing future adversarial benchmarks without exhaustive human re-annotation.

Load-bearing premise

The expert-audited natural-language items together with their reference formalizations supply reliable ground truth for what counts as correct logical reasoning in realistic situations.

What would settle it

A model that scores above 80 percent on the hard subset while still producing contradictions on equivalent problems when the surface wording is slightly altered or when the same premises are presented in a different order.

read the original abstract

Evaluating large language models (LLMs) on natural-language logical reasoning is essential because rule-governed tasks require conclusions to follow strictly from stated premises. Many existing logical-reasoning benchmarks are generated by templating natural-language items from sampled formulas, provide only coarse or unaudited formal annotations, and are now quickly saturated by frontier reasoning models. We present LLMEval-Logic, a Chinese logical reasoning benchmark built from realistic situational scenarios. Its pipeline forward-authors and expert-audits natural-language items together with their reference formalizations, verifies annotated answers with Z3, constructs expert rubrics for natural-to-formal grading, and hardens selected items through a closed-loop adversarial workflow. The benchmark is released in two paired subsets: a 246-item Base subset shipped with 1,400 expert-developed rubric atoms, and a 190-item Hard subset with 938 multi-step sub-questions over closed model spaces. Evaluating 14 frontier LLMs on LLMEval-Logic reveals substantial gaps in current models: the best model reaches only 37.5% Hard Item Accuracy, and even with reference symbols the highest joint Z3+Rubric formalization score among evaluated models reaches only 60.16%. Our benchmark is publicly available at https://github.com/llmeval/LLMEval-Logic.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This paper delivers a new Chinese logical reasoning benchmark that combines expert auditing, Z3 verification, rubrics, and adversarial hardening to expose real gaps in frontier models.

read the letter

The main thing here is a benchmark built from realistic Chinese scenarios rather than templated formulas. The pipeline authors items forward, has experts audit both the natural-language version and the reference formalization, runs Z3 to verify answers, creates detailed rubrics, and then runs a closed-loop adversarial process to toughen up the hard subset. That specific mix is new compared with earlier logical-reasoning sets, and the results are straightforward: the best model hits only 37.5% on the 190-item hard subset, and even with reference symbols the top joint Z3-plus-rubric score is 60.16%. The base set ships with 1,400 rubric atoms, which should make grading more reproducible than coarse labels in prior work. The authors also release the data and code, which is useful for anyone who wants to extend or re-run the evaluation. The central claim that current models still fall short on controlled logical reasoning holds up on the numbers they report. The main soft spot is that the abstract and summary give limited detail on inter-annotator agreement or exactly how the adversarial loop was run, so it is hard to judge whether any unintended bias crept into the final item selection. That said, the construction directly targets the usual problems of saturation and unaudited annotations, and nothing in the reported pipeline looks internally inconsistent. No load-bearing math or data-fitting issues appear. This is for groups working on LLM evaluation and reasoning benchmarks, especially those who care about non-English settings or want tighter formal checks. It is worth sending to a serious referee because the construction choices are explicit and the performance gaps are large enough to matter for the subfield.

Referee Report

0 major / 3 minor

Summary. The manuscript introduces LLMEval-Logic, a Chinese logical-reasoning benchmark constructed from realistic situational scenarios. Its pipeline forward-authors and expert-audits natural-language items together with reference formalizations, verifies answers via the Z3 solver, supplies expert rubrics for natural-to-formal grading, and applies closed-loop adversarial hardening. The released benchmark comprises a 246-item Base subset (with 1,400 rubric atoms) and a 190-item Hard subset (with 938 multi-step sub-questions). Evaluation of 14 frontier LLMs reports that the strongest model reaches only 37.5% Hard Item Accuracy and 60.16% joint Z3+Rubric formalization score.

Significance. If the construction pipeline is sound, the work supplies a publicly released, solver-verified benchmark that directly targets saturation and annotation weaknesses in prior logical-reasoning suites. The combination of Z3 verification, expert rubrics, and adversarial hardening constitutes a reproducible and falsifiable evaluation framework; the GitHub release further strengthens this contribution. The reported performance gaps provide concrete, quantitative evidence of current limitations in LLM logical reasoning on Chinese situational tasks and can usefully guide subsequent model development.

minor comments (3)

[§3.1] §3.1: the description of the expert-audit protocol would be clearer if it explicitly stated the number of annotators per item and any tie-breaking procedure.
[Table 1] Table 1: the column headers for rubric-atom counts could be aligned with the text definitions in §3.3 to avoid momentary ambiguity about what constitutes a 'multi-step sub-question'.
[§5.2] §5.2: the discussion of model performance would benefit from a brief note on whether any of the 14 evaluated models had been exposed to similar Chinese logical-reasoning data during pre-training.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the thorough and positive review. We are pleased that the referee recognizes the value of the solver-verified construction pipeline, the combination of Z3 verification with expert rubrics, the adversarial hardening process, and the quantitative evidence of current limitations in LLM logical reasoning on Chinese situational tasks. The recommendation to accept is appreciated.

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper constructs LLMEval-Logic via expert-audited natural-language items, reference formalizations, Z3 verification, expert rubrics, and closed-loop adversarial hardening, then directly measures performance of 14 external frontier LLMs on the resulting Base and Hard subsets. The headline results (37.5% Hard Item Accuracy and 60.16% joint Z3+Rubric score) are empirical observations on those external models using the benchmark as independent ground truth. No equations, fitted parameters, or self-citation chains reduce these measured accuracies to quantities defined by the paper's own inputs or prior work by the same authors. The derivation chain consists of benchmark construction followed by external evaluation and is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim that the benchmark reveals genuine gaps in LLM logical reasoning rests on the assumption that Z3 is a sound verifier for the formalized items and that expert auditing produces unbiased ground truth; no free parameters or invented entities are introduced in the abstract.

axioms (2)

standard math Z3 theorem prover correctly verifies the logical entailment in the annotated formulas
The pipeline relies on Z3 to confirm answers; this is a standard assumption when using an established solver.
domain assumption Expert-audited natural-language items and formalizations accurately represent logical reasoning tasks without systematic bias
The benchmark construction depends on the quality and neutrality of the expert auditing step described in the abstract.

pith-pipeline@v0.9.0 · 5829 in / 1535 out tokens · 43852 ms · 2026-05-20T05:44:12.708624+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

36 extracted references · 36 canonical work pages · 1 internal anchor

[1]

L ogic B ench: Towards Systematic Evaluation of Logical Reasoning Ability of Large Language Models

Parmar, Mihir and Patel, Nisarg and Varshney, Neeraj and Nakamura, Mutsumi and Luo, Man and Mashetty, Santosh and Mitra, Arindam and Baral, Chitta. L ogic B ench: Towards Systematic Evaluation of Logical Reasoning Ability of Large Language Models. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2024

work page 2024
[2]

FOLIO : Natural Language Reasoning with First-Order Logic

Han, Simeng and Schoelkopf, Hailey and Zhao, Yilun and Qi, Zhenting and Riddell, Martin and Zhou, Wenfei and Coady, James and Peng, David and Qiao, Yujie and Benson, Luke and Sun, Lucy and Wardle-Solano, Alexander and Szab \'o , Hannah and Zubova, Ekaterina and Burtell, Matthew and Fan, Jonathan and Liu, Yixin and Wong, Brian and Sailor, Malcolm and Ni, A...

work page doi:10.18653/v1/2024.emnlp-main.1229 2024
[3]

Transformers as Soft Reasoners over Language

Clark, Peter and Tafjord, Oyvind and Richardson, Kyle. Transformers as Soft Reasoners over Language. Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence ( IJCAI ). 2020

work page 2020
[4]

P roof W riter: Generating Implications, Proofs, and Abductive Statements over Natural Language

Tafjord, Oyvind and Dalvi, Bhavana and Clark, Peter. P roof W riter: Generating Implications, Proofs, and Abductive Statements over Natural Language. Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021. 2021

work page 2021
[5]

Zebralogic: On the scaling limits of llms for logical reasoning.arXiv preprint arXiv:2502.01100, 2025

Lin, Bill Yuchen and Bras, Ronan Le and Richardson, Kyle and Sabharwal, Ashish and Poovendran, Radha and Clark, Peter and Choi, Yejin. Z ebra L ogic: On the Scaling Limits of LLM s for Logical Reasoning. arXiv preprint arXiv:2502.01100. 2025

work page arXiv 2025
[6]

P - FOLIO : Evaluating and Improving Logical Reasoning with Abundant Human-Written Reasoning Chains

Han, Simeng and Yu, Aaron and Shen, Rui and Qi, Zhenting and Riddell, Martin and Zhou, Wenfei and Qiao, Yujie and Zhao, Yilun and Yavuz, Semih and Liu, Ye and Joty, Shafiq and Zhou, Yingbo and Xiong, Caiming and Radev, Dragomir and Ying, Rex and Cohan, Arman. P - FOLIO : Evaluating and Improving Logical Reasoning with Abundant Human-Written Reasoning Chai...

work page doi:10.18653/v1/2024.findings-emnlp.966 2024
[7]

Harnessing the Power of Large Language Models for Natural Language to First-Order Logic Translation

Yang, Yuan and Xiong, Siheng and Payani, Ali and Shareghi, Ehsan and Fekri, Faramarz. Harnessing the Power of Large Language Models for Natural Language to First-Order Logic Translation. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2024. doi:10.18653/v1/2024.acl-long.375

work page doi:10.18653/v1/2024.acl-long.375 2024
[8]

Entailment-Preserving First-order Logic Representations in Natural Language Entailment

Lee, Jinu and Liu, Qi and Ma, Runzhi and Han, Vincent and Wang, Ziqi and Ji, Heng and Hockenmaier, Julia. Entailment-Preserving First-order Logic Representations in Natural Language Entailment. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2025. doi:10.18653/v1/2025.acl-long.286

work page doi:10.18653/v1/2025.acl-long.286 2025
[9]

F o V er: First-Order Logic Verification for Natural Language Reasoning

Pei, Yu and Du, Yongping and Jin, Xingnan. F o V er: First-Order Logic Verification for Natural Language Reasoning. Transactions of the Association for Computational Linguistics. 2025. doi:10.1162/tacl.a.41

work page doi:10.1162/tacl.a.41 2025
[10]

D iv L ogic E val: A Framework for Benchmarking Logical Reasoning Evaluation in Large Language Models

Chung, Tsz Ting and Liu, Lemao and Yu, Mo and Yeung, Dit-Yan. D iv L ogic E val: A Framework for Benchmarking Logical Reasoning Evaluation in Large Language Models. Findings of the Association for Computational Linguistics: EMNLP 2025. 2025. doi:10.18653/v1/2025.findings-emnlp.47

work page doi:10.18653/v1/2025.findings-emnlp.47 2025
[11]

SATB ench: Benchmarking LLM s' Logical Reasoning via Automated Puzzle Generation from SAT Formulas

Wei, Anjiang and Wu, Yuheng and Wan, Yingjia and Suresh, Tarun and Tan, Huanmi and Zhou, Zhanke and Koyejo, Sanmi and Wang, Ke and Aiken, Alex. SATB ench: Benchmarking LLM s' Logical Reasoning via Automated Puzzle Generation from SAT Formulas. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. 2025. doi:10.18653/v1/202...

work page doi:10.18653/v1/2025.emnlp-main.1716 2025
[12]

Language Models Are Greedy Reasoners: A Systematic Formal Analysis of Chain-of-Thought

Saparov, Abulhair and He, He. Language Models Are Greedy Reasoners: A Systematic Formal Analysis of Chain-of-Thought. The Eleventh International Conference on Learning Representations. 2023

work page 2023
[13]

Testing the General Deductive Reasoning Capacity of Large Language Models Using OOD Examples

Saparov, Abulhair and Pang, Richard Yuanzhe and Padmakumar, Vishakh and Joshi, Nitish and Kazemi, Mehran and Kim, Najoung and He, He. Testing the General Deductive Reasoning Capacity of Large Language Models Using OOD Examples. Advances in Neural Information Processing Systems. 2023

work page 2023
[14]

Z3 : An Efficient SMT Solver

de Moura, Leonardo and Bj rner, Nikolaj. Z3 : An Efficient SMT Solver. Tools and Algorithms for the Construction and Analysis of Systems. 2008

work page 2008
[15]

Proceedings of the AAAI Conference on Artificial Intelligence , volume=

Llmeval: A preliminary study on how to evaluate large language models , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=

work page
[16]

LLME val- M ed: A Real-world Clinical Benchmark for Medical LLM s with Physician Validation

Zhang, Ming and Shen, Yujiong and Li, Zelin and Sha, Huayu and Hu, Binze and Wang, Yuhui and Huang, Chenhao and Liu, Shichun and Tong, Jingqi and Jiang, Changhao and Chai, Mingxu and Xi, Zhiheng and Dou, Shihan and Gui, Tao and Zhang, Qi and Huang, Xuanjing. LLME val- M ed: A Real-world Clinical Benchmark for Medical LLM s with Physician Validation. Findi...

work page doi:10.18653/v1/2025.findings-emnlp.263 2025
[17]

2025 , eprint=

LLMEval-Fair: A Large-Scale Longitudinal Study on Robust and Fair Evaluation of Large Language Models , author=. 2025 , eprint=

work page 2025
[18]

2026 , month = mar, url =

work page 2026
[19]

2026 , month = feb, url =

work page 2026
[20]

2026 , month = apr, url =

work page 2026
[21]

Biometrics , volume =

The Measurement of Observer Agreement for Categorical Data , author =. Biometrics , volume =. 1977 , publisher =

work page 1977
[22]

Thomas McCoy, Ellie Pavlick, and Tal Linzen

Tom McCoy and Ellie Pavlick and Tal Linzen , editor =. Right for the Wrong Reasons: Diagnosing Syntactic Heuristics in Natural Language Inference , booktitle =. 2019 , url =. doi:10.18653/V1/P19-1334 , timestamp =

work page doi:10.18653/v1/p19-1334 2019
[23]

Reasoning or Reciting? Exploring the Capabilities and Limitations of Language Models Through Counterfactual Tasks , booktitle =

Zhaofeng Wu and Linlu Qiu and Alexis Ross and Ekin Aky. Reasoning or Reciting? Exploring the Capabilities and Limitations of Language Models Through Counterfactual Tasks , booktitle =. 2024 , url =. doi:10.18653/V1/2024.NAACL-LONG.102 , timestamp =

work page doi:10.18653/v1/2024.naacl-long.102 2024
[24]

Rabe and Charles Staats and Mateja Jamnik and Christian Szegedy , editor =

Yuhuai Wu and Albert Qiaochu Jiang and Wenda Li and Markus N. Rabe and Charles Staats and Mateja Jamnik and Christian Szegedy , editor =. Autoformalization with Large Language Models , booktitle =. 2022 , url =

work page 2022
[25]

and Dahl, George

Bowman, Samuel R. and Dahl, George. What Will it Take to Fix Benchmarking in Natural Language Understanding?. Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2021. doi:10.18653/v1/2021.naacl-main.385

work page doi:10.18653/v1/2021.naacl-main.385 2021
[26]

Challenging

Suzgun, Mirac and Scales, Nathan and Sch. Challenging BIG -Bench Tasks and Whether Chain-of-Thought Can Solve Them. Findings of the Association for Computational Linguistics: ACL 2023. 2023. doi:10.18653/v1/2023.findings-acl.824

work page doi:10.18653/v1/2023.findings-acl.824 2023
[27]

Cook , editor =

Stephen A. Cook , editor =. The Complexity of Theorem-Proving Procedures , booktitle =. 1971 , url =. doi:10.1145/800157.805047 , timestamp =

work page doi:10.1145/800157.805047 1971
[28]

Turing, A. M. , title =. Proceedings of the London Mathematical Society , volume =. doi:https://doi.org/10.1112/plms/s2-42.1.230 , url =. https://londmathsoc.onlinelibrary.wiley.com/doi/pdf/10.1112/plms/s2-42.1.230 , year =

work page doi:10.1112/plms/s2-42.1.230
[29]

A Note on the

Alonzo Church , doi =. A Note on the. Journal of Symbolic Logic , number =

work page
[30]

Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers),

Ruiyang Xu and Jialun Cao and Yaojie Lu and Ming Wen and Hongyu Lin and Xianpei Han and Ben He and Shing. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers),. 2025 , url =

work page 2025
[31]

Mehran Kazemi and Bahare Fatemi and Hritik Bansal and John Palowitch and Chrysovalantis Anastasiou and Sanket Vaibhav Mehta and Lalit K. Jain and Virginia Aglietti and Disha Jindal and Peter Chen and Nishanth Dikkala and Gladys Tyen and Xin Liu and Uri Shalit and Silvia Chiappa and Kate Olszewska and Yi Tay and Vinh Q. Tran and Quoc V. Le and Orhan Firat ...

work page 2025
[32]

From Abstract to Contextual: What LLMs Still Cannot Do in Mathematics

Bowen Cao and Dongdong Zhang and Yixia Li and Junpeng Liu and Shijue Huang and Chufan Shi and Hongyuan Lu and Yaokang Wu and Guanhua Chen and Wai Lam and Furu Wei , title =. CoRR , volume =. 2026 , url =. doi:10.48550/ARXIV.2601.23048 , eprinttype =. 2601.23048 , timestamp =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2601.23048 2026
[33]

McIntosh and Teo Susnjak and Nalin A

Timothy R. McIntosh and Teo Susnjak and Nalin A. G. Arachchilage and Tong Liu and Dan Xu and Paul A. Watters and Malka N. Halgamuge , title =. 2026 , url =. doi:10.1109/TAI.2025.3569516 , timestamp =

work page doi:10.1109/tai.2025.3569516 2026
[34]

Su, Camillo Jose Taylor, and Dan Roth

Bowen Jiang and Yangxinyu Xie and Zhuoqun Hao and Xiaomeng Wang and Tanwi Mallick and Weijie Su and Camillo J. Taylor and Dan Roth , editor =. A Peek into Token Bias: Large Language Models Are Not Yet Genuine Reasoners , booktitle =. 2024 , url =. doi:10.18653/V1/2024.EMNLP-MAIN.272 , timestamp =

work page doi:10.18653/v1/2024.emnlp-main.272 2024
[35]

On Memorization of Large Language Models in Logical Reasoning , booktitle =

Chulin Xie and Yangsibo Huang and Chiyuan Zhang and Da Yu and Xinyun Chen and Bill Yuchen Lin and Bo Li and Badih Ghazi and Ravi Kumar , editor =. On Memorization of Large Language Models in Logical Reasoning , booktitle =. 2025 , url =

work page 2025
[36]

2025 , url =

Mihai Nadas and Laura Diosan and Andreea Tomescu , title =. 2025 , url =. doi:10.1109/ACCESS.2025.3589503 , timestamp =

work page doi:10.1109/access.2025.3589503 2025

[1] [1]

L ogic B ench: Towards Systematic Evaluation of Logical Reasoning Ability of Large Language Models

Parmar, Mihir and Patel, Nisarg and Varshney, Neeraj and Nakamura, Mutsumi and Luo, Man and Mashetty, Santosh and Mitra, Arindam and Baral, Chitta. L ogic B ench: Towards Systematic Evaluation of Logical Reasoning Ability of Large Language Models. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2024

work page 2024

[2] [2]

FOLIO : Natural Language Reasoning with First-Order Logic

Han, Simeng and Schoelkopf, Hailey and Zhao, Yilun and Qi, Zhenting and Riddell, Martin and Zhou, Wenfei and Coady, James and Peng, David and Qiao, Yujie and Benson, Luke and Sun, Lucy and Wardle-Solano, Alexander and Szab \'o , Hannah and Zubova, Ekaterina and Burtell, Matthew and Fan, Jonathan and Liu, Yixin and Wong, Brian and Sailor, Malcolm and Ni, A...

work page doi:10.18653/v1/2024.emnlp-main.1229 2024

[3] [3]

Transformers as Soft Reasoners over Language

Clark, Peter and Tafjord, Oyvind and Richardson, Kyle. Transformers as Soft Reasoners over Language. Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence ( IJCAI ). 2020

work page 2020

[4] [4]

P roof W riter: Generating Implications, Proofs, and Abductive Statements over Natural Language

Tafjord, Oyvind and Dalvi, Bhavana and Clark, Peter. P roof W riter: Generating Implications, Proofs, and Abductive Statements over Natural Language. Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021. 2021

work page 2021

[5] [5]

Zebralogic: On the scaling limits of llms for logical reasoning.arXiv preprint arXiv:2502.01100, 2025

Lin, Bill Yuchen and Bras, Ronan Le and Richardson, Kyle and Sabharwal, Ashish and Poovendran, Radha and Clark, Peter and Choi, Yejin. Z ebra L ogic: On the Scaling Limits of LLM s for Logical Reasoning. arXiv preprint arXiv:2502.01100. 2025

work page arXiv 2025

[6] [6]

P - FOLIO : Evaluating and Improving Logical Reasoning with Abundant Human-Written Reasoning Chains

Han, Simeng and Yu, Aaron and Shen, Rui and Qi, Zhenting and Riddell, Martin and Zhou, Wenfei and Qiao, Yujie and Zhao, Yilun and Yavuz, Semih and Liu, Ye and Joty, Shafiq and Zhou, Yingbo and Xiong, Caiming and Radev, Dragomir and Ying, Rex and Cohan, Arman. P - FOLIO : Evaluating and Improving Logical Reasoning with Abundant Human-Written Reasoning Chai...

work page doi:10.18653/v1/2024.findings-emnlp.966 2024

[7] [7]

Harnessing the Power of Large Language Models for Natural Language to First-Order Logic Translation

Yang, Yuan and Xiong, Siheng and Payani, Ali and Shareghi, Ehsan and Fekri, Faramarz. Harnessing the Power of Large Language Models for Natural Language to First-Order Logic Translation. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2024. doi:10.18653/v1/2024.acl-long.375

work page doi:10.18653/v1/2024.acl-long.375 2024

[8] [8]

Entailment-Preserving First-order Logic Representations in Natural Language Entailment

Lee, Jinu and Liu, Qi and Ma, Runzhi and Han, Vincent and Wang, Ziqi and Ji, Heng and Hockenmaier, Julia. Entailment-Preserving First-order Logic Representations in Natural Language Entailment. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2025. doi:10.18653/v1/2025.acl-long.286

work page doi:10.18653/v1/2025.acl-long.286 2025

[9] [9]

F o V er: First-Order Logic Verification for Natural Language Reasoning

Pei, Yu and Du, Yongping and Jin, Xingnan. F o V er: First-Order Logic Verification for Natural Language Reasoning. Transactions of the Association for Computational Linguistics. 2025. doi:10.1162/tacl.a.41

work page doi:10.1162/tacl.a.41 2025

[10] [10]

D iv L ogic E val: A Framework for Benchmarking Logical Reasoning Evaluation in Large Language Models

Chung, Tsz Ting and Liu, Lemao and Yu, Mo and Yeung, Dit-Yan. D iv L ogic E val: A Framework for Benchmarking Logical Reasoning Evaluation in Large Language Models. Findings of the Association for Computational Linguistics: EMNLP 2025. 2025. doi:10.18653/v1/2025.findings-emnlp.47

work page doi:10.18653/v1/2025.findings-emnlp.47 2025

[11] [11]

SATB ench: Benchmarking LLM s' Logical Reasoning via Automated Puzzle Generation from SAT Formulas

Wei, Anjiang and Wu, Yuheng and Wan, Yingjia and Suresh, Tarun and Tan, Huanmi and Zhou, Zhanke and Koyejo, Sanmi and Wang, Ke and Aiken, Alex. SATB ench: Benchmarking LLM s' Logical Reasoning via Automated Puzzle Generation from SAT Formulas. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. 2025. doi:10.18653/v1/202...

work page doi:10.18653/v1/2025.emnlp-main.1716 2025

[12] [12]

Language Models Are Greedy Reasoners: A Systematic Formal Analysis of Chain-of-Thought

Saparov, Abulhair and He, He. Language Models Are Greedy Reasoners: A Systematic Formal Analysis of Chain-of-Thought. The Eleventh International Conference on Learning Representations. 2023

work page 2023

[13] [13]

Testing the General Deductive Reasoning Capacity of Large Language Models Using OOD Examples

Saparov, Abulhair and Pang, Richard Yuanzhe and Padmakumar, Vishakh and Joshi, Nitish and Kazemi, Mehran and Kim, Najoung and He, He. Testing the General Deductive Reasoning Capacity of Large Language Models Using OOD Examples. Advances in Neural Information Processing Systems. 2023

work page 2023

[14] [14]

Z3 : An Efficient SMT Solver

de Moura, Leonardo and Bj rner, Nikolaj. Z3 : An Efficient SMT Solver. Tools and Algorithms for the Construction and Analysis of Systems. 2008

work page 2008

[15] [15]

Proceedings of the AAAI Conference on Artificial Intelligence , volume=

Llmeval: A preliminary study on how to evaluate large language models , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=

work page

[16] [16]

LLME val- M ed: A Real-world Clinical Benchmark for Medical LLM s with Physician Validation

Zhang, Ming and Shen, Yujiong and Li, Zelin and Sha, Huayu and Hu, Binze and Wang, Yuhui and Huang, Chenhao and Liu, Shichun and Tong, Jingqi and Jiang, Changhao and Chai, Mingxu and Xi, Zhiheng and Dou, Shihan and Gui, Tao and Zhang, Qi and Huang, Xuanjing. LLME val- M ed: A Real-world Clinical Benchmark for Medical LLM s with Physician Validation. Findi...

work page doi:10.18653/v1/2025.findings-emnlp.263 2025

[17] [17]

2025 , eprint=

LLMEval-Fair: A Large-Scale Longitudinal Study on Robust and Fair Evaluation of Large Language Models , author=. 2025 , eprint=

work page 2025

[18] [18]

2026 , month = mar, url =

work page 2026

[19] [19]

2026 , month = feb, url =

work page 2026

[20] [20]

2026 , month = apr, url =

work page 2026

[21] [21]

Biometrics , volume =

The Measurement of Observer Agreement for Categorical Data , author =. Biometrics , volume =. 1977 , publisher =

work page 1977

[22] [22]

Thomas McCoy, Ellie Pavlick, and Tal Linzen

Tom McCoy and Ellie Pavlick and Tal Linzen , editor =. Right for the Wrong Reasons: Diagnosing Syntactic Heuristics in Natural Language Inference , booktitle =. 2019 , url =. doi:10.18653/V1/P19-1334 , timestamp =

work page doi:10.18653/v1/p19-1334 2019

[23] [23]

Reasoning or Reciting? Exploring the Capabilities and Limitations of Language Models Through Counterfactual Tasks , booktitle =

Zhaofeng Wu and Linlu Qiu and Alexis Ross and Ekin Aky. Reasoning or Reciting? Exploring the Capabilities and Limitations of Language Models Through Counterfactual Tasks , booktitle =. 2024 , url =. doi:10.18653/V1/2024.NAACL-LONG.102 , timestamp =

work page doi:10.18653/v1/2024.naacl-long.102 2024

[24] [24]

Rabe and Charles Staats and Mateja Jamnik and Christian Szegedy , editor =

Yuhuai Wu and Albert Qiaochu Jiang and Wenda Li and Markus N. Rabe and Charles Staats and Mateja Jamnik and Christian Szegedy , editor =. Autoformalization with Large Language Models , booktitle =. 2022 , url =

work page 2022

[25] [25]

and Dahl, George

Bowman, Samuel R. and Dahl, George. What Will it Take to Fix Benchmarking in Natural Language Understanding?. Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2021. doi:10.18653/v1/2021.naacl-main.385

work page doi:10.18653/v1/2021.naacl-main.385 2021

[26] [26]

Challenging

Suzgun, Mirac and Scales, Nathan and Sch. Challenging BIG -Bench Tasks and Whether Chain-of-Thought Can Solve Them. Findings of the Association for Computational Linguistics: ACL 2023. 2023. doi:10.18653/v1/2023.findings-acl.824

work page doi:10.18653/v1/2023.findings-acl.824 2023

[27] [27]

Cook , editor =

Stephen A. Cook , editor =. The Complexity of Theorem-Proving Procedures , booktitle =. 1971 , url =. doi:10.1145/800157.805047 , timestamp =

work page doi:10.1145/800157.805047 1971

[28] [28]

Turing, A. M. , title =. Proceedings of the London Mathematical Society , volume =. doi:https://doi.org/10.1112/plms/s2-42.1.230 , url =. https://londmathsoc.onlinelibrary.wiley.com/doi/pdf/10.1112/plms/s2-42.1.230 , year =

work page doi:10.1112/plms/s2-42.1.230

[29] [29]

A Note on the

Alonzo Church , doi =. A Note on the. Journal of Symbolic Logic , number =

work page

[30] [30]

Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers),

Ruiyang Xu and Jialun Cao and Yaojie Lu and Ming Wen and Hongyu Lin and Xianpei Han and Ben He and Shing. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers),. 2025 , url =

work page 2025

[31] [31]

Mehran Kazemi and Bahare Fatemi and Hritik Bansal and John Palowitch and Chrysovalantis Anastasiou and Sanket Vaibhav Mehta and Lalit K. Jain and Virginia Aglietti and Disha Jindal and Peter Chen and Nishanth Dikkala and Gladys Tyen and Xin Liu and Uri Shalit and Silvia Chiappa and Kate Olszewska and Yi Tay and Vinh Q. Tran and Quoc V. Le and Orhan Firat ...

work page 2025

[32] [32]

From Abstract to Contextual: What LLMs Still Cannot Do in Mathematics

Bowen Cao and Dongdong Zhang and Yixia Li and Junpeng Liu and Shijue Huang and Chufan Shi and Hongyuan Lu and Yaokang Wu and Guanhua Chen and Wai Lam and Furu Wei , title =. CoRR , volume =. 2026 , url =. doi:10.48550/ARXIV.2601.23048 , eprinttype =. 2601.23048 , timestamp =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2601.23048 2026

[33] [33]

McIntosh and Teo Susnjak and Nalin A

Timothy R. McIntosh and Teo Susnjak and Nalin A. G. Arachchilage and Tong Liu and Dan Xu and Paul A. Watters and Malka N. Halgamuge , title =. 2026 , url =. doi:10.1109/TAI.2025.3569516 , timestamp =

work page doi:10.1109/tai.2025.3569516 2026

[34] [34]

Su, Camillo Jose Taylor, and Dan Roth

Bowen Jiang and Yangxinyu Xie and Zhuoqun Hao and Xiaomeng Wang and Tanwi Mallick and Weijie Su and Camillo J. Taylor and Dan Roth , editor =. A Peek into Token Bias: Large Language Models Are Not Yet Genuine Reasoners , booktitle =. 2024 , url =. doi:10.18653/V1/2024.EMNLP-MAIN.272 , timestamp =

work page doi:10.18653/v1/2024.emnlp-main.272 2024

[35] [35]

On Memorization of Large Language Models in Logical Reasoning , booktitle =

Chulin Xie and Yangsibo Huang and Chiyuan Zhang and Da Yu and Xinyun Chen and Bill Yuchen Lin and Bo Li and Badih Ghazi and Ravi Kumar , editor =. On Memorization of Large Language Models in Logical Reasoning , booktitle =. 2025 , url =

work page 2025

[36] [36]

2025 , url =

Mihai Nadas and Laura Diosan and Andreea Tomescu , title =. 2025 , url =. doi:10.1109/ACCESS.2025.3589503 , timestamp =

work page doi:10.1109/access.2025.3589503 2025