LLMEval-Logic: A Solver-Verified Chinese Benchmark for Logical Reasoning of LLMs with Adversarial Hardening
Pith reviewed 2026-05-20 05:44 UTC · model grok-4.3
The pith
A new Chinese benchmark with Z3-verified formalizations shows frontier LLMs reach only 37.5 percent accuracy on its hardest logical-reasoning items.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LLMEval-Logic supplies 246 base items and 190 hard items, each accompanied by reference formalizations whose satisfiability is checked by Z3 and whose natural-to-formal mappings are scored against expert rubrics. The hard subset further decomposes each item into 938 multi-step sub-questions over closed model spaces. When fourteen frontier LLMs are tested, the highest hard-item accuracy observed is 37.5 percent and the highest joint formalization score is 60.16 percent, indicating that existing models cannot yet maintain strict logical entailment across realistic Chinese scenarios even under favorable prompting conditions.
What carries the argument
The closed-loop adversarial hardening pipeline that starts from expert-audited natural-language items and reference formalizations, verifies them with Z3, and iteratively generates harder variants while preserving logical closure.
If this is right
- Models must improve their ability to map natural-language premises to precise logical forms without external symbol guidance.
- Benchmark saturation will require new items that survive repeated adversarial refinement rather than templated generation.
- Joint solver-plus-rubric evaluation becomes a necessary complement to surface-level accuracy when judging logical competence.
- Chinese situational scenarios expose gaps that English-centric benchmarks may have masked.
- Progress on logical reasoning will be measurable only after models exceed the 37.5 percent hard-item threshold observed here.
Where Pith is reading between the lines
- The same pipeline could be applied to other languages or to domains such as legal or medical reasoning where strict entailment matters.
- If the low formalization scores persist, training objectives that directly optimize for Z3-consistent derivations may be required rather than relying solely on next-token prediction.
- The existence of a 190-item hard subset with closed model spaces suggests a practical route for constructing future adversarial benchmarks without exhaustive human re-annotation.
Load-bearing premise
The expert-audited natural-language items together with their reference formalizations supply reliable ground truth for what counts as correct logical reasoning in realistic situations.
What would settle it
A model that scores above 80 percent on the hard subset while still producing contradictions on equivalent problems when the surface wording is slightly altered or when the same premises are presented in a different order.
read the original abstract
Evaluating large language models (LLMs) on natural-language logical reasoning is essential because rule-governed tasks require conclusions to follow strictly from stated premises. Many existing logical-reasoning benchmarks are generated by templating natural-language items from sampled formulas, provide only coarse or unaudited formal annotations, and are now quickly saturated by frontier reasoning models. We present LLMEval-Logic, a Chinese logical reasoning benchmark built from realistic situational scenarios. Its pipeline forward-authors and expert-audits natural-language items together with their reference formalizations, verifies annotated answers with Z3, constructs expert rubrics for natural-to-formal grading, and hardens selected items through a closed-loop adversarial workflow. The benchmark is released in two paired subsets: a 246-item Base subset shipped with 1,400 expert-developed rubric atoms, and a 190-item Hard subset with 938 multi-step sub-questions over closed model spaces. Evaluating 14 frontier LLMs on LLMEval-Logic reveals substantial gaps in current models: the best model reaches only 37.5% Hard Item Accuracy, and even with reference symbols the highest joint Z3+Rubric formalization score among evaluated models reaches only 60.16%. Our benchmark is publicly available at https://github.com/llmeval/LLMEval-Logic.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces LLMEval-Logic, a Chinese logical-reasoning benchmark constructed from realistic situational scenarios. Its pipeline forward-authors and expert-audits natural-language items together with reference formalizations, verifies answers via the Z3 solver, supplies expert rubrics for natural-to-formal grading, and applies closed-loop adversarial hardening. The released benchmark comprises a 246-item Base subset (with 1,400 rubric atoms) and a 190-item Hard subset (with 938 multi-step sub-questions). Evaluation of 14 frontier LLMs reports that the strongest model reaches only 37.5% Hard Item Accuracy and 60.16% joint Z3+Rubric formalization score.
Significance. If the construction pipeline is sound, the work supplies a publicly released, solver-verified benchmark that directly targets saturation and annotation weaknesses in prior logical-reasoning suites. The combination of Z3 verification, expert rubrics, and adversarial hardening constitutes a reproducible and falsifiable evaluation framework; the GitHub release further strengthens this contribution. The reported performance gaps provide concrete, quantitative evidence of current limitations in LLM logical reasoning on Chinese situational tasks and can usefully guide subsequent model development.
minor comments (3)
- [§3.1] §3.1: the description of the expert-audit protocol would be clearer if it explicitly stated the number of annotators per item and any tie-breaking procedure.
- [Table 1] Table 1: the column headers for rubric-atom counts could be aligned with the text definitions in §3.3 to avoid momentary ambiguity about what constitutes a 'multi-step sub-question'.
- [§5.2] §5.2: the discussion of model performance would benefit from a brief note on whether any of the 14 evaluated models had been exposed to similar Chinese logical-reasoning data during pre-training.
Simulated Author's Rebuttal
We thank the referee for the thorough and positive review. We are pleased that the referee recognizes the value of the solver-verified construction pipeline, the combination of Z3 verification with expert rubrics, the adversarial hardening process, and the quantitative evidence of current limitations in LLM logical reasoning on Chinese situational tasks. The recommendation to accept is appreciated.
Circularity Check
No significant circularity detected
full rationale
The paper constructs LLMEval-Logic via expert-audited natural-language items, reference formalizations, Z3 verification, expert rubrics, and closed-loop adversarial hardening, then directly measures performance of 14 external frontier LLMs on the resulting Base and Hard subsets. The headline results (37.5% Hard Item Accuracy and 60.16% joint Z3+Rubric score) are empirical observations on those external models using the benchmark as independent ground truth. No equations, fitted parameters, or self-citation chains reduce these measured accuracies to quantities defined by the paper's own inputs or prior work by the same authors. The derivation chain consists of benchmark construction followed by external evaluation and is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (2)
- standard math Z3 theorem prover correctly verifies the logical entailment in the annotated formulas
- domain assumption Expert-audited natural-language items and formalizations accurately represent logical reasoning tasks without systematic bias
Reference graph
Works this paper leans on
-
[1]
L ogic B ench: Towards Systematic Evaluation of Logical Reasoning Ability of Large Language Models
Parmar, Mihir and Patel, Nisarg and Varshney, Neeraj and Nakamura, Mutsumi and Luo, Man and Mashetty, Santosh and Mitra, Arindam and Baral, Chitta. L ogic B ench: Towards Systematic Evaluation of Logical Reasoning Ability of Large Language Models. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2024
work page 2024
-
[2]
FOLIO : Natural Language Reasoning with First-Order Logic
Han, Simeng and Schoelkopf, Hailey and Zhao, Yilun and Qi, Zhenting and Riddell, Martin and Zhou, Wenfei and Coady, James and Peng, David and Qiao, Yujie and Benson, Luke and Sun, Lucy and Wardle-Solano, Alexander and Szab \'o , Hannah and Zubova, Ekaterina and Burtell, Matthew and Fan, Jonathan and Liu, Yixin and Wong, Brian and Sailor, Malcolm and Ni, A...
-
[3]
Transformers as Soft Reasoners over Language
Clark, Peter and Tafjord, Oyvind and Richardson, Kyle. Transformers as Soft Reasoners over Language. Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence ( IJCAI ). 2020
work page 2020
-
[4]
P roof W riter: Generating Implications, Proofs, and Abductive Statements over Natural Language
Tafjord, Oyvind and Dalvi, Bhavana and Clark, Peter. P roof W riter: Generating Implications, Proofs, and Abductive Statements over Natural Language. Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021. 2021
work page 2021
-
[5]
Lin, Bill Yuchen and Bras, Ronan Le and Richardson, Kyle and Sabharwal, Ashish and Poovendran, Radha and Clark, Peter and Choi, Yejin. Z ebra L ogic: On the Scaling Limits of LLM s for Logical Reasoning. arXiv preprint arXiv:2502.01100. 2025
-
[6]
P - FOLIO : Evaluating and Improving Logical Reasoning with Abundant Human-Written Reasoning Chains
Han, Simeng and Yu, Aaron and Shen, Rui and Qi, Zhenting and Riddell, Martin and Zhou, Wenfei and Qiao, Yujie and Zhao, Yilun and Yavuz, Semih and Liu, Ye and Joty, Shafiq and Zhou, Yingbo and Xiong, Caiming and Radev, Dragomir and Ying, Rex and Cohan, Arman. P - FOLIO : Evaluating and Improving Logical Reasoning with Abundant Human-Written Reasoning Chai...
-
[7]
Harnessing the Power of Large Language Models for Natural Language to First-Order Logic Translation
Yang, Yuan and Xiong, Siheng and Payani, Ali and Shareghi, Ehsan and Fekri, Faramarz. Harnessing the Power of Large Language Models for Natural Language to First-Order Logic Translation. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2024. doi:10.18653/v1/2024.acl-long.375
-
[8]
Entailment-Preserving First-order Logic Representations in Natural Language Entailment
Lee, Jinu and Liu, Qi and Ma, Runzhi and Han, Vincent and Wang, Ziqi and Ji, Heng and Hockenmaier, Julia. Entailment-Preserving First-order Logic Representations in Natural Language Entailment. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2025. doi:10.18653/v1/2025.acl-long.286
-
[9]
F o V er: First-Order Logic Verification for Natural Language Reasoning
Pei, Yu and Du, Yongping and Jin, Xingnan. F o V er: First-Order Logic Verification for Natural Language Reasoning. Transactions of the Association for Computational Linguistics. 2025. doi:10.1162/tacl.a.41
-
[10]
Chung, Tsz Ting and Liu, Lemao and Yu, Mo and Yeung, Dit-Yan. D iv L ogic E val: A Framework for Benchmarking Logical Reasoning Evaluation in Large Language Models. Findings of the Association for Computational Linguistics: EMNLP 2025. 2025. doi:10.18653/v1/2025.findings-emnlp.47
-
[11]
SATB ench: Benchmarking LLM s' Logical Reasoning via Automated Puzzle Generation from SAT Formulas
Wei, Anjiang and Wu, Yuheng and Wan, Yingjia and Suresh, Tarun and Tan, Huanmi and Zhou, Zhanke and Koyejo, Sanmi and Wang, Ke and Aiken, Alex. SATB ench: Benchmarking LLM s' Logical Reasoning via Automated Puzzle Generation from SAT Formulas. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. 2025. doi:10.18653/v1/202...
-
[12]
Language Models Are Greedy Reasoners: A Systematic Formal Analysis of Chain-of-Thought
Saparov, Abulhair and He, He. Language Models Are Greedy Reasoners: A Systematic Formal Analysis of Chain-of-Thought. The Eleventh International Conference on Learning Representations. 2023
work page 2023
-
[13]
Testing the General Deductive Reasoning Capacity of Large Language Models Using OOD Examples
Saparov, Abulhair and Pang, Richard Yuanzhe and Padmakumar, Vishakh and Joshi, Nitish and Kazemi, Mehran and Kim, Najoung and He, He. Testing the General Deductive Reasoning Capacity of Large Language Models Using OOD Examples. Advances in Neural Information Processing Systems. 2023
work page 2023
-
[14]
de Moura, Leonardo and Bj rner, Nikolaj. Z3 : An Efficient SMT Solver. Tools and Algorithms for the Construction and Analysis of Systems. 2008
work page 2008
-
[15]
Proceedings of the AAAI Conference on Artificial Intelligence , volume=
Llmeval: A preliminary study on how to evaluate large language models , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=
-
[16]
LLME val- M ed: A Real-world Clinical Benchmark for Medical LLM s with Physician Validation
Zhang, Ming and Shen, Yujiong and Li, Zelin and Sha, Huayu and Hu, Binze and Wang, Yuhui and Huang, Chenhao and Liu, Shichun and Tong, Jingqi and Jiang, Changhao and Chai, Mingxu and Xi, Zhiheng and Dou, Shihan and Gui, Tao and Zhang, Qi and Huang, Xuanjing. LLME val- M ed: A Real-world Clinical Benchmark for Medical LLM s with Physician Validation. Findi...
-
[17]
LLMEval-Fair: A Large-Scale Longitudinal Study on Robust and Fair Evaluation of Large Language Models , author=. 2025 , eprint=
work page 2025
-
[18]
2026 , month = mar, url =
work page 2026
-
[19]
2026 , month = feb, url =
work page 2026
-
[20]
2026 , month = apr, url =
work page 2026
-
[21]
The Measurement of Observer Agreement for Categorical Data , author =. Biometrics , volume =. 1977 , publisher =
work page 1977
-
[22]
Thomas McCoy, Ellie Pavlick, and Tal Linzen
Tom McCoy and Ellie Pavlick and Tal Linzen , editor =. Right for the Wrong Reasons: Diagnosing Syntactic Heuristics in Natural Language Inference , booktitle =. 2019 , url =. doi:10.18653/V1/P19-1334 , timestamp =
-
[23]
Zhaofeng Wu and Linlu Qiu and Alexis Ross and Ekin Aky. Reasoning or Reciting? Exploring the Capabilities and Limitations of Language Models Through Counterfactual Tasks , booktitle =. 2024 , url =. doi:10.18653/V1/2024.NAACL-LONG.102 , timestamp =
-
[24]
Rabe and Charles Staats and Mateja Jamnik and Christian Szegedy , editor =
Yuhuai Wu and Albert Qiaochu Jiang and Wenda Li and Markus N. Rabe and Charles Staats and Mateja Jamnik and Christian Szegedy , editor =. Autoformalization with Large Language Models , booktitle =. 2022 , url =
work page 2022
-
[25]
Bowman, Samuel R. and Dahl, George. What Will it Take to Fix Benchmarking in Natural Language Understanding?. Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2021. doi:10.18653/v1/2021.naacl-main.385
-
[26]
Suzgun, Mirac and Scales, Nathan and Sch. Challenging BIG -Bench Tasks and Whether Chain-of-Thought Can Solve Them. Findings of the Association for Computational Linguistics: ACL 2023. 2023. doi:10.18653/v1/2023.findings-acl.824
-
[27]
Stephen A. Cook , editor =. The Complexity of Theorem-Proving Procedures , booktitle =. 1971 , url =. doi:10.1145/800157.805047 , timestamp =
-
[28]
Turing, A. M. , title =. Proceedings of the London Mathematical Society , volume =. doi:https://doi.org/10.1112/plms/s2-42.1.230 , url =. https://londmathsoc.onlinelibrary.wiley.com/doi/pdf/10.1112/plms/s2-42.1.230 , year =
- [29]
-
[30]
Ruiyang Xu and Jialun Cao and Yaojie Lu and Ming Wen and Hongyu Lin and Xianpei Han and Ben He and Shing. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers),. 2025 , url =
work page 2025
-
[31]
Mehran Kazemi and Bahare Fatemi and Hritik Bansal and John Palowitch and Chrysovalantis Anastasiou and Sanket Vaibhav Mehta and Lalit K. Jain and Virginia Aglietti and Disha Jindal and Peter Chen and Nishanth Dikkala and Gladys Tyen and Xin Liu and Uri Shalit and Silvia Chiappa and Kate Olszewska and Yi Tay and Vinh Q. Tran and Quoc V. Le and Orhan Firat ...
work page 2025
-
[32]
From Abstract to Contextual: What LLMs Still Cannot Do in Mathematics
Bowen Cao and Dongdong Zhang and Yixia Li and Junpeng Liu and Shijue Huang and Chufan Shi and Hongyuan Lu and Yaokang Wu and Guanhua Chen and Wai Lam and Furu Wei , title =. CoRR , volume =. 2026 , url =. doi:10.48550/ARXIV.2601.23048 , eprinttype =. 2601.23048 , timestamp =
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2601.23048 2026
-
[33]
McIntosh and Teo Susnjak and Nalin A
Timothy R. McIntosh and Teo Susnjak and Nalin A. G. Arachchilage and Tong Liu and Dan Xu and Paul A. Watters and Malka N. Halgamuge , title =. 2026 , url =. doi:10.1109/TAI.2025.3569516 , timestamp =
-
[34]
Su, Camillo Jose Taylor, and Dan Roth
Bowen Jiang and Yangxinyu Xie and Zhuoqun Hao and Xiaomeng Wang and Tanwi Mallick and Weijie Su and Camillo J. Taylor and Dan Roth , editor =. A Peek into Token Bias: Large Language Models Are Not Yet Genuine Reasoners , booktitle =. 2024 , url =. doi:10.18653/V1/2024.EMNLP-MAIN.272 , timestamp =
-
[35]
On Memorization of Large Language Models in Logical Reasoning , booktitle =
Chulin Xie and Yangsibo Huang and Chiyuan Zhang and Da Yu and Xinyun Chen and Bill Yuchen Lin and Bo Li and Badih Ghazi and Ravi Kumar , editor =. On Memorization of Large Language Models in Logical Reasoning , booktitle =. 2025 , url =
work page 2025
-
[36]
Mihai Nadas and Laura Diosan and Andreea Tomescu , title =. 2025 , url =. doi:10.1109/ACCESS.2025.3589503 , timestamp =
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.