pith. sign in

arxiv: 2605.19597 · v1 · pith:I35OYP2Pnew · submitted 2026-05-19 · 💻 cs.CL

LLMEval-Logic: A Solver-Verified Chinese Benchmark for Logical Reasoning of LLMs with Adversarial Hardening

Pith reviewed 2026-05-20 05:44 UTC · model grok-4.3

classification 💻 cs.CL
keywords logical reasoning benchmarkLLM evaluationChinese languageZ3 solver verificationadversarial hardeningformalization scoringnatural language to logic
0
0 comments X

The pith

A new Chinese benchmark with Z3-verified formalizations shows frontier LLMs reach only 37.5 percent accuracy on its hardest logical-reasoning items.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces LLMEval-Logic, a benchmark of realistic situational scenarios in Chinese whose natural-language statements are paired with expert-audited formalizations. Items are verified for correctness by the Z3 solver and further hardened through an adversarial loop that generates challenging variants while keeping the underlying logic closed. Evaluation of fourteen frontier models on the resulting base and hard subsets documents clear performance ceilings, with the strongest model attaining 37.5 percent hard-item accuracy and 60.16 percent on joint Z3-plus-rubric formalization scoring even when reference symbols are supplied. A sympathetic reader would conclude that current LLMs still lack reliable mechanisms for deriving conclusions that follow strictly from stated premises in open-ended natural language.

Core claim

LLMEval-Logic supplies 246 base items and 190 hard items, each accompanied by reference formalizations whose satisfiability is checked by Z3 and whose natural-to-formal mappings are scored against expert rubrics. The hard subset further decomposes each item into 938 multi-step sub-questions over closed model spaces. When fourteen frontier LLMs are tested, the highest hard-item accuracy observed is 37.5 percent and the highest joint formalization score is 60.16 percent, indicating that existing models cannot yet maintain strict logical entailment across realistic Chinese scenarios even under favorable prompting conditions.

What carries the argument

The closed-loop adversarial hardening pipeline that starts from expert-audited natural-language items and reference formalizations, verifies them with Z3, and iteratively generates harder variants while preserving logical closure.

If this is right

  • Models must improve their ability to map natural-language premises to precise logical forms without external symbol guidance.
  • Benchmark saturation will require new items that survive repeated adversarial refinement rather than templated generation.
  • Joint solver-plus-rubric evaluation becomes a necessary complement to surface-level accuracy when judging logical competence.
  • Chinese situational scenarios expose gaps that English-centric benchmarks may have masked.
  • Progress on logical reasoning will be measurable only after models exceed the 37.5 percent hard-item threshold observed here.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same pipeline could be applied to other languages or to domains such as legal or medical reasoning where strict entailment matters.
  • If the low formalization scores persist, training objectives that directly optimize for Z3-consistent derivations may be required rather than relying solely on next-token prediction.
  • The existence of a 190-item hard subset with closed model spaces suggests a practical route for constructing future adversarial benchmarks without exhaustive human re-annotation.

Load-bearing premise

The expert-audited natural-language items together with their reference formalizations supply reliable ground truth for what counts as correct logical reasoning in realistic situations.

What would settle it

A model that scores above 80 percent on the hard subset while still producing contradictions on equivalent problems when the surface wording is slightly altered or when the same premises are presented in a different order.

read the original abstract

Evaluating large language models (LLMs) on natural-language logical reasoning is essential because rule-governed tasks require conclusions to follow strictly from stated premises. Many existing logical-reasoning benchmarks are generated by templating natural-language items from sampled formulas, provide only coarse or unaudited formal annotations, and are now quickly saturated by frontier reasoning models. We present LLMEval-Logic, a Chinese logical reasoning benchmark built from realistic situational scenarios. Its pipeline forward-authors and expert-audits natural-language items together with their reference formalizations, verifies annotated answers with Z3, constructs expert rubrics for natural-to-formal grading, and hardens selected items through a closed-loop adversarial workflow. The benchmark is released in two paired subsets: a 246-item Base subset shipped with 1,400 expert-developed rubric atoms, and a 190-item Hard subset with 938 multi-step sub-questions over closed model spaces. Evaluating 14 frontier LLMs on LLMEval-Logic reveals substantial gaps in current models: the best model reaches only 37.5% Hard Item Accuracy, and even with reference symbols the highest joint Z3+Rubric formalization score among evaluated models reaches only 60.16%. Our benchmark is publicly available at https://github.com/llmeval/LLMEval-Logic.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The manuscript introduces LLMEval-Logic, a Chinese logical-reasoning benchmark constructed from realistic situational scenarios. Its pipeline forward-authors and expert-audits natural-language items together with reference formalizations, verifies answers via the Z3 solver, supplies expert rubrics for natural-to-formal grading, and applies closed-loop adversarial hardening. The released benchmark comprises a 246-item Base subset (with 1,400 rubric atoms) and a 190-item Hard subset (with 938 multi-step sub-questions). Evaluation of 14 frontier LLMs reports that the strongest model reaches only 37.5% Hard Item Accuracy and 60.16% joint Z3+Rubric formalization score.

Significance. If the construction pipeline is sound, the work supplies a publicly released, solver-verified benchmark that directly targets saturation and annotation weaknesses in prior logical-reasoning suites. The combination of Z3 verification, expert rubrics, and adversarial hardening constitutes a reproducible and falsifiable evaluation framework; the GitHub release further strengthens this contribution. The reported performance gaps provide concrete, quantitative evidence of current limitations in LLM logical reasoning on Chinese situational tasks and can usefully guide subsequent model development.

minor comments (3)
  1. [§3.1] §3.1: the description of the expert-audit protocol would be clearer if it explicitly stated the number of annotators per item and any tie-breaking procedure.
  2. [Table 1] Table 1: the column headers for rubric-atom counts could be aligned with the text definitions in §3.3 to avoid momentary ambiguity about what constitutes a 'multi-step sub-question'.
  3. [§5.2] §5.2: the discussion of model performance would benefit from a brief note on whether any of the 14 evaluated models had been exposed to similar Chinese logical-reasoning data during pre-training.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the thorough and positive review. We are pleased that the referee recognizes the value of the solver-verified construction pipeline, the combination of Z3 verification with expert rubrics, the adversarial hardening process, and the quantitative evidence of current limitations in LLM logical reasoning on Chinese situational tasks. The recommendation to accept is appreciated.

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper constructs LLMEval-Logic via expert-audited natural-language items, reference formalizations, Z3 verification, expert rubrics, and closed-loop adversarial hardening, then directly measures performance of 14 external frontier LLMs on the resulting Base and Hard subsets. The headline results (37.5% Hard Item Accuracy and 60.16% joint Z3+Rubric score) are empirical observations on those external models using the benchmark as independent ground truth. No equations, fitted parameters, or self-citation chains reduce these measured accuracies to quantities defined by the paper's own inputs or prior work by the same authors. The derivation chain consists of benchmark construction followed by external evaluation and is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim that the benchmark reveals genuine gaps in LLM logical reasoning rests on the assumption that Z3 is a sound verifier for the formalized items and that expert auditing produces unbiased ground truth; no free parameters or invented entities are introduced in the abstract.

axioms (2)
  • standard math Z3 theorem prover correctly verifies the logical entailment in the annotated formulas
    The pipeline relies on Z3 to confirm answers; this is a standard assumption when using an established solver.
  • domain assumption Expert-audited natural-language items and formalizations accurately represent logical reasoning tasks without systematic bias
    The benchmark construction depends on the quality and neutrality of the expert auditing step described in the abstract.

pith-pipeline@v0.9.0 · 5829 in / 1535 out tokens · 43852 ms · 2026-05-20T05:44:12.708624+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

36 extracted references · 36 canonical work pages · 1 internal anchor

  1. [1]

    L ogic B ench: Towards Systematic Evaluation of Logical Reasoning Ability of Large Language Models

    Parmar, Mihir and Patel, Nisarg and Varshney, Neeraj and Nakamura, Mutsumi and Luo, Man and Mashetty, Santosh and Mitra, Arindam and Baral, Chitta. L ogic B ench: Towards Systematic Evaluation of Logical Reasoning Ability of Large Language Models. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2024

  2. [2]

    FOLIO : Natural Language Reasoning with First-Order Logic

    Han, Simeng and Schoelkopf, Hailey and Zhao, Yilun and Qi, Zhenting and Riddell, Martin and Zhou, Wenfei and Coady, James and Peng, David and Qiao, Yujie and Benson, Luke and Sun, Lucy and Wardle-Solano, Alexander and Szab \'o , Hannah and Zubova, Ekaterina and Burtell, Matthew and Fan, Jonathan and Liu, Yixin and Wong, Brian and Sailor, Malcolm and Ni, A...

  3. [3]

    Transformers as Soft Reasoners over Language

    Clark, Peter and Tafjord, Oyvind and Richardson, Kyle. Transformers as Soft Reasoners over Language. Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence ( IJCAI ). 2020

  4. [4]

    P roof W riter: Generating Implications, Proofs, and Abductive Statements over Natural Language

    Tafjord, Oyvind and Dalvi, Bhavana and Clark, Peter. P roof W riter: Generating Implications, Proofs, and Abductive Statements over Natural Language. Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021. 2021

  5. [5]

    Zebralogic: On the scaling limits of llms for logical reasoning.arXiv preprint arXiv:2502.01100, 2025

    Lin, Bill Yuchen and Bras, Ronan Le and Richardson, Kyle and Sabharwal, Ashish and Poovendran, Radha and Clark, Peter and Choi, Yejin. Z ebra L ogic: On the Scaling Limits of LLM s for Logical Reasoning. arXiv preprint arXiv:2502.01100. 2025

  6. [6]

    P - FOLIO : Evaluating and Improving Logical Reasoning with Abundant Human-Written Reasoning Chains

    Han, Simeng and Yu, Aaron and Shen, Rui and Qi, Zhenting and Riddell, Martin and Zhou, Wenfei and Qiao, Yujie and Zhao, Yilun and Yavuz, Semih and Liu, Ye and Joty, Shafiq and Zhou, Yingbo and Xiong, Caiming and Radev, Dragomir and Ying, Rex and Cohan, Arman. P - FOLIO : Evaluating and Improving Logical Reasoning with Abundant Human-Written Reasoning Chai...

  7. [7]

    Harnessing the Power of Large Language Models for Natural Language to First-Order Logic Translation

    Yang, Yuan and Xiong, Siheng and Payani, Ali and Shareghi, Ehsan and Fekri, Faramarz. Harnessing the Power of Large Language Models for Natural Language to First-Order Logic Translation. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2024. doi:10.18653/v1/2024.acl-long.375

  8. [8]

    Entailment-Preserving First-order Logic Representations in Natural Language Entailment

    Lee, Jinu and Liu, Qi and Ma, Runzhi and Han, Vincent and Wang, Ziqi and Ji, Heng and Hockenmaier, Julia. Entailment-Preserving First-order Logic Representations in Natural Language Entailment. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2025. doi:10.18653/v1/2025.acl-long.286

  9. [9]

    F o V er: First-Order Logic Verification for Natural Language Reasoning

    Pei, Yu and Du, Yongping and Jin, Xingnan. F o V er: First-Order Logic Verification for Natural Language Reasoning. Transactions of the Association for Computational Linguistics. 2025. doi:10.1162/tacl.a.41

  10. [10]

    D iv L ogic E val: A Framework for Benchmarking Logical Reasoning Evaluation in Large Language Models

    Chung, Tsz Ting and Liu, Lemao and Yu, Mo and Yeung, Dit-Yan. D iv L ogic E val: A Framework for Benchmarking Logical Reasoning Evaluation in Large Language Models. Findings of the Association for Computational Linguistics: EMNLP 2025. 2025. doi:10.18653/v1/2025.findings-emnlp.47

  11. [11]

    SATB ench: Benchmarking LLM s' Logical Reasoning via Automated Puzzle Generation from SAT Formulas

    Wei, Anjiang and Wu, Yuheng and Wan, Yingjia and Suresh, Tarun and Tan, Huanmi and Zhou, Zhanke and Koyejo, Sanmi and Wang, Ke and Aiken, Alex. SATB ench: Benchmarking LLM s' Logical Reasoning via Automated Puzzle Generation from SAT Formulas. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. 2025. doi:10.18653/v1/202...

  12. [12]

    Language Models Are Greedy Reasoners: A Systematic Formal Analysis of Chain-of-Thought

    Saparov, Abulhair and He, He. Language Models Are Greedy Reasoners: A Systematic Formal Analysis of Chain-of-Thought. The Eleventh International Conference on Learning Representations. 2023

  13. [13]

    Testing the General Deductive Reasoning Capacity of Large Language Models Using OOD Examples

    Saparov, Abulhair and Pang, Richard Yuanzhe and Padmakumar, Vishakh and Joshi, Nitish and Kazemi, Mehran and Kim, Najoung and He, He. Testing the General Deductive Reasoning Capacity of Large Language Models Using OOD Examples. Advances in Neural Information Processing Systems. 2023

  14. [14]

    Z3 : An Efficient SMT Solver

    de Moura, Leonardo and Bj rner, Nikolaj. Z3 : An Efficient SMT Solver. Tools and Algorithms for the Construction and Analysis of Systems. 2008

  15. [15]

    Proceedings of the AAAI Conference on Artificial Intelligence , volume=

    Llmeval: A preliminary study on how to evaluate large language models , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=

  16. [16]

    LLME val- M ed: A Real-world Clinical Benchmark for Medical LLM s with Physician Validation

    Zhang, Ming and Shen, Yujiong and Li, Zelin and Sha, Huayu and Hu, Binze and Wang, Yuhui and Huang, Chenhao and Liu, Shichun and Tong, Jingqi and Jiang, Changhao and Chai, Mingxu and Xi, Zhiheng and Dou, Shihan and Gui, Tao and Zhang, Qi and Huang, Xuanjing. LLME val- M ed: A Real-world Clinical Benchmark for Medical LLM s with Physician Validation. Findi...

  17. [17]

    2025 , eprint=

    LLMEval-Fair: A Large-Scale Longitudinal Study on Robust and Fair Evaluation of Large Language Models , author=. 2025 , eprint=

  18. [18]

    2026 , month = mar, url =

  19. [19]

    2026 , month = feb, url =

  20. [20]

    2026 , month = apr, url =

  21. [21]

    Biometrics , volume =

    The Measurement of Observer Agreement for Categorical Data , author =. Biometrics , volume =. 1977 , publisher =

  22. [22]

    Thomas McCoy, Ellie Pavlick, and Tal Linzen

    Tom McCoy and Ellie Pavlick and Tal Linzen , editor =. Right for the Wrong Reasons: Diagnosing Syntactic Heuristics in Natural Language Inference , booktitle =. 2019 , url =. doi:10.18653/V1/P19-1334 , timestamp =

  23. [23]

    Reasoning or Reciting? Exploring the Capabilities and Limitations of Language Models Through Counterfactual Tasks , booktitle =

    Zhaofeng Wu and Linlu Qiu and Alexis Ross and Ekin Aky. Reasoning or Reciting? Exploring the Capabilities and Limitations of Language Models Through Counterfactual Tasks , booktitle =. 2024 , url =. doi:10.18653/V1/2024.NAACL-LONG.102 , timestamp =

  24. [24]

    Rabe and Charles Staats and Mateja Jamnik and Christian Szegedy , editor =

    Yuhuai Wu and Albert Qiaochu Jiang and Wenda Li and Markus N. Rabe and Charles Staats and Mateja Jamnik and Christian Szegedy , editor =. Autoformalization with Large Language Models , booktitle =. 2022 , url =

  25. [25]

    and Dahl, George

    Bowman, Samuel R. and Dahl, George. What Will it Take to Fix Benchmarking in Natural Language Understanding?. Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2021. doi:10.18653/v1/2021.naacl-main.385

  26. [26]

    Challenging

    Suzgun, Mirac and Scales, Nathan and Sch. Challenging BIG -Bench Tasks and Whether Chain-of-Thought Can Solve Them. Findings of the Association for Computational Linguistics: ACL 2023. 2023. doi:10.18653/v1/2023.findings-acl.824

  27. [27]

    Cook , editor =

    Stephen A. Cook , editor =. The Complexity of Theorem-Proving Procedures , booktitle =. 1971 , url =. doi:10.1145/800157.805047 , timestamp =

  28. [28]

    Turing, A. M. , title =. Proceedings of the London Mathematical Society , volume =. doi:https://doi.org/10.1112/plms/s2-42.1.230 , url =. https://londmathsoc.onlinelibrary.wiley.com/doi/pdf/10.1112/plms/s2-42.1.230 , year =

  29. [29]

    A Note on the

    Alonzo Church , doi =. A Note on the. Journal of Symbolic Logic , number =

  30. [30]

    Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers),

    Ruiyang Xu and Jialun Cao and Yaojie Lu and Ming Wen and Hongyu Lin and Xianpei Han and Ben He and Shing. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers),. 2025 , url =

  31. [31]

    Mehran Kazemi and Bahare Fatemi and Hritik Bansal and John Palowitch and Chrysovalantis Anastasiou and Sanket Vaibhav Mehta and Lalit K. Jain and Virginia Aglietti and Disha Jindal and Peter Chen and Nishanth Dikkala and Gladys Tyen and Xin Liu and Uri Shalit and Silvia Chiappa and Kate Olszewska and Yi Tay and Vinh Q. Tran and Quoc V. Le and Orhan Firat ...

  32. [32]

    From Abstract to Contextual: What LLMs Still Cannot Do in Mathematics

    Bowen Cao and Dongdong Zhang and Yixia Li and Junpeng Liu and Shijue Huang and Chufan Shi and Hongyuan Lu and Yaokang Wu and Guanhua Chen and Wai Lam and Furu Wei , title =. CoRR , volume =. 2026 , url =. doi:10.48550/ARXIV.2601.23048 , eprinttype =. 2601.23048 , timestamp =

  33. [33]

    McIntosh and Teo Susnjak and Nalin A

    Timothy R. McIntosh and Teo Susnjak and Nalin A. G. Arachchilage and Tong Liu and Dan Xu and Paul A. Watters and Malka N. Halgamuge , title =. 2026 , url =. doi:10.1109/TAI.2025.3569516 , timestamp =

  34. [34]

    Su, Camillo Jose Taylor, and Dan Roth

    Bowen Jiang and Yangxinyu Xie and Zhuoqun Hao and Xiaomeng Wang and Tanwi Mallick and Weijie Su and Camillo J. Taylor and Dan Roth , editor =. A Peek into Token Bias: Large Language Models Are Not Yet Genuine Reasoners , booktitle =. 2024 , url =. doi:10.18653/V1/2024.EMNLP-MAIN.272 , timestamp =

  35. [35]

    On Memorization of Large Language Models in Logical Reasoning , booktitle =

    Chulin Xie and Yangsibo Huang and Chiyuan Zhang and Da Yu and Xinyun Chen and Bill Yuchen Lin and Bo Li and Badih Ghazi and Ravi Kumar , editor =. On Memorization of Large Language Models in Logical Reasoning , booktitle =. 2025 , url =

  36. [36]

    2025 , url =

    Mihai Nadas and Laura Diosan and Andreea Tomescu , title =. 2025 , url =. doi:10.1109/ACCESS.2025.3589503 , timestamp =