Harnessing the Power of Large Language Models for Natural Language to First-Order Logic Translation
2 Pith papers cite this work. Polarity classification is still indexing.
Pith papers citing it: 2
years: 2026 (2)
representative citing papers
Multi-agent LLMs with human verification can generate formal representations of GDPR provisions, but structured oversight is required to handle legal nuances effectively.
citing papers explorer
- LLMEval-Logic: A Solver-Verified Chinese Benchmark for Logical Reasoning of LLMs with Adversarial Hardening
  LLMEval-Logic is a solver-verified Chinese logical reasoning benchmark with 246 base and 190 hard items that shows frontier LLMs reach only 37.5% hard-item accuracy and 60.16% joint formalization score.
- GDPR Auto-Formalization with AI Agents and Human Verification
  Multi-agent LLMs with human verification can generate formal representations of GDPR provisions, but structured oversight is required to handle legal nuances effectively.