Audit finds 36-39% incorrect FOL labels in FOLIO and MALLS; corrections raise LLM accuracy 9-22 points and an LLM-guided review framework achieves 90% dataset quality after checking fewer than 24% of examples.
Harnessing the Power of Large Language Models for Natural Language to First-Order Logic Translation
3 Pith papers cite this work.
Citing papers by year: 2026 (3)
Representative citing papers
LLMEval-Logic is a solver-verified Chinese logical reasoning benchmark with 246 base and 190 hard items that shows frontier LLMs reach only 37.5% hard-item accuracy and 60.16% joint formalization score.
GDPR Auto-Formalization with AI Agents and Human Verification
Multi-agent LLMs with human verification can generate formal representations of GDPR provisions, but structured oversight is required to handle legal nuances effectively.