LogicBench: Towards Systematic Evaluation of Logical Reasoning Ability of Large Language Models

Parmar, Mihir, Patel, Nisarg, Varshney, Neeraj, Nakamura, Mutsumi, Luo, Man, Mashetty, Santosh · 2024 · DOI 10.18653/v1/2024.acl-long.739

12 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary: background 1

citation-polarity summary: background 1

representative citing papers

ALEE: Any-Language Evaluation of Embeddings via English-Centric Minimal Pairs

cs.CL · 2026-06-30 · unverdicted · novelty 7.0

ALEE generates AMR-based English minimal pairs with fine-grained semantic shifts, translates them, and evaluates embedding models on 275+ languages to expose cross-lingual gaps linked to training data and tokenization.

HOLMES: Evaluating Higher-Order Logical Reasoning in LLMs

cs.AI · 2026-06-22 · unverdicted · novelty 7.0

HOLMES is the first real-world benchmark for higher-order symbolic reasoning in LLMs, where models average 50.64% accuracy and the best reaches 59.54%.

Fixing FOLIO and MALLS: Verified Annotations and an LLM-assisted Framework to Focus Human Relabeling

cs.CL · 2026-06-01 · unverdicted · novelty 7.0

Audit finds 36-39% incorrect FOL labels in FOLIO and MALLS; corrections raise LLM accuracy 9-22 points and an LLM-guided review framework achieves 90% dataset quality after checking fewer than 24% of examples.

Truth or Sophistry? LoFa: A Benchmark for LLM Robustness Against Logical Fallacies

cs.CL · 2026-06-30 · unverdicted · novelty 6.0

LoFa is a new benchmark and LFR@k metric for measuring LLM resistance to sustained logical fallacy attacks via generated question-argument pairs and debate simulations.

ChLogic: Evaluating Robustness of Logical Reasoning in Chinese Expressions

cs.CL · 2026-06-16 · unverdicted · novelty 6.0

ChLogic benchmark shows persistent English-Chinese gaps in LLM logical reasoning performance, with back-translation effects varying by model and difficulty.

Adaptive Multi-Resolution Procedural Knowledge Compression for Large Language Models

cs.CL · 2026-06-10 · unverdicted · novelty 6.0

SKIM is an adaptive multi-resolution soft-token framework that compresses procedural skills while aiming to preserve logical dependencies and task performance better than prior compression methods.

MixRea: Benchmarking Explicit-Implicit Reasoning in Large Language Models

cs.CL · 2026-05-19 · unverdicted · novelty 6.0

MixRea benchmark reveals LLMs achieve at most 42.8% consistency on explicit-implicit reasoning tasks, with PRCP prompting proposed to recover overlooked relations.

GSM-SEM: Benchmark and Framework for Generating Semantically Variant Augmentations

cs.CL · 2026-05-08 · unverdicted · novelty 6.0 · 2 refs

GSM-SEM is a reusable framework for creating semantically variant augmentations of math benchmarks like GSM8K that alter facts but preserve answers and difficulty, with evaluations showing LLM performance drops of up to 28% on the new variants.

FinReasoning: A Hierarchical Benchmark for Reliable Financial Research Reporting

cs.CL · 2026-02-25 · unverdicted · novelty 6.0

FinReasoning is a hierarchical benchmark that decomposes LLM financial research capabilities into semantic consistency, data alignment, and deep insight, revealing model-type differences in auditing versus insight generation.

Early-Stage Product Line Validation Using LLMs: A Study on Semi-Formal Blueprint Analysis

cs.SE · 2026-04-22 · unverdicted · novelty 4.0

Reasoning-optimized LLMs achieve 88-89% accuracy on 16 feature model analysis operations applied to semi-formal textual blueprints, approaching solver-based FLAMA performance.

Position: Logical Soundness is not a Reliable Criterion for Neurosymbolic Fact-Checking with LLMs

cs.CL · 2026-04-05 · unverdicted · novelty 4.0

Logical soundness is not a reliable criterion for neurosymbolic fact-checking with LLMs because it systematically diverges from human pragmatic inferences.

LGMT: Logic-Grounded Metamorphic Testing for Evaluating the Reasoning Reliability of LLMs

cs.AI · 2026-05-12

citing papers explorer

ALEE: Any-Language Evaluation of Embeddings via English-Centric Minimal Pairs · cs.CL · 2026-06-30 · unverdicted · none · ref 61
HOLMES: Evaluating Higher-Order Logical Reasoning in LLMs · cs.AI · 2026-06-22 · unverdicted · none · ref 3
Fixing FOLIO and MALLS: Verified Annotations and an LLM-assisted Framework to Focus Human Relabeling · cs.CL · 2026-06-01 · unverdicted · none · ref 40
Truth or Sophistry? LoFa: A Benchmark for LLM Robustness Against Logical Fallacies · cs.CL · 2026-06-30 · unverdicted · none · ref 24
ChLogic: Evaluating Robustness of Logical Reasoning in Chinese Expressions · cs.CL · 2026-06-16 · unverdicted · none · ref 17
Adaptive Multi-Resolution Procedural Knowledge Compression for Large Language Models · cs.CL · 2026-06-10 · unverdicted · none · ref 64
MixRea: Benchmarking Explicit-Implicit Reasoning in Large Language Models · cs.CL · 2026-05-19 · unverdicted · none · ref 48
GSM-SEM: Benchmark and Framework for Generating Semantically Variant Augmentations · cs.CL · 2026-05-08 · unverdicted · none · ref 18 · 2 links
FinReasoning: A Hierarchical Benchmark for Reliable Financial Research Reporting · cs.CL · 2026-02-25 · unverdicted · none · ref 22
Early-Stage Product Line Validation Using LLMs: A Study on Semi-Formal Blueprint Analysis · cs.SE · 2026-04-22 · unverdicted · none · ref 31
Position: Logical Soundness is not a Reliable Criterion for Neurosymbolic Fact-Checking with LLMs · cs.CL · 2026-04-05 · unverdicted · none · ref 4
LGMT: Logic-Grounded Metamorphic Testing for Evaluating the Reasoning Reliability of LLMs · cs.AI · 2026-05-12 · unreviewed · ref 36
