LogicBench: Towards Systematic Evaluation of Logical Reasoning Ability of Large Language Models
12 Pith papers cite this work. Polarity classification is still indexing.
citation-role and citation-polarity summary
years: 2026 (12)
roles: background (1)
polarities: background (1)
citing papers explorer
- ALEE: Any-Language Evaluation of Embeddings via English-Centric Minimal Pairs
  ALEE generates AMR-based English minimal pairs with fine-grained semantic shifts, translates them, and evaluates embedding models on 275+ languages to expose cross-lingual gaps linked to training data and tokenization.
- HOLMES: Evaluating Higher-Order Logical Reasoning in LLMs
  HOLMES is the first real-world benchmark for higher-order symbolic reasoning in LLMs, where models average 50.64% accuracy and the best reaches 59.54%.
- Fixing FOLIO and MALLS: Verified Annotations and an LLM-assisted Framework to Focus Human Relabeling
  Audit finds 36-39% incorrect FOL labels in FOLIO and MALLS; corrections raise LLM accuracy 9-22 points and an LLM-guided review framework achieves 90% dataset quality after checking fewer than 24% of examples.
- Truth or Sophistry? LoFa: A Benchmark for LLM Robustness Against Logical Fallacies
  LoFa is a new benchmark and LFR@k metric for measuring LLM resistance to sustained logical fallacy attacks via generated question-argument pairs and debate simulations.
- ChLogic: Evaluating Robustness of Logical Reasoning in Chinese Expressions
  ChLogic benchmark shows persistent English-Chinese gaps in LLM logical reasoning performance, with back-translation effects varying by model and difficulty.
- Adaptive Multi-Resolution Procedural Knowledge Compression for Large Language Models
  SKIM is an adaptive multi-resolution soft-token framework that compresses procedural skills while aiming to preserve logical dependencies and task performance better than prior compression methods.
- MixRea: Benchmarking Explicit-Implicit Reasoning in Large Language Models
  MixRea benchmark reveals LLMs achieve at most 42.8% consistency on explicit-implicit reasoning tasks, with PRCP prompting proposed to recover overlooked relations.
- GSM-SEM: Benchmark and Framework for Generating Semantically Variant Augmentations
  GSM-SEM is a reusable framework for creating semantically variant augmentations of math benchmarks like GSM8K that alter facts but preserve answers and difficulty, with evaluations showing LLM performance drops of up to 28% on the new variants.
- FinReasoning: A Hierarchical Benchmark for Reliable Financial Research Reporting
  FinReasoning is a hierarchical benchmark that decomposes LLM financial research capabilities into semantic consistency, data alignment, and deep insight, revealing model-type differences in auditing versus insight generation.
- Early-Stage Product Line Validation Using LLMs: A Study on Semi-Formal Blueprint Analysis
  Reasoning-optimized LLMs achieve 88-89% accuracy on 16 feature model analysis operations applied to semi-formal textual blueprints, approaching solver-based FLAMA performance.
- Position: Logical Soundness is not a Reliable Criterion for Neurosymbolic Fact-Checking with LLMs
  Logical soundness is not a reliable criterion for neurosymbolic fact-checking with LLMs because it systematically diverges from human pragmatic inferences.
- LGMT: Logic-Grounded Metamorphic Testing for Evaluating the Reasoning Reliability of LLMs