LogicBench: Towards Systematic Evaluation of Logical Reasoning Ability of Large Language Models
5 Pith papers cite this work.
citation-role summary
citation-polarity summary
years: 2026 (5)
roles: background (1)
polarities: background (1)
representative citing papers
FinReasoning is a hierarchical benchmark that decomposes LLM financial research capabilities into semantic consistency, data alignment, and deep insight, revealing model-type differences in auditing versus insight generation.
Reasoning-optimized LLMs achieve 88-89% accuracy on 16 feature model analysis operations applied to semi-formal textual blueprints, approaching solver-based FLAMA performance.
Logical soundness is not a reliable criterion for neurosymbolic fact-checking with LLMs because it systematically diverges from human pragmatic inferences.
citing papers explorer
- MixRea: Benchmarking Explicit-Implicit Reasoning in Large Language Models
MixRea benchmark reveals LLMs achieve at most 42.8% consistency on explicit-implicit reasoning tasks, with PRCP prompting proposed to recover overlooked relations.
- FinReasoning: A Hierarchical Benchmark for Reliable Financial Research Reporting
FinReasoning is a hierarchical benchmark that decomposes LLM financial research capabilities into semantic consistency, data alignment, and deep insight, revealing model-type differences in auditing versus insight generation.
- Early-Stage Product Line Validation Using LLMs: A Study on Semi-Formal Blueprint Analysis
Reasoning-optimized LLMs achieve 88-89% accuracy on 16 feature model analysis operations applied to semi-formal textual blueprints, approaching solver-based FLAMA performance.
- Position: Logical Soundness is not a Reliable Criterion for Neurosymbolic Fact-Checking with LLMs
Logical soundness is not a reliable criterion for neurosymbolic fact-checking with LLMs because it systematically diverges from human pragmatic inferences.
- GSM-SEM: Benchmark and Framework for Generating Semantically Variant Augmentations