LGMT: Logic-Grounded Metamorphic Testing for Evaluating the Reasoning Reliability of LLMs

Man Li; Weibin Lin; Xiaoke Fang; Xinyi Zhou; Zenghui Zhou; Zheng Zheng

arxiv: 2605.23965 · v2 · pith:CQDAFVWKnew · submitted 2026-05-12 · 💻 cs.AI · cs.LG · cs.SE

Pith reviewed 2026-06-30 22:12 UTC · model grok-4.3

classification 💻 cs.AI · cs.LG · cs.SE

keywords metamorphic testing · LLM reasoning evaluation · first-order logic · reasoning reliability · logical invariance · robustness testing

The pith

LGMT uses first-order logic equivalences to build test cases that check whether LLMs give consistent answers across logically equivalent questions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces LGMT, a framework that turns formal logical equivalences into metamorphic relations to generate multiple versions of the same reasoning problem. Traditional benchmarks compare a single answer against a reference and can miss cases where a model succeeds by luck or surface pattern rather than stable reasoning. LGMT instead checks whether the model produces consistent outputs on all versions, revealing defects when it does not. Experiments across six LLMs show these defects are common and often missed by reference-based tests, especially under changes to symbols or conclusions. The work concludes that evaluation should test robustness to logical invariance instead of isolated correctness.
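The consistency check described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: `is_consistent`, `toy_model`, and the two example prompts are hypothetical names and data introduced here, and the toy model is deliberately built to key on surface wording.

```python
from typing import Callable, List

Label = str  # e.g. "True", "False", "Unknown"


def is_consistent(model: Callable[[str], Label], variants: List[str]) -> bool:
    """Return True when the model assigns the same label to every variant."""
    labels = {model(v) for v in variants}
    return len(labels) == 1


def toy_model(prompt: str) -> Label:
    # Hypothetical stand-in for an LLM: it reacts to the surface word "All"
    # instead of the underlying logic, so equivalent prompts can diverge.
    return "True" if "All " in prompt else "Unknown"


# Two phrasings of the same problem; the second premise is the contrapositive
# of the first, so the correct answer is identical for both.
variants = [
    "Premise: All birds can fly. Tweety is a bird. Conclusion: Tweety can fly.",
    "Premise: Anything that cannot fly is not a bird. Tweety is a bird. "
    "Conclusion: Tweety can fly.",
]

print(is_consistent(toy_model, variants))  # prints False
```

No reference answer is needed for this check, which is what makes the approach oracle-free: a disagreement inside the group is a defect regardless of which label is correct.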

Core claim

LGMT derives metamorphic relations directly from first-order logic equivalences, constructs sets of semantically invariant test cases, and identifies reasoning defects by checking output consistency across each set. On six state-of-the-art LLMs the method uncovers substantial defects that reference-based evaluations overlook, with particular sensitivity to symbol-level and conclusion-level variations, and shows that Few-shot Chain-of-Thought prompting reduces but does not eliminate the inconsistencies.
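One way to read "defects that reference-based evaluations overlook" is a case where the model matches the reference label on the original question but changes its answer on an equivalent variant. The sketch below encodes that reading; it is an assumption made for illustration, the paper's own Hidden Defect Rate definition is not reproduced on this page, and `GroupResult` and the helper names are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GroupResult:
    reference: str               # ground-truth label for the source case
    source_output: str           # model's label on the original case
    followup_outputs: List[str]  # model's labels on the equivalent variants


def passes_reference_check(r: GroupResult) -> bool:
    """What a static benchmark sees: only the original case is scored."""
    return r.source_output == r.reference


def violates_invariance(r: GroupResult) -> bool:
    """What a metamorphic check sees: any variant that flips the answer."""
    return any(out != r.source_output for out in r.followup_outputs)


def is_hidden_defect(r: GroupResult) -> bool:
    # Assumed reading: correct on the benchmark item, unstable under
    # logically equivalent rewrites.
    return passes_reference_check(r) and violates_invariance(r)


example = GroupResult(reference="True", source_output="True",
                      followup_outputs=["True", "Unknown"])
print(is_hidden_defect(example))  # prints True
```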

What carries the argument

Metamorphic relations derived from first-order logic equivalences, used to generate invariant test cases and perform cross-case consistency checking.
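For orientation, these are textbook first-order logic equivalences of the kind such relations could be built from. They are standard results chosen here as examples; the page does not list which equivalences the paper actually uses.

```latex
\begin{align*}
\forall x\,\bigl(P(x) \rightarrow Q(x)\bigr) &\equiv \forall x\,\bigl(\neg Q(x) \rightarrow \neg P(x)\bigr) && \text{(contraposition)} \\
\neg \forall x\, P(x) &\equiv \exists x\, \neg P(x) && \text{(quantifier duality)} \\
\neg \bigl(P(a) \land Q(a)\bigr) &\equiv \neg P(a) \lor \neg Q(a) && \text{(De Morgan)} \\
P(a) \rightarrow Q(a) &\equiv \neg P(a) \lor Q(a) && \text{(material implication)} \\
\neg\neg P(a) &\equiv P(a) && \text{(double negation)}
\end{align*}
```

Rewriting a premise or conclusion with either side of any of these leaves the entailment relation between premises and conclusion unchanged, so the expected answer should not move.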

If this is right

LLMs remain sensitive to symbol-level and conclusion-level changes even when the underlying logic is preserved.
Few-shot Chain-of-Thought prompting reduces but does not remove the detected inconsistencies.
Evaluation of reasoning should shift from single-question correctness to checks of invariance under logical transformations.
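A symbol-level change can be illustrated as a consistent renaming of predicates and constants, which leaves the logical structure untouched. The sketch below is a hypothetical illustration: `rename_symbols`, the ASCII formula syntax, and the example mapping are assumptions made here, not the paper's MR-S definition.

```python
import re
from typing import Dict


def rename_symbols(formula: str, mapping: Dict[str, str]) -> str:
    """Rename whole-word symbols in one pass so that swaps are handled safely."""
    if len(set(mapping.values())) != len(mapping):
        # Two symbols collapsing into one would change the meaning.
        raise ValueError("mapping must be injective")
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, mapping)) + r")\b")
    return pattern.sub(lambda m: mapping[m.group(1)], formula)


premises = ["forall x. (Bird(x) -> CanFly(x))", "Bird(tweety)"]
conclusion = "CanFly(tweety)"

# New names must also not clash with symbols that are left unrenamed.
mapping = {"Bird": "P1", "CanFly": "P2", "tweety": "c1"}

renamed_premises = [rename_symbols(p, mapping) for p in premises]
renamed_conclusion = rename_symbols(conclusion, mapping)
print(renamed_premises, renamed_conclusion)
```

Because the renaming is a bijection on the symbols it touches, the renamed conclusion follows from the renamed premises exactly when the original one does. A model that answers differently on the two versions is reacting to the names.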

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same consistency-checking approach could be applied to other formal systems such as temporal or modal logic.
Training objectives that explicitly reward cross-equivalence consistency might reduce the defects LGMT detects.
The framework could be adapted to test robustness in non-strictly logical domains such as causal or probabilistic reasoning.

Load-bearing premise

That equivalences taken from first-order logic produce test cases whose meaning remains fixed enough to diagnose genuine reasoning defects rather than mere differences in phrasing or output style.
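The formal half of this premise is mechanically checkable. As a minimal sketch, the propositional skeleton of a candidate rewrite can be verified by exhaustive truth-table comparison. The helper names below are hypothetical, and the check covers propositional structure only; full first-order equivalences would need a theorem prover.

```python
from itertools import product
from typing import Callable


def equivalent(f: Callable[..., bool], g: Callable[..., bool], n_vars: int) -> bool:
    """Compare two Boolean functions on every assignment of n_vars variables."""
    return all(
        f(*values) == g(*values)
        for values in product([False, True], repeat=n_vars)
    )


def implies(a: bool, b: bool) -> bool:
    return (not a) or b


# Contraposition preserves meaning: (p -> q) vs (not q -> not p).
contrapositive_ok = equivalent(
    lambda p, q: implies(p, q), lambda p, q: implies(not q, not p), 2
)

# The converse does not: (p -> q) vs (q -> p).
converse_ok = equivalent(
    lambda p, q: implies(p, q), lambda p, q: implies(q, p), 2
)

print(contrapositive_ok, converse_ok)  # prints True False
```

What such a check cannot settle is the other half of the premise: whether the natural-language rendering of each formula reads as equivalent to a human or to a model.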

What would settle it

A collection of logical equivalences on which every tested LLM produces fully consistent and correct answers, yet the same models still commit clear reasoning errors on equivalent problems outside the chosen relations.

Figures

Figures reproduced from arXiv: 2605.23965 by Man Li, Weibin Lin, Xiaoke Fang, Xinyi Zhou, Zenghui Zhou, Zheng Zheng.

**Figure 1.** In the first query, the model is presented with a set of premises and a conclusion. The model answers “Unknown”, correctly aligning with the ground-truth label to indicate it cannot derive the conclusion from the given premises. In the second query, we modify only a single premise by applying an equivalence transformation. Specifically, the rule stating that “all social media applications have chat featur…

**Figure 2.** Demonstrates how lexical MRs induce semantic drift. First, we establish a baseline where the model correctly deduces the conclusion (“I own a car.”) from a specific premise (“My car has four wheels.”), returning an accurate judgment of “Yes”. Then, a standard lexical MR modifies the premise by replacing the target noun “car” with its broader hypernym, “vehicle,” while keeping the conclusion unchanged [18] …

**Figure 3.** Overview of the LGMT Framework. represented in both natural language (premises and conclusion) and its corresponding FOL form, which serves as the basis for subsequent transformations. (2) Logic-Grounded MR Design. We define a set of MRs grounded in first-order logic, including formula-level (MR-E), symbol-level (MR-S), premise-level (MR-P), and conclusion-level (MR-C) transformations. These MRs are forma…

**Figure 4.** Average MVR across MR categories. MR-C and MR-S induce the highest inconsistency, while MR-P shows substantially lower sensitivity.

**Figure 5.** Hidden Defect Rate (HDR) across models and prompting strategies.

**Figure 6.** False Unreported Rate (FUR) across models and prompting strategies. Finding 4.4: Few-shot CoT generally achieves the lowest FUR, but non-trivial blind spots remain.

Original abstract

Large Language Models (LLMs) achieve strong performance on logical reasoning benchmarks, yet their reliability remains uncertain. Existing evaluations rely on static benchmarks, which fail to assess robustness under logically equivalent transformations and often overestimate reasoning capability. We propose LGMT (Logic-Grounded Metamorphic Testing), an oracle-free framework that leverages first-order logic (FOL) to evaluate LLM reasoning. By deriving metamorphic relations from formal logical equivalences, LGMT constructs semantically invariant test cases and detects reasoning defects through cross-case consistency checking. Experiments on six state-of-the-art LLMs show that LGMT exposes substantial hidden defects missed by traditional reference-based evaluations. We further find that models are particularly sensitive to symbol-level and conclusion-level variations, and that advanced prompting such as Few-shot CoT only partially mitigates these issues. These results suggest that LLM evaluation should move beyond isolated correctness toward robustness under logical invariance. LGMT provides a principled and scalable approach for diagnosing reasoning failures.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

LGMT gives a concrete way to generate logic-based test pairs for checking LLM consistency on reasoning, but the reported defects could still be driven by surface token changes rather than failures of logic.

The core idea here is to take first-order logic equivalences, turn them into metamorphic relations, and then measure whether an LLM gives consistent answers across those equivalent prompts. That is new enough in the LLM evaluation space; most prior metamorphic testing work on LLMs has used simpler syntactic or semantic-preserving rewrites without the explicit FOL grounding. The experiments on six models do show that standard accuracy numbers look better than the consistency numbers under these transformations, and that symbol swaps and conclusion rephrasing are particularly disruptive. That finding is worth knowing.

The soft spot is exactly the one flagged under the load-bearing premise. Even if two prompts are logically equivalent, the token sequences and attention patterns can differ enough that inconsistency might reflect sensitivity to surface form rather than a reasoning defect. The abstract gives no sign of an ablation that holds surface features fixed while varying only the logical structure, nor any human rating of semantic equivalence, nor any check that the chosen FOL relations actually produce near-identical embeddings. Without those controls the central claim that LGMT diagnoses genuine reasoning failures is not yet secured.
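A crude first step toward the surface-feature control mentioned above would be to measure how much each variant differs lexically from its source, then compare inconsistency rates across high-overlap and low-overlap pairs. The sketch below uses token-set Jaccard similarity. It is an illustration suggested here, not something the paper is reported to do, and `tokens` and `jaccard` are hypothetical helper names.

```python
import re
from typing import Set


def tokens(text: str) -> Set[str]:
    """Lowercased alphanumeric tokens, used as a rough proxy for surface form."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Token-set overlap in [0, 1]; 1.0 means identical vocabularies."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


source = "All birds can fly."
variant = "Anything that cannot fly is not a bird."
print(round(jaccard(source, variant), 2))
```

If inconsistencies concentrate in low-overlap pairs, surface sensitivity becomes the more likely explanation. If they persist at high overlap, the reasoning-defect reading gains support.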

The method itself is externally grounded in standard logic rather than fitted parameters, which is a plus. The citation pattern looks standard for the area. The work is aimed at people building or auditing LLM reasoning systems who need something more than static benchmarks. It is coherent on its own terms and deserves a serious referee who can press on the invariance question and ask for the missing implementation details. I would send it to review rather than desk-reject.

Referee Report

2 major / 1 minor

Summary. The paper proposes LGMT, an oracle-free metamorphic testing framework that derives relations from first-order logic equivalences to generate semantically invariant test cases, then detects LLM reasoning defects via cross-case consistency. Experiments on six state-of-the-art LLMs are reported to expose substantial hidden defects missed by static reference-based benchmarks; models are especially sensitive to symbol-level and conclusion-level variations, and Few-shot CoT only partially mitigates the issues. The work concludes that evaluation should shift from isolated correctness to robustness under logical invariance.

Significance. If the metamorphic relations are shown to isolate genuine reasoning failures rather than surface sensitivity, LGMT would supply a scalable, logic-grounded alternative to static benchmarks and could materially improve diagnosis of LLM reasoning reliability.

major comments (2)

[Abstract (central claim and experiments paragraph)] The central claim that output inconsistency diagnoses reasoning defects (rather than token/attention-level surface effects) rests on the unverified assumption that FOL equivalences produce test cases whose semantic invariance is tight enough for LLMs; the abstract provides no quantitative validation such as human equivalence ratings or surface-feature ablations to separate these explanations.
[Abstract (experiments paragraph)] Without details on the exact metamorphic relations, the consistency metric, dataset construction, or statistical controls, it is impossible to determine whether the reported defects support the claim that LGMT outperforms traditional evaluations.

minor comments (1)

[Abstract] The abstract refers to 'six state-of-the-art LLMs' and 'advanced prompting such as Few-shot CoT' without naming the models or specifying the prompting variants and baselines used.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the detailed feedback on the abstract. We address each major comment below with references to the full manuscript, which provides the supporting methodology and experiments.

Referee: [Abstract (central claim and experiments paragraph)] The central claim that output inconsistency diagnoses reasoning defects (rather than token/attention-level surface effects) rests on the unverified assumption that FOL equivalences produce test cases whose semantic invariance is tight enough for LLMs; the abstract provides no quantitative validation such as human equivalence ratings or surface-feature ablations to separate these explanations.

Authors: The metamorphic relations are derived directly from established first-order logic equivalences (detailed in Section 3.1 and Table 1), which are mathematically guaranteed to preserve semantic meaning. This formal grounding, rather than empirical human ratings, underpins the claim that inconsistencies indicate reasoning defects. Section 5 includes controls that isolate logical variations (e.g., symbol and conclusion changes) while holding surface features constant where possible, and reports statistically significant differences across six LLMs. We maintain that the logical foundation suffices without additional human studies for the core argument.

Revision: no

Referee: [Abstract (experiments paragraph)] Without details on the exact metamorphic relations, the consistency metric, dataset construction, or statistical controls, it is impossible to determine whether the reported defects support the claim that LGMT outperforms traditional evaluations.

Authors: The abstract summarizes the approach at a high level due to length constraints. Full details appear in the manuscript: metamorphic relations are enumerated in Section 3 and Table 1; the consistency metric (cross-case agreement rate) is defined in Section 4.2; dataset construction from logical templates is described in Section 4.1; and statistical controls (multiple runs, significance testing) are reported in Section 5. These elements support the finding that LGMT detects defects missed by static benchmarks, with sensitivity analyses for symbol-level and conclusion-level variations.

Revision: no
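The manuscript's definition of the cross-case agreement rate is not reproduced on this page. One plausible reading, sketched below under that assumption, is the fraction of source/follow-up pairs on which the model returns the same label, broken down by MR category. The record layout, the function name, and the toy records are hypothetical and are not results from the paper.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# (MR category, label on source case, label on follow-up case)
Record = Tuple[str, str, str]


def agreement_by_category(records: Iterable[Record]) -> Dict[str, float]:
    """Share of source/follow-up pairs with matching labels, per MR category."""
    agree: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for category, source_label, followup_label in records:
        total[category] += 1
        if source_label == followup_label:
            agree[category] += 1
    return {category: agree[category] / total[category] for category in total}


# Toy records for illustration only.
records = [
    ("MR-S", "Unknown", "True"),
    ("MR-S", "True", "True"),
    ("MR-C", "False", "True"),
    ("MR-P", "True", "True"),
]
rates = agreement_by_category(records)
print(rates)
```

A per-category violation rate is then one minus the agreement rate, which is the kind of breakdown a comparison across MR-E, MR-S, MR-P, and MR-C would need.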

standing simulated objections not resolved

Additional quantitative validation such as new human equivalence ratings or expanded surface-feature ablations would require experiments outside the current manuscript scope.

Circularity Check

0 steps flagged

No circularity detected in LGMT derivation

The LGMT framework derives metamorphic relations directly from standard first-order logic equivalences, which are external mathematical facts independent of the paper. No equations, fitted parameters, self-citations, or ansatzes are presented that reduce any claimed prediction or result to the method's own inputs by construction. The approach is self-contained against external benchmarks in logic, and the empirical evaluation on LLMs does not involve renaming known results or load-bearing self-references. This is the normal honest outcome for a method grounded in established formal logic.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Relies on standard properties of first-order logic; no free parameters, invented entities, or ad-hoc axioms visible in abstract.

axioms (1)

standard math: First-order logic equivalences define semantically invariant transformations.
Invoked to generate metamorphic relations from formal logical equivalences.

Reference graph

Works this paper leans on

86 extracted references · 35 canonical work pages · 6 internal anchors

[1] Cao, C., Li, M., Dai, J., Yang, J., Zhao, Z., Zhang, S., et al., 2025. Towards advanced mathematical reasoning for llms via first-order logic theorem proving, in: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics. pp. 12429–12449. doi:10.18653/v1/2025.emnlp-main.628
[2] Chen, S., Jin, S., Xie, X., 2021. Testing your question answering software via asking recursively, in: Proceedings of the International Conference on Automated Software Engineering, pp. 104–116. doi:10.1109/ASE51524.2021.9678670
[3] Chen, T.Y., Cheung, S.C., Yiu, S.M., 1998. Metamorphic testing: A new approach for generating next test cases. Technical Report HKUST-CS98-01. Hong Kong University of Science and Technology
[4] Chen, T.Y., Kuo, F.C., Liu, H., Poon, P.L., Towey, D., Tse, T.H., et al. Metamorphic testing: a review of challenges and opportunities. ACM Computing Surveys 51, 1–27. doi:10.1145/3143561
[6] Cho, S., Ruberto, S., Terragni, V., 2025. Metamorphic testing of large language models for natural language processing. doi:10.48550/arXiv.2511.02108
[7] Clark, P., Cowhey, I., Etzioni, O., Khot, T., Sabharwal, A., Schoenick, C., et al., 2018. Think you have solved question answering? try arc, the ai2 reasoning challenge. doi:10.48550/ARXIV.1803.05457
[8] Clark, P., Tafjord, O., Richardson, K., 2021. Transformers as soft reasoners over language, in: Proceedings of the International Joint Conference on Artificial Intelligence, pp. 3882–3890
[9] Cochran, W.G., 1968. Errors of measurement in statistics. Technometrics 10, 637–666. doi:10.1080/00401706.1968.10490621
[10] Dalvi, B., Jansen, P., Tafjord, O., Xie, Z., Smith, H., Pipatanangkura, L., et al., 2021. Explaining answers with entailment trees, in: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics. pp. 7358–7370. doi:10.18653/v1/2021.emnlp-main.585
[11] DeepSeek, Liu, A., Feng, B., Xue, B., Wang, B., Wu, B., et al., 2025. Deepseek-v3 technical report. doi:10.48550/arXiv.2412.19437
[12] DeepSeek-AI, Zhu, Q., Guo, D., Shao, Z., Yang, D., Wang, P., et al. Deepseek-coder-v2: Breaking the barrier of closed-source models in code intelligence. doi:10.48550/arXiv.2406.11931
[14] Dziri, N., Lu, X., Sclar, M., Li, X.L., Jiang, L., Lin, B.Y., et al. Faith and fate: limits of transformers on compositionality, in: Proceedings of the International Conference on Neural Information Processing Systems, pp. 70293–70332
[16] Ghosh, B., Hasan, S., Arafat, N.A., Khan, A., 2025. Logical consistency of large language models in fact-checking, in: The Thirteenth International Conference on Learning Representations. URL: https://openreview.net/forum?id=SimlDuN0YT
[17] Guan, Y., Wang, D., Chu, Z., Wang, S., Ni, F., Song, R., et al., 2023. Intelligent virtual assistants with llm-based process automation. URL: https://arxiv.org/abs/2312.06677, arXiv:2312.06677
[18] Han, S., Schoelkopf, H., Zhao, Y., Qi, Z., Riddell, M., Zhou, W., et al., 2024. Folio: natural language reasoning with first-order logic, in: Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 22017–22031. doi:10.18653/v1/2024.emnlp-main.1229
[19] Holliday, W.H., Mandelkern, M., Zhang, C.E., 2024. Conditional and modal reasoning in large language models, in: Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 3800–3821. doi:10.18653/v1/2024.emnlp-main.222
[20] Huang, L., Yu, W., Ma, W., Zhong, W., Feng, Z., Wang, H., et al., 2025. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. ACM Transactions on Information Systems 43, 1–55. doi:10.1145/3703155
[22] Hyun, S., Guo, M., Babar, M.A., 2024. Metal: metamorphic testing framework for analyzing large-language model qualities, in: Proceedings of the IEEE Conference on Software Testing, Verification and Validation, IEEE. pp. 117–128. doi:10.1109/ICST60714.2024.00019
[23] Jiang, J., Wang, J., Yan, Y., Liu, Y., Zhu, J., Zhang, M., et al., 2025. Do large language models excel in complex logical reasoning with formal language?, in: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 16889–16914. doi:10.18653/v1/2025.emnlp-main.855
[24] Jimenez, C.E., Yang, J., Wettig, A., Yao, S., Pei, K., Press, O., et al. Swe-bench: Can language models resolve real-world github issues? doi:10.48550/arXiv.2310.06770
[26] Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., et al., 2020. Retrieval-augmented generation for knowledge-intensive nlp tasks, in: Advances in Neural Information Processing Systems, Curran Associates, Inc. pp. 9459–9474
[27] Li, N., Li, Y., Liu, Y., Shi, L., Wang, K., Wang, H., 2024. Drowzee: metamorphic testing for fact-conflicting hallucination detection in large language models. Proceedings of the ACM on Programming Languages 8, 1843–1872. doi:10.1145/3689776
[28] Liu, H., Ding, Y., Fu, Z., Zhang, C., Liu, X., Zhang, Y., 2025. Evaluating the logical reasoning abilities of large reasoning models. doi:10.48550/arXiv.2505.11854
[29] Liu, J., Cui, L., Liu, H., Huang, D., Wang, Y., Zhang, Y., 2021. Logiqa: a challenge dataset for machine reading comprehension with logical reasoning, in: Proceedings of the International Joint Conference on Artificial Intelligence, pp. 3622–3628. URL: https://dl.acm.org/doi/10.5555/3491440.3491941
[30] Luo, M., Kumbhar, S., Shen, M., Parmar, M., Varshney, N., Banerjee, P., et al., 2023. Towards logiglue: A brief survey and a benchmark for analyzing logical reasoning capabilities of language models. doi:10.48550/ARXIV.2310.00836
[31] Mondorf, P., Plank, B., 2024. Beyond accuracy: Evaluating the reasoning behavior of large language models - a survey, in: First Conference on Language Modeling. URL: https://openreview.net/forum?id=Lmjgl2n11u
[32] Murphy, C., Kaiser, G.E., Hu, L., Wu, L., 2008. Properties of machine learning applications for use in metamorphic testing, in: Proceedings of the International Conference on Software Engineering & Knowledge Engineering, Knowledge Systems Institute Graduate School. pp. 867–872
[33] Olausson, T., Gu, A., Lipkin, B., Zhang, C., Solar-Lezama, A., Tenenbaum, J., et al., 2023. Linc: a neurosymbolic approach for logical reasoning by combining language models with first-order logic provers, in: Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 5153–5176. doi:10.18653/v1/2023.emnlp-main.313
[34] Pan, L., Albalak, A., Wang, X., Wang, W.Y., 2023. Logic-lm: Empowering large language models with symbolic solvers for faithful logical reasoning, in: Proceedings of the Conference on Empirical Methods in Natural Language Processing. URL: https://openreview.net/forum?id=nWXMv949ZH&noteId=qt0t8SsVvT
[35] Park, S., Subramonyam, H., Kulkarni, C., 2024. Thinking assistants: Llm-based conversational assistants that help users think by asking rather than answering. URL: https://arxiv.org/abs/2312.06024, arXiv:2312.06024
[36] Parmar, M., Patel, N., Varshney, N., Nakamura, M., Luo, M., Mashetty, S., et al., 2024. Logicbench: Towards systematic evaluation of logical reasoning ability of large language models, in: Proceedings of the Annual Meeting of the A... doi:10.18653/v1/2024.acl-long.739
Zenghui Zhou et al.: Preprint submitted to Elsevier
[37] Patel, N., Kulkarni, M., Parmar, M., Budhiraja, A., Nakamura, M., Varshney, N., et al., 2024. Multi-logieval: towards evaluating multi-step logical reasoning ability of large language models, in: Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 20856–20879. doi:10.18653/v1/2024.emnlp-main.1160
[38] Qi, C., Ma, R., Li, B., Du, H., Hui, B., Wu, J., et al., 2024. Large language models meet symbolic provers for logical reasoning evaluation, in: Proceedings of the International Conference on Learning Representations. URL: https://openreview.net/forum?id=C25SgeXWjE
[39] Rozière, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., et al., 2024. Code llama: Open foundation models for code. doi:10.48550/arXiv.2308.12950
[40] Saparov, A., He, H., 2022. Language models are greedy reasoners: a systematic formal analysis of chain-of-thought, in: The International Conference on Learning Representations. URL: https://openreview.net/forum?id=qFVVBzXxR2V
[41] Segura, S., Fraser, G., Sanchez, A.B., Ruiz-Cortés, A., 2016. A survey on metamorphic testing. IEEE Transactions on Software Engineering 42, 805–824. doi:10.1109/TSE.2016.2532875
[42] Singh, S., 2024. Are large language models good at fuzzy reasoning?, in: Proceedings of the International Conference on Computational Intelligence and Intelligent Systems, pp. 1–6. doi:10.1145/3708778.3708779
[43] Sinha, K., Sodhani, S., Dong, J., Pineau, J., Hamilton, W.L., 2019. Clutrr: a diagnostic benchmark for inductive reasoning from text, in: Proceedings of the Conference on Empirical Methods in Natural Language Processing and the International Joint Conference on Natural Language Processing, pp. 4505–4514. doi:10.18653/v1/D19-1458
[44] Sok, C., Luz, D., Haddam, Y., 2025. Metarag: Metamorphic testing for hallucination detection in rag systems. doi:10.48550/arXiv.2509.09360
[45] Suzgun, M., Scales, N., Schärli, N., Gehrmann, S., Tay, Y., Chung, H.W., et al., 2023. Challenging big-bench tasks and whether chain-of-thought can solve them, in: Findings of the Association for Computational Linguistics: ACL 2023, pp. 13003–13051. doi:10.18653/v1/2023.findings-acl.824
[46] Tafjord, O., Dalvi, B., Clark, P., 2021. Proofwriter: generating implications, proofs, and abductive statements over natural language, in: Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pp. 3621–3634. doi:10.18653/v1/2021.findings-acl.317
[47] Tian, J., Li, Y., Chen, W., Xiao, L., He, H., Jin, Y., 2021. Diagnosing the first-order logical reasoning ability through logicnli, in: Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 3738–3747. doi:10.18653/v1/2021.emnlp-main.303
[48]

Wan, Y., Wang, W., Yang, Y., Yuan, Y., Huang, J.t., He, P., et al.,
[49]

LogicAsker: Evaluating and Improving the Logical Reasoning Ability of Large Language Models

Logicasker: Evaluating and improving the logical reasoning ability of large language models, in: Proceedings of the International Conference on Empirical Methods in Natural Language Processing, pp. 2124–2155. doi:10.18653/v1/2024.emnlp-main.128

work page doi:10.18653/v1/2024.emnlp-main.128 2024
[50]

Self-consistency improves chain of thought reasoning in language models, in: The Eleventh International Conference on Learning Representations

Wang, X., Wei, J., Schuurmans, D., Le, Q.V., Chi, E.H., Narang, S., et al., 2022. Self-consistency improves chain of thought reasoning in language models, in: The Eleventh International Conference on Learning Representations

2022
[51]

Chain-of-thought prompting elicits reasoning in large language models, in: Proceedings of the International Conference on Neural Information Processing Systems, pp

Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., et al., 2022. Chain-of-thought prompting elicits reasoning in large language models, in: Proceedings of the International Conference on Neural Information Processing Systems, pp. 24824–24837

2022
[52]

A systematic literature review of hallucinations in large language models

Woesle, C., Fischer-Brandies, L., Buettner, R., 2025. A systematic literature review of hallucinations in large language models. IEEE Access 13, 148231–148253. doi:10.1109/ACCESS.2025.3601206

work page doi:10.1109/access.2025.3601206 2025
[53]

Wu, W., Cao, Y., Yi, N., Ou, R., Zheng, Z., 2025. Detecting and reducing the factual hallucinations of large language models with metamorphic testing. Proceedings of the ACM on Software Engineering 2, 1432–1453. doi:10.1145/3715784
[54]

Xie, X., Ho, J.W.K., Murphy, C., Kaiser, G., Xu, B., Chen, T.Y., 2011. Testing and validating machine learning classifiers by metamorphic testing. Journal of Systems and Software 84, 544–558. doi:10.1016/j.jss.2010.11.920
[55]

Xu, F., Lin, Q., Han, J., Zhao, T., Liu, J., Cambria, E., 2025. Are large language models really good logical reasoners? A comprehensive evaluation and beyond. IEEE Transactions on Knowledge and Data Engineering 37, 1620–1634. doi:10.1109/TKDE.2025.3536008
[56]

Xu, R., Wang, Z., Fan, R.Z., Liu, P., 2024. Benchmarking benchmark leakage in large language models. doi:10.48550/arXiv.2404.18824
[57]

Yang, B., Al Mamun, M.A., Zhang, J.M., Uddin, G., 2025a. Hallucination detection in large language models with metamorphic relations. Proceedings of the ACM on Software Engineering 2, 425–
[58]

Yang, B., Xia, Y., Sun, W., Liu, Y., 2025b. Hallucination detection for LLM-based text-to-SQL generation via two-stage metamorphic testing. doi:10.48550/arXiv.2512.22250
[59]

Yu, W., Jiang, Z., Dong, Y., Feng, J., 2020. ReClor: a reading comprehension dataset requiring logical reasoning, in: Proceedings of the International Conference on Learning Representations. doi:10.48550/arXiv.2002.04326
[60]

Yue, M., 2025. A survey of large language model agents for question answering. doi:10.48550/arXiv.2503.19213
[61]

Zhang, D., Li, Z.Z., Zhang, M.L., Zhang, J., Liu, Z., Yao, Y., et al., 2025. From system 1 to system 2: a survey of reasoning large language models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1–20. doi:10.1109/TPAMI.2025.3637037
[63]

Zhou, D., Schärli, N., Hou, L., Wei, J., Scales, N., Wang, X., et al., Least-to-most prompting enables complex reasoning in large language models. doi:10.48550/arXiv.2205.10625
[65]

Zhuang, Y., Yu, Y., Wang, K., Sun, H., Zhang, C., 2023. ToolQA: a dataset for LLM question answering with external tools. Advances in Neural Information Processing Systems 36, 50117–50143.

A. Completeness of M...
[67]

All citizens of Lawton Park use the zip code 98199
[69]

Daniel uses the zip code 98199.
Conclusion: Tom is a citizen of Washington.

The ground-truth label of this instance is Unknown, since no premise connects Lawton Park or Seattle to Washington.

(2) FOL Representation. The corresponding symbolic representation is shown below.
Premises
1. NeighbourhoodIn(lawtonPark, seattle)
2. forall x. (ResidentOf(x, lawtonPar...
[70]

Lawton Park is a neighborhood in Seattle
[71]

For every person, either they are not a citizen of Lawton Park, or they use the zip code 98199
[72]

Tom is a citizen of Lawton Park
[73]

Daniel uses the zip code 98199.
Conclusion: Tom is a citizen of Washington.

(5) Model Outputs and Oracle Decision. Let the model outputs for the source and follow-up test cases be denoted as $y_s$ and $y_f$, respectively. Under LGMT, a metamorphic oracle violation occurs if $y_s \neq y_f$. Since the transformation preserves logical equivalence, the correct reasoning outcome should remain unchang...
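The oracle decision in step (5) can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation: the names `implication`, `disjunction`, `equivalent`, and `violates_oracle` are hypothetical, and the propositional truth table only stands in for the quantified rewrite used in the example, where "All citizens of Lawton Park use the zip code 98199" becomes "either they are not a citizen of Lawton Park, or they use the zip code 98199".

```python
from itertools import product

# Label set taken from the prompts in this appendix.
LABELS = {"True", "False", "Unknown"}


def implication(p: bool, q: bool) -> bool:
    """Source form P -> Q: false only when P holds and Q does not."""
    return not (p and not q)


def disjunction(p: bool, q: bool) -> bool:
    """Follow-up form (not P) or Q."""
    return (not p) or q


def equivalent(f, g, arity: int = 2) -> bool:
    """Check that two Boolean functions agree on every truth assignment."""
    return all(
        f(*values) == g(*values)
        for values in product([False, True], repeat=arity)
    )


def violates_oracle(y_s: str, y_f: str) -> bool:
    """Metamorphic oracle: a violation occurs when the two labels differ."""
    if y_s not in LABELS or y_f not in LABELS:
        raise ValueError("labels must be one of True, False, Unknown")
    return y_s != y_f


# The rewrite preserves meaning, so the expected label must not change.
rewrite_is_sound = equivalent(implication, disjunction)
# Hypothetical model outputs for the source and follow-up test cases.
defect_found = violates_oracle("Unknown", "True")
```

Note that the oracle never consults the ground-truth label: a model that answers "True" on both versions passes the consistency check even though the reference answer is Unknown, which is why consistency checks complement reference-based evaluation rather than replace it.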
[75]

Zero Explanation: Do not generate any reasoning, thought processes, or introductory text. Provide only the final judgment.

# Output Format: You must output a single, strictly formatted JSON object. The JSON must contain exactly one key: "label". The value for "label" must be exactly one of the following strings: "True", "False", or "Unknown". Output...
[77]

Step-by-Step Deduction: You must perform a rigorous, step-by-step logical deduction. Act like a formal proof system. Clearly state how the premises interact to evaluate the conclusion. Do not skip logical steps.

# Output Format: You must output a single, strictly formatted JSON object. The JSON must contain exactly two keys: "reasoning" and "label...
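A response validator for the two output formats described in these prompts might look like the following. It is a minimal sketch under stated assumptions: `parse_response` and its `with_reasoning` flag are hypothetical names, and the requirement that "reasoning" be a string is an assumption, since the prompt text is truncated before it specifies the value type.

```python
import json

# Allowed label strings, as listed in the prompt.
ALLOWED_LABELS = ("True", "False", "Unknown")


def parse_response(raw: str, with_reasoning: bool = False) -> dict:
    """Parse a model reply and enforce the strict JSON contract.

    with_reasoning=False expects exactly {"label": ...};
    with_reasoning=True expects exactly {"reasoning": ..., "label": ...}.
    """
    obj = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on bad JSON
    if not isinstance(obj, dict):
        raise ValueError("response must be a JSON object")

    expected_keys = {"reasoning", "label"} if with_reasoning else {"label"}
    if set(obj) != expected_keys:
        raise ValueError(f"expected keys {sorted(expected_keys)}, got {sorted(obj)}")

    if obj["label"] not in ALLOWED_LABELS:
        raise ValueError(f"invalid label: {obj['label']!r}")

    # Assumption: the reasoning trace is free text.
    if with_reasoning and not isinstance(obj["reasoning"], str):
        raise ValueError("reasoning must be a string")
    return obj


direct = parse_response('{"label": "Unknown"}')
cot = parse_response(
    '{"reasoning": "Premise 3 gives ...", "label": "True"}',
    with_reasoning=True,
)
```

Rejecting malformed replies outright, rather than attempting to repair them, keeps format failures separate from reasoning inconsistencies when outputs are later compared across test cases.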
[79]

Zero Explanation: Do not generate any reasoning, thought processes, or introductory text. Provide only the final judgment.

# Output Format: You must output a single, strictly formatted JSON object. The JSON must contain exactly one key: "label". The value for "label" must be exactly one of the following strings: "True", "False", or "Unknown". Output...
[80]

Pure Formal Logic: Treat all provided premises as absolute truth, regardless of real-world facts. Your evaluation must rely strictly on formal logical structure.
[81]

Step-by-Step Deduction: You must perform a rigorous, step-by-step logical deduction. Act like a formal proof system. Clearly state how the premises interact to evaluate the conclusion. Do not skip logical steps.

# Output Format: You must output a single, strictly formatted JSON object. The JSON must contain exactly two keys: "reasoning" and "label...
[82]

**Logical Connectives (Scope by Structure)**
- **AND (&)**: Use "both A and B". If A is a complex sub-formula, use a comma: "both A, and B".
- **OR (|)**: Use "either A or B".
- **Biconditional (<->)**: Use "A if and only if B". (Use a comma before 'if' if A is complex).
- **Negation (-)**: Always use the prefix "it is not the case that".
[83]

**Conditional Symbol Handling**
- **Standard Word** (e.g., `Bitter(x)`, `Jadiel`): Use natural phrasing. Example: `Bitter(Jadiel)` -> "Jadiel is Bitter"; `-Bitter(Jadiel)` -> "it is not the case that Jadiel is Bitter".
- **Abstract/Placeholder** (e.g., `Pre1(x)`, `Con1`): Use formal phrasing. Example: `Pre1(x)` -> "x has property Pre1"; `-Pre1(x)` -> "it is ...
[84]

**Quantifiers & Variables**
- Keep the order strictly left-to-right.
- `all x.` -> "For all x, "
- `exists x.` -> "There exists at least one x, such that "
- **NO Pronouns**: Always repeat the variable (x, y) or entity name. Never use "it", "he", or "they".
[85]

**No Simplification**
- **Double Negation (--A)**: Translate as "it is not the case that it is not the case that A".
- **Redundancy (A | A)**: Translate as "either A is true or A is true".
- **Constants**: `& 1` -> "...and it is logically true"; `| 0` -> "...or it is logically false".

# Examples for Reference
- FOL: --Orange(Stanley) -> {"translation": "it...
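The translation rules above are written as instructions for an LLM, but their intended behaviour can be illustrated with a small deterministic translator. The sketch below is not part of LGMT. The AST classes (`Pred`, `Not`, `BinOp`, `Quant`), the `Pre<digits>` placeholder pattern, and the reading of "complex sub-formula" as "anything that is not an atomic predicate" are all assumptions made for illustration, and the `& 1` / `| 0` constants are omitted.

```python
import re
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Pred:
    name: str  # predicate symbol, e.g. "Bitter" or "Pre1"
    arg: str   # variable or entity name, e.g. "x" or "Jadiel"


@dataclass(frozen=True)
class Not:
    body: "Formula"


@dataclass(frozen=True)
class BinOp:
    op: str  # "&", "|" or "<->"
    left: "Formula"
    right: "Formula"


@dataclass(frozen=True)
class Quant:
    kind: str  # "all" or "exists"
    var: str
    body: "Formula"


Formula = Union[Pred, Not, BinOp, Quant]

# Assumption: abstract predicate placeholders look like Pre1, Pre2, ...
PLACEHOLDER = re.compile(r"^Pre\d+$")


def is_complex(f: Formula) -> bool:
    """Assumption: every non-atomic formula counts as complex."""
    return not isinstance(f, Pred)


def translate(f: Formula) -> str:
    """Render a formula in controlled English without simplifying it."""
    if isinstance(f, Pred):
        if PLACEHOLDER.match(f.name):
            return f"{f.arg} has property {f.name}"
        return f"{f.arg} is {f.name}"
    if isinstance(f, Not):
        # Negation is always a prefix, so double negation stays doubled.
        return "it is not the case that " + translate(f.body)
    if isinstance(f, BinOp):
        left, right = translate(f.left), translate(f.right)
        comma = "," if is_complex(f.left) else ""
        if f.op == "&":
            return f"both {left}{comma} and {right}"
        if f.op == "|":
            return f"either {left} or {right}"
        if f.op == "<->":
            return f"{left}{comma} if and only if {right}"
        raise ValueError(f"unknown connective: {f.op}")
    if isinstance(f, Quant):
        # The variable is repeated verbatim; no pronouns are introduced.
        if f.kind == "all":
            return f"For all {f.var}, " + translate(f.body)
        if f.kind == "exists":
            return f"There exists at least one {f.var}, such that " + translate(f.body)
        raise ValueError(f"unknown quantifier: {f.kind}")
    raise TypeError(f"unsupported node: {f!r}")


double_negation = translate(Not(Not(Pred("Orange", "Stanley"))))
quantified = translate(
    Quant("all", "x", BinOp("|", Not(Pred("Pre1", "x")), Pred("Pre2", "x")))
)
```

Because the translator never rewrites the tree, formulas that differ only in structure (for example `A` versus `--A`) produce different surface sentences, which is the property the "No Simplification" rule is meant to protect.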
