LGMT: Logic-Grounded Metamorphic Testing for Evaluating the Reasoning Reliability of LLMs
Pith reviewed 2026-06-30 22:12 UTC · model grok-4.3
The pith
LGMT uses first-order logic equivalences to build test cases that check whether LLMs give consistent answers across logically equivalent questions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LGMT derives metamorphic relations directly from first-order logic equivalences, constructs sets of semantically invariant test cases, and identifies reasoning defects by checking output consistency across each set. On six state-of-the-art LLMs the method uncovers substantial defects that reference-based evaluations overlook, with particular sensitivity to symbol-level and conclusion-level variations, and shows that Few-shot Chain-of-Thought prompting reduces but does not eliminate the inconsistencies.
What carries the argument
Metamorphic relations derived from first-order logic equivalences, used to generate invariant test cases and perform cross-case consistency checking.
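A minimal sketch of the underlying idea, restricted to the propositional fragment: before an equivalence is used as a metamorphic relation, one can check that both sides agree under every truth assignment. The relation names and the `equivalent` helper are illustrative assumptions, not the relations enumerated in the paper, and quantified first-order equivalences would need a prover rather than a truth table.

```python
from itertools import product


def implies(a: bool, b: bool) -> bool:
    """Truth table of material implication a -> b."""
    return b if a else True


def equivalent(f, g, n_vars: int) -> bool:
    """True if f and g agree on every assignment of n_vars Boolean variables."""
    return all(f(*v) == g(*v) for v in product([False, True], repeat=n_vars))


# Hypothetical candidate rewrites (left side, right side).
CANDIDATE_RELATIONS = {
    "implication_to_disjunction": (implies, lambda p, q: (not p) or q),
    "contraposition": (implies, lambda p, q: implies(not q, not p)),
    "de_morgan": (lambda p, q: not (p and q), lambda p, q: (not p) or (not q)),
    # The converse is NOT an equivalence and must be rejected.
    "converse": (implies, lambda p, q: implies(q, p)),
}


def sound_relations() -> set:
    """Names of candidates whose two sides are logically equivalent."""
    return {
        name
        for name, (f, g) in CANDIDATE_RELATIONS.items()
        if equivalent(f, g, 2)
    }
```

Only rewrites that pass such a check are safe to use for consistency testing, since a non-equivalent rewrite would legitimately change the expected answer.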
If this is right
- LLMs remain sensitive to symbol-level and conclusion-level changes even when the underlying logic is preserved.
- Few-shot Chain-of-Thought prompting reduces but does not remove the detected inconsistencies.
- Evaluation of reasoning should shift from single-question correctness to checks of invariance under logical transformations.
Where Pith is reading between the lines
- The same consistency-checking approach could be applied to other formal systems such as temporal or modal logic.
- Training objectives that explicitly reward cross-equivalence consistency might reduce the defects LGMT detects.
- The framework could be adapted to test robustness in non-strictly logical domains such as causal or probabilistic reasoning.
Load-bearing premise
That equivalences taken from first-order logic produce test cases whose meaning remains fixed enough to diagnose genuine reasoning defects rather than mere differences in phrasing or output style.
What would settle it
A collection of logical equivalences on which every tested LLM produces fully consistent and correct answers, yet the same models still commit clear reasoning errors on equivalent problems outside the chosen relations.
Original abstract
Large Language Models (LLMs) achieve strong performance on logical reasoning benchmarks, yet their reliability remains uncertain. Existing evaluations rely on static benchmarks, which fail to assess robustness under logically equivalent transformations and often overestimate reasoning capability. We propose LGMT (Logic-Grounded Metamorphic Testing), an oracle-free framework that leverages first-order logic (FOL) to evaluate LLM reasoning. By deriving metamorphic relations from formal logical equivalences, LGMT constructs semantically invariant test cases and detects reasoning defects through cross-case consistency checking. Experiments on six state-of-the-art LLMs show that LGMT exposes substantial hidden defects missed by traditional reference-based evaluations. We further find that models are particularly sensitive to symbol-level and conclusion-level variations, and that advanced prompting such as Few-shot CoT only partially mitigates these issues. These results suggest that LLM evaluation should move beyond isolated correctness toward robustness under logical invariance. LGMT provides a principled and scalable approach for diagnosing reasoning failures.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes LGMT, an oracle-free metamorphic testing framework that derives relations from first-order logic equivalences to generate semantically invariant test cases, then detects LLM reasoning defects via cross-case consistency. Experiments on six state-of-the-art LLMs are reported to expose substantial hidden defects missed by static reference-based benchmarks; models are especially sensitive to symbol-level and conclusion-level variations, and Few-shot CoT only partially mitigates the issues. The work concludes that evaluation should shift from isolated correctness to robustness under logical invariance.
Significance. If the metamorphic relations are shown to isolate genuine reasoning failures rather than surface sensitivity, LGMT would supply a scalable, logic-grounded alternative to static benchmarks and could materially improve diagnosis of LLM reasoning reliability.
major comments (2)
- [Abstract (central claim and experiments paragraph)] The central claim that output inconsistency diagnoses reasoning defects (rather than token/attention-level surface effects) rests on the unverified assumption that FOL equivalences produce test cases whose semantic invariance is tight enough for LLMs; the abstract provides no quantitative validation such as human equivalence ratings or surface-feature ablations to separate these explanations.
- [Abstract (experiments paragraph)] Without details on the exact metamorphic relations, the consistency metric, dataset construction, or statistical controls, it is impossible to determine whether the reported defects support the claim that LGMT outperforms traditional evaluations.
minor comments (1)
- [Abstract] The abstract refers to 'six state-of-the-art LLMs' and 'advanced prompting such as Few-shot CoT' without naming the models or specifying the prompting variants and baselines used.
Simulated Author's Rebuttal
We thank the referee for the detailed feedback on the abstract. We address each major comment below with references to the full manuscript, which provides the supporting methodology and experiments.
- Referee: [Abstract (central claim and experiments paragraph)] The central claim that output inconsistency diagnoses reasoning defects (rather than token/attention-level surface effects) rests on the unverified assumption that FOL equivalences produce test cases whose semantic invariance is tight enough for LLMs; the abstract provides no quantitative validation such as human equivalence ratings or surface-feature ablations to separate these explanations.
  Authors: The metamorphic relations are derived directly from established first-order logic equivalences (detailed in Section 3.1 and Table 1), which are mathematically guaranteed to preserve semantic meaning. This formal grounding, rather than empirical human ratings, underpins the claim that inconsistencies indicate reasoning defects. Section 5 includes controls that isolate logical variations (e.g., symbol and conclusion changes) while holding surface features constant where possible, and reports statistically significant differences across six LLMs. We maintain that the logical foundation suffices without additional human studies for the core argument.
  Revision: no
- Referee: [Abstract (experiments paragraph)] Without details on the exact metamorphic relations, the consistency metric, dataset construction, or statistical controls, it is impossible to determine whether the reported defects support the claim that LGMT outperforms traditional evaluations.
  Authors: The abstract summarizes the approach at a high level due to length constraints. Full details appear in the manuscript: metamorphic relations are enumerated in Section 3 and Table 1; the consistency metric (cross-case agreement rate) is defined in Section 4.2; dataset construction from logical templates is described in Section 4.1; and statistical controls (multiple runs, significance testing) are reported in Section 5. These elements support the finding that LGMT detects defects missed by static benchmarks, with sensitivity analyses for symbol-level and conclusion-level variations.
  Revision: no
- Additional quantitative validation such as new human equivalence ratings or expanded surface-feature ablations would require experiments outside the current manuscript scope.
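The rebuttal names a cross-case agreement rate but this page does not reproduce its definition. The sketch below shows one plausible reading, stated as an assumption: a group of logically equivalent test cases counts as consistent only if the model returns the same label for every case, and the rate is the fraction of consistent groups. The function names and the group layout are hypothetical.

```python
from typing import Dict, List


def group_is_consistent(labels: List[str]) -> bool:
    """A group is consistent when every equivalent case received the same label."""
    return len(set(labels)) <= 1


def agreement_rate(groups: Dict[str, List[str]]) -> float:
    """Fraction of equivalence groups on which the model's outputs all agree.

    Assumed definition for illustration; the paper's Section 4.2 may differ.
    """
    if not groups:
        raise ValueError("at least one test group is required")
    consistent = sum(group_is_consistent(labels) for labels in groups.values())
    return consistent / len(groups)
```

Note that this metric needs no ground-truth labels, which is what makes the approach oracle-free; it also means a group can be consistent and still wrong.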
Circularity Check
No circularity detected in LGMT derivation
The LGMT framework derives metamorphic relations directly from standard first-order logic equivalences, which are external mathematical facts independent of the paper. No equations, fitted parameters, self-citations, or ansatzes are presented that reduce any claimed prediction or result to the method's own inputs by construction. The approach is self-contained against external benchmarks in logic, and the empirical evaluation on LLMs does not involve renaming known results or load-bearing self-references. This is the normal honest outcome for a method grounded in established formal logic.
Axiom & Free-Parameter Ledger
axioms (1)
- [standard math] First-order logic equivalences define semantically invariant transformations
Reference graph
Works this paper leans on
-
[1]
Cao, C., Li, M., Dai, J., Yang, J., Zhao, Z., Zhang, S., et al., 2025. Towards advanced mathematical reasoning for llms via first-order logic theorem proving, in: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics. pp. 12429–12449. doi:10.18653/v1/2025.emnlp-main.628
-
[2]
Chen, S., Jin, S., Xie, X., 2021. Testing your question answering software via asking recursively, in: Proceedings of the International Conference on Automated Software Engineering, pp. 104–116. doi:10.1109/ASE51524.2021.9678670
-
[3]
Metamorphic testing: A new approach for generating next test cases
Chen, T.Y., Cheung, S.C., Yiu, S.M., 1998. Metamorphic testing: A new approach for generating next test cases. Technical Report HKUST-CS98-01. Hong Kong University of Science and Technology
1998
-
[4]
Chen, T.Y., Kuo, F.C., Liu, H., Poon, P.L., Towey, D., Tse, T.H., et al.,
-
[5]
ACM Computing Surveys 51, 1–27
Metamorphic testing: a review of challenges and opportunities. ACM Computing Surveys 51, 1–27. doi:10.1145/3143561
-
[6]
Cho, S., Ruberto, S., Terragni, V., 2025. Metamorphic testing of large language models for natural language processing. doi:10.48550/arXiv.2511.02108
-
[7]
Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge
Clark, P., Cowhey, I., Etzioni, O., Khot, T., Sabharwal, A., Schoenick, C., et al., 2018. Think you have solved question answering? try arc, the ai2 reasoning challenge. doi:10.48550/ARXIV.1803.05457
-
[8]
Transformers as soft reasoners over language, in: Proceedings of the International Joint Conference on Artificial Intelligence, pp
Clark, P., Tafjord, O., Richardson, K., 2021. Transformers as soft reasoners over language, in: Proceedings of the International Joint Conference on Artificial Intelligence, pp. 3882–3890
2021
-
[9]
Errors of measurement in statistics
Cochran, W.G., 1968. Errors of measurement in statistics. Technometrics 10, 637–666. doi:10.1080/00401706.1968.10490621
-
[10]
Dalvi, B., Jansen, P., Tafjord, O., Xie, Z., Smith, H., Pipatanangkura, L., et al., 2021. Explaining answers with entailment trees, in: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics. pp. 7358–7370. doi:10.18653/v1/2021.emnlp-main.585
-
[11]
DeepSeek, Liu, A., Feng, B., Xue, B., Wang, B., Wu, B., et al., 2025. Deepseek-v3 technical report. doi:10.48550/arXiv.2412.19437
-
[12]
DeepSeek-AI, Zhu, Q., Guo, D., Shao, Z., Yang, D., Wang, P., et al.,
-
[13]
DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence
Deepseek-coder-v2: Breaking the barrier of closed-source models in code intelligence. doi:10.48550/arXiv.2406.11931
-
[14]
Dziri, N., Lu, X., Sclar, M., Li, X.L., Jiang, L., Lin, B.Y., et al.,
-
[15]
70293–70332
Faith and fate: limits of transformers on compositionality, in: Proceedings of the International Conference on Neural Information Processing Systems, pp. 70293–70332
-
[16]
Logical consistency of large language models in fact-checking, in: The Thirteenth International Conference on Learning Representations
Ghosh, B., Hasan, S., Arafat, N.A., Khan, A., 2025. Logical consistency of large language models in fact-checking, in: The Thirteenth International Conference on Learning Representations. URL: https://openreview.net/forum?id=SimlDuN0YT
2025
-
[17]
Guan, Y., Wang, D., Chu, Z., Wang, S., Ni, F., Song, R., et al., 2023. Intelligent virtual assistants with llm-based process automation. URL: https://arxiv.org/abs/2312.06677, arXiv:2312.06677
arXiv 2023
-
[18]
Han, S., Schoelkopf, H., Zhao, Y., Qi, Z., Riddell, M., Zhou, W., et al., 2024. Folio: natural language reasoning with first-order logic, in: Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 22017–22031. doi:10.18653/v1/2024.emnlp-main.1229
-
[19]
Holliday, W.H., Mandelkern, M., Zhang, C.E., 2024. Conditional and modal reasoning in large language models, in: Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 3800–3821. doi:10.18653/v1/2024.emnlp-main.222
-
[20]
Huang, L., Yu, W., Ma, W., Zhong, W., Feng, Z., Wang, H., et al.,
-
[21]
A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. ACM Transactions on Information Systems 43, 1–55. doi:10.1145/3703155
-
[22]
Hyun, S., Guo, M., Babar, M.A., 2024. Metal: metamorphic testing framework for analyzing large-language model qualities, in: Proceedings of the IEEE Conference on Software Testing, Verification and Validation, IEEE. pp. 117–128. doi:10.1109/ICST60714.2024.00019
-
[23]
16889–16914
Jiang, J., Wang, J., Yan, Y., Liu, Y., Zhu, J., Zhang, M., et al., 2025. Do large language models excel in complex logical reasoning with formal language?, in: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 16889–16914. doi:10.18653/v1/2025.emnlp-main.855
2025
-
[24]
Jimenez, C.E., Yang, J., Wettig, A., Yao, S., Pei, K., Press, O., et al.,
-
[25]
Swe-bench: Can language models resolve real-world github issues? doi:10.48550/arXiv.2310.06770
-
[26]
Retrieval-augmented generation for knowledge-intensive nlp tasks, in: Advances in Neural Information Processing Systems, Curran Associates, Inc
Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., et al., 2020. Retrieval-augmented generation for knowledge-intensive nlp tasks, in: Advances in Neural Information Processing Systems, Curran Associates, Inc. pp. 9459–9474
2020
-
[27]
Drowzee: metamorphic testing for fact-conflicting hallucination detection in large language models
Li, N., Li, Y., Liu, Y., Shi, L., Wang, K., Wang, H., 2024. Drowzee: metamorphic testing for fact-conflicting hallucination detection in large language models. Proceedings of the ACM on Programming Languages 8, 1843–1872. doi:10.1145/3689776
-
[28]
Evaluating the logical reasoning abilities of large reasoning models
Liu, H., Ding, Y., Fu, Z., Zhang, C., Liu, X., Zhang, Y., 2025. Evaluating the logical reasoning abilities of large reasoning models. doi:10.48550/arXiv.2505.11854
-
[29]
Liu, J., Cui, L., Liu, H., Huang, D., Wang, Y., Zhang, Y., 2021. Logiqa: a challenge dataset for machine reading comprehension with logical reasoning, in: Proceedings of the International Joint Conference on Artificial Intelligence, pp. 3622–3628. URL: https://dl.acm.org/doi/10.5555/3491440.3491941
-
[30]
Luo, M., Kumbhar, S., shen, M., Parmar, M., Varshney, N., Banerjee, P., et al., 2023. Towards logiglue: A brief survey and a benchmark for analyzing logical reasoning capabilities of language models. doi:10.48550/ARXIV.2310.00836
arXiv 2023
-
[31]
Beyond accuracy: Evaluating the reasoning behavior of large language models - a survey, in: First Conference on Language Modeling
Mondorf, P., Plank, B., 2024. Beyond accuracy: Evaluating the reasoning behavior of large language models - a survey, in: First Conference on Language Modeling. URL: https://openreview.net/forum?id=Lmjgl2n11u
2024
-
[32]
Murphy, C., Kaiser, G.E., Hu, L., Wu, L., 2008. Properties of machine learning applications for use in metamorphic testing, in: Proceedings of the International Conference on Software Engineering & Knowledge Engineering, Knowledge Systems Institute Graduate School. pp. 867–872
2008
-
[33]
Olausson, T., Gu, A., Lipkin, B., Zhang, C., Solar-Lezama, A., Tenenbaum, J., et al., 2023. Linc: a neurosymbolic approach for logical reasoning by combining language models with first-order logic provers, in: Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 5153–5176. doi:10.18653/v1/2023.emnlp-main.313
-
[34]
Logic-lm: Empowering large language models with symbolic solvers for faithful logical reasoning, in: Proceedings of the Conference on Empirical Methods in Natural Language Processing
Pan, L., Albalak, A., Wang, X., Wang, W.Y., 2023. Logic-lm: Empowering large language models with symbolic solvers for faithful logical reasoning, in: Proceedings of the Conference on Empirical Methods in Natural Language Processing. URL: https://openreview.net/forum?id=nWXMv949ZH&noteId=qt0t8SsVvT
2023
-
[35]
Park, S., Subramonyam, H., Kulkarni, C., 2024. Thinking assistants: Llm-based conversational assistants that help users think by asking rather than answering. URL: https://arxiv.org/abs/2312.06024, arXiv:2312.06024
arXiv 2024
-
[36]
Parmar, M., Patel, N., Varshney, N., Nakamura, M., Luo, M., Mashetty, S., et al., 2024. Logicbench: Towards systematic evaluation of logical reasoning ability of large language models, in: Proceedings of the Annual Meeting of the A...
Zenghui Zhou et al.: Preprint submitted to Elsevier
-
[37]
Patel, N., Kulkarni, M., Parmar, M., Budhiraja, A., Nakamura, M., Varshney, N., et al., 2024. Multi-logieval: towards evaluating multi-step logical reasoning ability of large language models, in: Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 20856–20879. doi:10.18653/v1/2024.emnlp-main.1160
-
[38]
Large language models meet symbolic provers for logical reasoning evaluation, in: Proceedings of the International Conference on Learning Representations
Qi, C., Ma, R., Li, B., Du, H., Hui, B., Wu, J., et al., 2024. Large language models meet symbolic provers for logical reasoning evaluation, in: Proceedings of the International Conference on Learning Representations. URL: https://openreview.net/forum?id=C25SgeXWjE
2024
-
[39]
Code llama: Open foundation models for code
Rozière, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X.E., et al., 2024. Code llama: Open foundation models for code. doi:10.48550/arXiv.2308.12950
Pith/arXiv arXiv 2024
-
[40]
Language models are greedy reasoners: a systematic formal analysis of chain-of-thought, in: The International Conference on Learning Representations
Saparov, A., He, H., 2022. Language models are greedy reasoners: a systematic formal analysis of chain-of-thought, in: The International Conference on Learning Representations. URL: https://openreview.net/forum?id=qFVVBzXxR2V
2022
-
[41]
IEEE Transactions on Software Engineering 42, 805–824
Segura, S., Fraser, G., Sanchez, A.B., Ruiz-Cortés, A., 2016. A survey on metamorphic testing. IEEE Transactions on Software Engineering 42, 805–824. doi:10.1109/TSE.2016.2532875
-
[42]
Singh, S., 2024. Are large language models good at fuzzy reasoning?, in: Proceedings of the International Conference on Computational Intelligence and Intelligent Systems, pp. 1–6. doi:10.1145/3708778.3708779
-
[43]
Sinha, K., Sodhani, S., Dong, J., Pineau, J., Hamilton, W.L., 2019. Clutrr: a diagnostic benchmark for inductive reasoning from text, in: Proceedings of the Conference on Empirical Methods in Natural Language Processing and the International Joint Conference on Natural Language Processing, pp. 4505–4514. doi:10.18653/v1/D19-1458
-
[44]
Sok, C., Luz, D., Haddam, Y., 2025. Metarag: Metamorphic testing for hallucination detection in rag systems. doi:10.48550/arXiv.2509.09360
-
[45]
Challenging big-bench tasks and whether chain-of-thought can solve them, in: Findings of the Association for Computational Linguistics: ACL 2023, pp
Suzgun, M., Scales, N., Schärli, N., Gehrmann, S., Tay, Y., Chung, H.W., et al., 2023. Challenging big-bench tasks and whether chain-of-thought can solve them, in: Findings of the Association for Computational Linguistics: ACL 2023, pp. 13003–13051. doi:10.18653/v1/2023.findings-acl.824
2023
-
[46]
Tafjord, O., Dalvi, B., Clark, P., 2021. Proofwriter: generating implications, proofs, and abductive statements over natural language, in: Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pp. 3621–3634. doi:10.18653/v1/2021.findings-acl.317
-
[47]
Tian, J., Li, Y., Chen, W., Xiao, L., He, H., Jin, Y., 2021. Diagnosing the first-order logical reasoning ability through logicnli, in: Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 3738–3747. doi:10.18653/v1/2021.emnlp-main.303
-
[48]
Wan, Y., Wang, W., Yang, Y., Yuan, Y., Huang, J.t., He, P., et al.,
-
[49]
LogicAsker: Evaluating and Improving the Logical Reasoning Ability of Large Language Models
Logicasker: Evaluating and improving the logical reasoning ability of large language models, in: Proceedings of the International Conference on Empirical Methods in Natural Language Processing, pp. 2124–2155. doi:10.18653/v1/2024.emnlp-main.128
-
[50]
Self-consistency improves chain of thought reasoning in language models, in: The Eleventh International Conference on Learning Representations
Wang, X., Wei, J., Schuurmans, D., Le, Q.V., Chi, E.H., Narang, S., et al., 2022. Self-consistency improves chain of thought reasoning in language models, in: The Eleventh International Conference on Learning Representations
2022
-
[51]
Chain-of-thought prompting elicits reasoning in large language models, in: Proceedings of the International Conference on Neural Information Processing Systems, pp
Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., et al., 2022. Chain-of-thought prompting elicits reasoning in large language models, in: Proceedings of the International Conference on Neural Information Processing Systems, pp. 24824–24837
2022
-
[52]
A systematic literature review of hallucinations in large language models
Woesle, C., Fischer-Brandies, L., Buettner, R., 2025. A systematic literature review of hallucinations in large language models. IEEE Access 13, 148231–148253. doi:10.1109/ACCESS.2025.3601206
-
[53]
Detecting and reducing the factual hallucinations of large language models with metamorphic testing
Wu, W., Cao, Y., Yi, N., Ou, R., Zheng, Z., 2025. Detecting and reducing the factual hallucinations of large language models with metamorphic testing. Proceedings of the ACM on Software Engineering 2, 1432–1453. doi:10.1145/3715784
-
[54]
Testing and validating machine learning classifiers by metamorphic testing
Xie, X., Ho, J.W.K., Murphy, C., Kaiser, G., Xu, B., Chen, T.Y., 2011. Testing and validating machine learning classifiers by metamorphic testing. Journal of Systems and Software 84, 544–558. doi:10.1016/j.jss.2010.11.920
2011
-
[55]
Are large language models really good logical reasoners? a comprehensive evaluation and beyond
Xu, F., Lin, Q., Han, J., Zhao, T., Liu, J., Cambria, E., 2025. Are large language models really good logical reasoners? a comprehensive evaluation and beyond. IEEE Transactions on Knowledge and Data Engineering 37, 1620–1634. doi:10.1109/TKDE.2025.3536008
-
[56]
Xu, R., Wang, Z., Fan, R.Z., Liu, P., 2024. Benchmarking benchmark leakage in large language models. doi:10.48550/arXiv.2404.18824
-
[57]
Hallucination detection in large language models with metamorphic relations
Yang, B., Al Mamun, M.A., Zhang, J.M., Uddin, G., 2025a. Hallucination detection in large language models with metamorphic relations. Proceedings of the ACM on Software Engineering 2, 425–
-
[58]
Hallucination detection for llm-based text-to-sql generation via two-stage metamorphic testing
Yang, B., Xia, Y., Sun, W., Liu, Y., 2025b. Hallucination detection for llm-based text-to-sql generation via two-stage metamorphic testing. doi:10.48550/arXiv.2512.22250
-
[59]
Yu, W., Jiang, Z., Dong, Y., Feng, J., 2020. Reclor: a reading comprehension dataset requiring logical reasoning, in: Proceedings of the International Conference on Learning Representations. doi:10.48550/arXiv.2002.04326
arXiv 2020
-
[60]
A survey of large language model agents for question answering
Yue, M., 2025. A survey of large language model agents for question answering. doi:10.48550/arXiv.2503.19213
-
[61]
Zhang, D., Li, Z.Z., Zhang, M.L., Zhang, J., Liu, Z., Yao, Y., et al.,
-
[62]
IEEE Transactions on Pattern Analysis and Machine Intelligence, 1–20. doi:10.1109/TPAMI.2025.3637037
From system 1 to system 2: A survey of reasoning large language models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1–20. doi:10.1109/TPAMI.2025.3637037
-
[63]
Zhou, D., Schärli, N., Hou, L., Wei, J., Scales, N., Wang, X., et al.,
-
[64]
Least-to-Most Prompting Enables Complex Reasoning in Large Language Models
Least-to-most prompting enables complex reasoning in large language models. doi:10.48550/arXiv.2205.10625
-
[65]
Toolqa: A dataset for llm question answering with external tools
Zhuang, Y., Yu, Y., Wang, K., Sun, H., Zhang, C., 2023. Toolqa: A dataset for llm question answering with external tools. Advances in Neural Information Processing Systems 36, 50117–50143.
A. Completeness of M...
2023
-
[67]
All citizens of Lawton Park use the zip code 98199
-
[69]
Conclusion Tom is a citizen of Washington
Daniel uses the zip code 98199. Conclusion: Tom is a citizen of Washington. The ground-truth label of this instance is Unknown, since no premise connects Lawton Park or Seattle to Washington. (2) FOL Representation. The corresponding symbolic representation is shown below. Premises: 1. NeighbourhoodIn(lawtonPark, seattle) 2. forall x. (ResidentOf(x, lawtonPar...
-
[70]
Lawton Park is a neighborhood in Seattle
-
[71]
For every person, either they are not a citizen of Lawton Park, or they use the zip code 98199
-
[72]
Tom is a citizen of Lawton Park
-
[73]
Conclusion Tom is a citizen of Washington
Daniel uses the zip code 98199. Conclusion: Tom is a citizen of Washington. (5) Model Outputs and Oracle Decision. Let the model outputs for the source and follow-up test cases be denoted as $y_s$ and $y_f$, respectively. Under LGMT, a metamorphic oracle violation occurs if $y_s \neq y_f$. Since the transformation preserves logical equivalence, the correct reasoning outcome should remain unchang...
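The oracle decision described in this fragment can be sketched in a few lines. This is an illustrative reading, assuming the three-way label set used elsewhere on this page; the `MetamorphicPair` class and `violates_oracle` function are hypothetical names, not the authors' code.

```python
from dataclasses import dataclass

# Label set taken from the prompt fragments on this page.
VALID_LABELS = {"True", "False", "Unknown"}


@dataclass(frozen=True)
class MetamorphicPair:
    """Model outputs for a source case and its logically equivalent follow-up."""
    source_output: str    # y_s
    followup_output: str  # y_f


def violates_oracle(pair: MetamorphicPair) -> bool:
    """Return True when y_s != y_f, i.e. the metamorphic oracle is violated."""
    for label in (pair.source_output, pair.followup_output):
        if label not in VALID_LABELS:
            raise ValueError(f"unexpected label: {label!r}")
    return pair.source_output != pair.followup_output
```

For the Lawton Park example, a model that answers Unknown on the source case but True on the disjunctive rewrite would be flagged, without consulting the ground-truth label at all.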
-
[75]
Zero Explanation: Do not generate any reasoning, thought processes, or introductory text. Provide only the final judgment.
# Output Format: You must output a single, strictly formatted JSON object. The JSON must contain exactly one key: "label". The value for "label" must be exactly one of the following strings: "True", "False", or "Unknown". Output...
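A response that follows this output format can be validated mechanically before any consistency check is run. The sketch below is an assumption about how such validation might look, using only the standard `json` module; `parse_label` is a hypothetical helper and the page does not describe how the authors handle malformed outputs.

```python
import json

ALLOWED_LABELS = ("True", "False", "Unknown")


def parse_label(raw: str) -> str:
    """Parse a model response and return its label.

    Raises ValueError if the response is not a JSON object with exactly
    one key, "label", whose value is one of the allowed strings.
    """
    obj = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(obj, dict) or set(obj) != {"label"}:
        raise ValueError('expected a JSON object with exactly one key "label"')
    label = obj["label"]
    if label not in ALLOWED_LABELS:
        raise ValueError(f"label must be one of {ALLOWED_LABELS}, got {label!r}")
    return label
```

Rejecting malformed responses explicitly, rather than coercing them to a label, keeps formatting failures from being miscounted as reasoning inconsistencies.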
-
[77]
reasoning
Step-by-Step Deduction: You must perform a rigorous, step-by-step logical deduction. Act like a formal proof system. Clearly state how the premises interact to evaluate the conclusion. Do not skip logical steps.
# Output Format: You must output a single, strictly formatted JSON object. The JSON must contain exactly two keys: "reasoning" and "label...
-
[79]
Zero Explanation: Do not generate any reasoning, thought processes, or introductory text. Provide only the final judgment.
# Output Format: You must output a single, strictly formatted JSON object. The JSON must contain exactly one key: "label". The value for "label" must be exactly one of the following strings: "True", "False", or "Unknown". Output...
-
[80]
Your evaluation must rely strictly on formal logical structure
Pure Formal Logic: Treat all provided premises as absolute truth, regardless of real-world facts. Your evaluation must rely strictly on formal logical structure.
-
[81]
reasoning
Step-by-Step Deduction: You must perform a rigorous, step-by-step logical deduction. Act like a formal proof system. Clearly state how the premises interact to evaluate the conclusion. Do not skip logical steps.
# Output Format: You must output a single, strictly formatted JSON object. The JSON must contain exactly two keys: "reasoning" and "label...
-
[82]
both A and B
**Logical Connectives (Scope by Structure)**
- **AND (&)**: Use "both A and B". If A is a complex sub-formula, use a comma: "both A, and B".
- **OR (|)**: Use "either A or B".
- **Biconditional (<->)**: Use "A if and only if B". (Use a comma before 'if' if A is complex.)
- **Negation (-)**: Always use the prefix "it is not the case that".
-
[83]
Jadiel is Bitter
**Conditional Symbol Handling**
- **Standard Word** (e.g., `Bitter(x)`, `Jadiel`): Use natural phrasing.
  Example: `Bitter(Jadiel)` -> "Jadiel is Bitter"; `-Bitter(Jadiel)` -> "it is not the case that Jadiel is Bitter".
- **Abstract/Placeholder** (e.g., `Pre1(x)`, `Con1`): Use formal phrasing.
  Example: `Pre1(x)` -> "x has property Pre1"; `-Pre1(x)` -> "it is ...
-
[84]
For all x,
**Quantifiers & Variables**
- Keep the order strictly left-to-right.
- `all x.` -> "For all x, "
- `exists x.` -> "There exists at least one x, such that "
- **NO Pronouns**: Always repeat the variable (x, y) or entity name. Never use "it", "he", or "they".
-
[85]
it is not the case that it is not the case that A
**No Simplification**
- **Double Negation (--A)**: Translate as "it is not the case that it is not the case that A".
- **Redundancy (A | A)**: Translate as "either A is true or A is true".
- **Constants**: `& 1` -> "...and it is logically true"; `| 0` -> "...or it is logically false".
# Examples for Reference
- FOL: --Orange(Stanley) -> {"translation": "it...
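The translation rules above are written as instructions for an LLM, but a subset of them is deterministic enough to sketch as a rule-based translator. The code below is an illustration only: the tuple-based formula encoding, the `translate` function, the placeholder pattern `Pre<n>`/`Con<n>`, and the reading of "complex" as "anything other than a bare predicate" are all assumptions, and the redundancy and constant rules are not covered.

```python
import re

# Assumed pattern for abstract/placeholder predicate names such as Pre1, Con1.
PLACEHOLDER = re.compile(r"^(Pre|Con)\d+$")


def is_atomic(formula) -> bool:
    """Assumption: a sub-formula is 'complex' unless it is a bare predicate."""
    return formula[0] == "pred"


def translate(formula) -> str:
    """Translate a tuple-encoded formula to English without simplifying it.

    Encoding (hypothetical): ("pred", name, arg), ("not", f), ("and", a, b),
    ("or", a, b), ("iff", a, b), ("all", var, f), ("exists", var, f).
    """
    kind = formula[0]
    if kind == "pred":
        _, name, arg = formula
        if PLACEHOLDER.match(name):
            return f"{arg} has property {name}"
        return f"{arg} is {name}"
    if kind == "not":
        # Nested negations are kept as-is: no double-negation elimination.
        return "it is not the case that " + translate(formula[1])
    if kind == "and":
        sep = " and " if is_atomic(formula[1]) else ", and "
        return "both " + translate(formula[1]) + sep + translate(formula[2])
    if kind == "or":
        return "either " + translate(formula[1]) + " or " + translate(formula[2])
    if kind == "iff":
        sep = " if and only if " if is_atomic(formula[1]) else ", if and only if "
        return translate(formula[1]) + sep + translate(formula[2])
    if kind == "all":
        return f"For all {formula[1]}, " + translate(formula[2])
    if kind == "exists":
        return f"There exists at least one {formula[1]}, such that " + translate(formula[2])
    raise ValueError(f"unknown formula kind: {kind!r}")
```

Because the variable or entity name is always passed through verbatim, the sketch also respects the no-pronouns rule. A deterministic translator like this could serve as a cross-check on LLM-produced translations, though the page does not say whether the authors use one.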