Identifying the Achilles' Heel: An Iterative Method for Dynamically Uncovering Factual Errors in Large Language Models
Pith reviewed 2026-05-24 04:40 UTC · model grok-4.3
The pith
HalluHunter extracts fact triplets from knowledge graphs to generate questions and iteratively targets LLM errors.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
HalluHunter employs a knowledge-graph-based approach: it extracts fact triplets and applies rule-based Natural Language Processing (NLP) techniques to generate diverse question types for single- and multi-hop reasoning. Its iterative process begins with random triplet selection and then shifts to adaptive selection of triplets on which LLMs frequently err, revealing factual inaccuracies in up to 55% of questions across nine prominent LLMs.
What carries the argument
The iterative adaptive selection step that uses prior LLM performance to prioritize error-prone fact triplets for new question generation.
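The adaptive step can be illustrated with a minimal sketch. It is not the authors' implementation: the `AdaptiveSelector` class, the grouping of errors by relation, and the Laplace-smoothed weighting are all assumptions made here for illustration, since the summary only says that later iterations target triplets where LLMs frequently err.

```python
import random


class AdaptiveSelector:
    """Hypothetical selector: uniform at first, then biased toward error-prone relations."""

    def __init__(self, triplets, seed=0):
        # Each triplet is a (subject, relation, object) tuple.
        self.triplets = list(triplets)
        self.rng = random.Random(seed)
        self.asked = {}  # relation -> number of questions asked
        self.wrong = {}  # relation -> number of questions answered incorrectly

    def record(self, triplet, is_error):
        """Store the outcome of one question generated from `triplet`."""
        relation = triplet[1]
        self.asked[relation] = self.asked.get(relation, 0) + 1
        self.wrong[relation] = self.wrong.get(relation, 0) + int(is_error)

    def error_rate(self, relation):
        # Laplace smoothing (an assumed design choice) keeps unseen
        # relations selectable with a neutral weight of 0.5.
        wrong = self.wrong.get(relation, 0)
        asked = self.asked.get(relation, 0)
        return (wrong + 1) / (asked + 2)

    def select(self, k):
        if not self.asked:
            # First iteration: random selection without replacement.
            return self.rng.sample(self.triplets, min(k, len(self.triplets)))
        # Later iterations: sample with replacement, weighted by observed error rate.
        weights = [self.error_rate(t[1]) for t in self.triplets]
        return self.rng.choices(self.triplets, weights=weights, k=k)
```

Because the smoothed weight never reaches zero, relations the model answers well are still sampled occasionally, which is one simple way to keep coverage while concentrating on harder cases.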
If this is right
- HalluHunter triggers factual errors in up to 55% of tested questions across nine LLMs.
- Adaptive selection of triplets exposes weaknesses in existing factuality benchmarks while preserving question coverage.
- The framework supplies a fully automated alternative that reduces reliance on human labor and avoids test-data contamination.
- The method maintains coverage of questions even as it focuses on harder cases.
Where Pith is reading between the lines
- The same triplet-extraction and iteration pattern could be applied to commonsense or reasoning errors beyond strict factuality.
- If knowledge-graph coverage is incomplete, the method would systematically miss errors on facts outside the graph.
- Repeated runs on the same model could track whether fine-tuning or updates reduce the error rate on the same triplet set.
Load-bearing premise
The extracted fact triplets from the knowledge graph are accurate ground-truth facts and the rule-based NLP conversion produces questions that correctly and unambiguously test the LLM's knowledge of those facts without introducing new factual errors or ambiguities.
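To make the premise concrete, here is a small sketch of template-based question generation from triplets. The `PHRASES` table, the relation names, and the three helper functions are hypothetical; the paper's actual rules are not reproduced on this page. Each function returns the question together with the expected answer taken from the triplet.

```python
# Hypothetical relation -> noun-phrase templates (not taken from the paper).
PHRASES = {
    "birthplace": "the birthplace of {x}",
    "country": "the country of {x}",
}


def wh_question(triplet):
    """Single-hop WH question whose answer is the triplet's object."""
    subject, relation, obj = triplet
    return f"What is {PHRASES[relation].format(x=subject)}?", obj


def yes_no_question(triplet):
    """Single-hop yes/no question that restates the triplet, so the answer is 'yes'."""
    subject, relation, obj = triplet
    return f"Is {obj} {PHRASES[relation].format(x=subject)}?", "yes"


def two_hop_question(first, second):
    """Multi-hop question built by chaining two triplets through a shared entity."""
    if first[2] != second[0]:
        raise ValueError("triplets do not chain: object of first must equal subject of second")
    inner = PHRASES[first[1]].format(x=first[0])
    outer = PHRASES[second[1]].format(x=inner)
    return f"What is {outer}?", second[2]


einstein = ("Albert Einstein", "birthplace", "Ulm")
ulm = ("Ulm", "country", "Germany")
print(wh_question(einstein))
print(yes_no_question(einstein))
print(two_hop_question(einstein, ulm))
```

Even in this toy form, the fidelity risk is visible: a template that reads naturally for one relation can produce an ambiguous question for another, which is the kind of failure the premise assumes away.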
What would settle it
If independent verification shows that the generated questions do not match the original triplets or introduce their own factual distortions, then the reported error rates would not measure true LLM factuality.
Original abstract
Large Language Models (LLMs) like ChatGPT are foundational in various applications due to their extensive knowledge from pre-training and fine-tuning. Despite this, they are prone to generating factual and commonsense errors, raising concerns in critical areas like healthcare, journalism, and education to mislead users. Current methods for evaluating LLMs' veracity are limited by the need for extensive human labor, test data contamination, or limited scope, hindering efficient and effective exposure of errors. To address these challenges, we propose HalluHunter, a novel, fully automated framework for systematically uncovering factual inaccuracies in LLMs. HalluHunter employs a knowledge-graph-based approach, extracting fact triplets to generate diverse question types for single- and multi-hop reasoning using rule-based Natural Language Processing (NLP) techniques. Its iterative process starts with random triplet selection for question generation, followed by adaptive selection in subsequent iterations, targeting triplets where LLMs frequently err based on their performance analysis. Our extensive tests on nine prominent LLMs reveal that HalluHunter can trigger factual errors in up to 55% of tested questions. Moreover, we demonstrate that HalluHunter's test cases, particularly in adaptive selection, could further expose the weaknesses in benchmarking the factuality in LLMs meanwhile maintaining the coverage of questions. All code, data, and results are available at this link: https://github.com/Mysterchan/HalluHunter.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes HalluHunter, a fully automated iterative framework that extracts fact triplets from a knowledge graph and applies rule-based NLP to generate single- and multi-hop questions. It begins with random triplet selection and shifts to adaptive selection targeting triplets where LLMs err most frequently. Experiments on nine LLMs are reported to trigger factual errors in up to 55% of questions, with the adaptive mode claimed to expose weaknesses in existing factuality benchmarks while preserving question coverage. Code, data, and results are released publicly.
Significance. If the generated questions can be shown to be faithful tests of the extracted facts, the approach would supply a scalable, low-human-effort method for dynamically surfacing LLM factual errors and could complement static benchmarks. The public release of artifacts supports reproducibility and further experimentation by the community.
major comments (2)
- [Method (§3)] Method section (described in abstract and §3): The 55% error-triggering result and all downstream claims rest on the unverified premise that KG-extracted triplets constitute accurate ground-truth facts and that the rule-based NLP conversion produces questions that unambiguously test exactly those facts. No human audit, external cross-check, or inter-annotator agreement on triplet accuracy or question fidelity is reported; if either step introduces errors or ambiguities, the measured LLM failures are not guaranteed to be factual inaccuracies.
- [Experiments / Results] Experiments / Results: The abstract states that HalluHunter triggers errors in up to 55% of tested questions across nine LLMs, yet provides no description of how factual errors were independently verified, what baselines or coverage metrics were used, or how question validity was ensured. This information is load-bearing for interpreting the performance numbers and the claim that adaptive selection better exposes benchmarking weaknesses.
minor comments (2)
- [Abstract] The GitHub link is supplied, which is helpful for reproducibility; however, the abstract could more explicitly state the precise coverage metric used to support the claim that adaptive selection maintains coverage.
- [Method (§3)] Notation for single-hop versus multi-hop question generation and the precise adaptive-selection criterion (e.g., error-frequency threshold) should be formalized with pseudocode or equations to improve clarity.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major comment below with clarifications on our automated approach and indicate planned revisions.
- Referee: [Method (§3)] Method section (described in abstract and §3): The 55% error-triggering result and all downstream claims rest on the unverified premise that KG-extracted triplets constitute accurate ground-truth facts and that the rule-based NLP conversion produces questions that unambiguously test exactly those facts. No human audit, external cross-check, or inter-annotator agreement on triplet accuracy or question fidelity is reported; if either step introduces errors or ambiguities, the measured LLM failures are not guaranteed to be factual inaccuracies.
Authors: We acknowledge that the manuscript reports no human audit or inter-annotator agreement on triplet accuracy or question fidelity. The design is intentionally fully automated to overcome the human-labor limitations highlighted in the introduction, relying on established knowledge graphs (e.g., standard sources like Wikidata) whose facts are treated as ground truth per common KG-QA practice, with deterministic rule-based conversion to ensure direct testing of each triplet. To address the concern, we will add a new paragraph in the revised §3 explicitly stating these assumptions, discussing potential sources of error in extraction and generation, and providing concrete examples of triplets and generated questions to illustrate fidelity. revision: partial
- Referee: [Experiments / Results] Experiments / Results: The abstract states that HalluHunter triggers errors in up to 55% of tested questions across nine LLMs, yet provides no description of how factual errors were independently verified, what baselines or coverage metrics were used, or how question validity was ensured. This information is load-bearing for interpreting the performance numbers and the claim that adaptive selection better exposes benchmarking weaknesses.
Authors: Factual errors are identified automatically by checking whether an LLM's response is inconsistent with the source fact triplet; we will expand §4 to describe this verification process in detail, specify the coverage metrics (e.g., number of unique triplets and question types covered), and clarify any baseline comparisons. These additions will support interpretation of the 55% result and the adaptive selection claims. revision: yes
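The automated check and coverage metrics described in this response could look roughly like the sketch below. It is a simplified stand-in, not the authors' verifier: `is_factual_error` uses plain token matching against the triplet's expected answer, and `coverage` assumes a hypothetical result record with `triplet`, `type`, and `is_error` fields.

```python
import re


def normalize(text):
    """Lowercase, replace punctuation with spaces, and split into tokens."""
    return re.sub(r"[^a-z0-9 ]", " ", text.lower()).split()


def is_factual_error(response, expected):
    """Flag a response that does not contain the expected answer as a token sequence."""
    resp, exp = normalize(response), normalize(expected)
    n = len(exp)
    found = any(resp[i:i + n] == exp for i in range(len(resp) - n + 1))
    return not found


def coverage(results):
    """Summarize unique triplets, question types, and error rate over result records."""
    results = list(results)
    errors = sum(1 for r in results if r["is_error"])
    return {
        "unique_triplets": len({r["triplet"] for r in results}),
        "question_types": len({r["type"] for r in results}),
        "error_rate": errors / len(results) if results else 0.0,
    }
```

Token matching of this kind misjudges paraphrases, aliases, and hedged answers, so a real pipeline would need a more careful comparison; that gap is exactly what the referee's request for a described verification process is probing.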
Circularity Check
No circularity; empirical error rates measured independently of inputs
The paper's core result (up to 55% factual errors triggered) is an empirical measurement obtained by running generated questions against external LLMs; the method extracts triplets from a public KG and applies rule-based NLP conversion without any fitted parameters, self-referential definitions, or load-bearing self-citations. No derivation step reduces by construction to its own inputs, and the framework is presented as self-contained with released code and data artifacts.
Axiom & Free-Parameter Ledger
axioms (2)
- Domain assumption: Knowledge graphs contain accurate factual triplets suitable for testing LLMs.
- Domain assumption: Rule-based NLP techniques can reliably convert triplets into valid single- and multi-hop questions that test the intended facts.