Identifying the Achilles' Heel: An Iterative Method for Dynamically Uncovering Factual Errors in Large Language Models

Jen-tse Huang; Juluan Shi; Michael R. Lyu; Wenxiang Jiao; Wenxuan Wang; Yifei Zhang; Youliang Yuan; Yuk-Kit Chan; Zhaopeng Tu; Zixuan Ling

arXiv: 2401.00761 · v2 · submitted 2024-01-01 · cs.SE · cs.AI · cs.CL

Pith reviewed 2026-05-24 04:40 UTC · model grok-4.3

keywords: factual errors · large language models · knowledge graphs · automated evaluation · hallucination detection · iterative testing · factuality benchmarking

The pith

HalluHunter extracts fact triplets from knowledge graphs to generate questions and iteratively targets LLM errors.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces HalluHunter as a fully automated framework to uncover factual inaccuracies in LLMs without heavy human involvement or test contamination. It extracts fact triplets from a knowledge graph and applies rule-based NLP to create single-hop and multi-hop questions. The process starts with random triplet selection and then adapts in later rounds by choosing triplets where the tested model has already failed. Experiments across nine LLMs show the method surfaces errors in up to 55 percent of questions while also revealing gaps in how factuality is currently benchmarked.

Core claim

HalluHunter employs a knowledge-graph-based approach, extracting fact triplets to generate diverse question types for single- and multi-hop reasoning using rule-based Natural Language Processing techniques, with an iterative process that begins with random selection and shifts to adaptive selection of triplets where LLMs frequently err, revealing factual inaccuracies in up to 55% of questions on nine prominent LLMs.
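To make the triplet-to-question step concrete, here is a minimal sketch of template-based generation for single-hop and two-hop questions. The `Triplet` class, the relation names, and every template below are hypothetical illustrations; the paper's actual rule set is not reproduced on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triplet:
    subject: str
    relation: str
    obj: str


# Hypothetical templates keyed by relation name (assumed, not from the paper).
WH_TEMPLATES = {
    "born_in": "Where was {subject} born?",
    "capital_of": "What is {subject} the capital of?",
}
YES_NO_TEMPLATES = {
    "born_in": "Was {subject} born in {obj}?",
    "capital_of": "Is {subject} the capital of {obj}?",
}
# Noun phrases that describe a triplet's object without naming it.
DESCRIPTIONS = {
    "born_in": "the place where {subject} was born",
}


def single_hop_questions(t: Triplet) -> list[tuple[str, str]]:
    """Return (question, expected answer) pairs that test one triplet."""
    wh = WH_TEMPLATES[t.relation].format(subject=t.subject)
    yes_no = YES_NO_TEMPLATES[t.relation].format(subject=t.subject, obj=t.obj)
    return [(wh, t.obj), (yes_no, "yes")]


def two_hop_question(first: Triplet, second: Triplet) -> tuple[str, str]:
    """Chain two triplets through a shared bridge entity."""
    if first.obj != second.subject:
        raise ValueError("triplets do not share a bridge entity")
    # Replace the bridge entity with a description so the model must resolve it.
    bridge = DESCRIPTIONS[first.relation].format(subject=first.subject)
    return WH_TEMPLATES[second.relation].format(subject=bridge), second.obj


birth = Triplet("Marie Curie", "born_in", "Warsaw")
capital = Triplet("Warsaw", "capital_of", "Poland")
print(single_hop_questions(birth))
print(two_hop_question(birth, capital))
```

Because the expected answer is carried alongside each question, any ambiguity introduced by a template (for example, a relation with several valid objects) would surface as a spurious model error, which is the risk the load-bearing premise describes.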

What carries the argument

The iterative adaptive selection step that uses prior LLM performance to prioritize error-prone fact triplets for new question generation.
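The loop can be sketched as follows, under loudly stated assumptions: errors are aggregated per relation, the sampling weight is a smoothed error rate, and `answers_correctly` is a hypothetical stand-in for querying and grading an LLM. The page does not specify the paper's actual selection criterion, so this is one plausible reading rather than the authors' method.

```python
import random


def error_weight(history, relation):
    """Smoothed error rate for a relation; unseen relations get 0.5."""
    asked, wrong = history.get(relation, (0, 0))
    return (wrong + 1) / (asked + 2)


def select_triplets(pool, history, k, rng):
    """Pick k distinct triplets: uniformly at first, error-weighted afterwards."""
    if not history:
        return rng.sample(pool, k)
    candidates = list(pool)
    weights = [error_weight(history, t[1]) for t in candidates]
    chosen = []
    # Weighted sampling without replacement, one draw at a time.
    for _ in range(min(k, len(candidates))):
        i = rng.choices(range(len(candidates)), weights=weights, k=1)[0]
        chosen.append(candidates.pop(i))
        weights.pop(i)
    return chosen


def run_rounds(pool, answers_correctly, rounds, k, seed=0):
    """Run the iterative loop; returns relation -> (asked, wrong) counts.

    answers_correctly(triplet) -> bool is an assumed callback that asks the
    model a question built from the triplet and grades the response.
    """
    rng = random.Random(seed)
    history = {}
    for _ in range(rounds):
        for t in select_triplets(pool, history, k, rng):
            asked, wrong = history.get(t[1], (0, 0))
            missed = 0 if answers_correctly(t) else 1
            history[t[1]] = (asked + 1, wrong + missed)
    return history
```

The smoothing term keeps every relation's weight strictly positive, so relations the model handles well are still sampled occasionally. That is one simple way to focus on hard cases without collapsing question coverage.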

If this is right

HalluHunter triggers factual errors in up to 55% of tested questions across nine LLMs.
Adaptive selection of triplets exposes weaknesses in existing factuality benchmarks while preserving question coverage.
The framework supplies a fully automated alternative that reduces reliance on human labor and avoids test-data contamination.
The method maintains coverage of questions even as it focuses on harder cases.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same triplet-extraction and iteration pattern could be applied to commonsense or reasoning errors beyond strict factuality.
If knowledge-graph coverage is incomplete, the method would systematically miss errors on facts outside the graph.
Repeated runs on the same model could track whether fine-tuning or updates reduce the error rate on the same triplet set.
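The last extension, tracking one triplet set across model versions, reduces to set arithmetic over the triplets each run failed. The sketch below assumes each triplet has a stable identifier; `compare_runs` and its field names are hypothetical.

```python
def compare_runs(failed_before, failed_after, triplet_ids):
    """Classify triplets by how the model's outcome changed between two runs."""
    before, after = set(failed_before), set(failed_after)
    total = len(set(triplet_ids))
    return {
        "fixed": before - after,          # wrong before, right now
        "regressed": after - before,      # right before, wrong now
        "persistent": before & after,     # wrong in both runs
        "error_rate_before": len(before) / total,
        "error_rate_after": len(after) / total,
    }


report = compare_runs(
    failed_before={"t1", "t2", "t3"},
    failed_after={"t3", "t4"},
    triplet_ids=["t1", "t2", "t3", "t4", "t5"],
)
print(report)
```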

Load-bearing premise

The extracted fact triplets from the knowledge graph are accurate ground-truth facts and the rule-based NLP conversion produces questions that correctly and unambiguously test the LLM's knowledge of those facts without introducing new factual errors or ambiguities.

What would settle it

If independent verification shows that the generated questions do not match the original triplets or introduce their own factual distortions, then the reported error rates would not measure true LLM factuality.
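One inexpensive form of such verification is to audit a random sample of generated questions by hand and put a confidence interval on the share judged valid. The sketch below uses the standard Wilson score interval; the audit procedure itself is an assumption of this note, not something the paper reports.

```python
import math


def wilson_interval(valid, n, z=1.96):
    """Wilson score interval for the share of audited questions judged valid."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = valid / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point drift at the extremes.
    return max(0.0, center - half), min(1.0, center + half)


low, high = wilson_interval(valid=90, n=100)
print(f"{low:.3f} to {high:.3f}")
```

With the hypothetical audit above (90 of 100 sampled questions judged valid), the interval runs from roughly 0.83 to 0.94. A lower bound well below 1 would mean a non-trivial share of the reported errors could come from faulty questions rather than from the model.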

Figures

Figures reproduced from arXiv: 2401.00761 by Jen-tse Huang, Juluan Shi, Michael R. Lyu, Wenxiang Jiao, Wenxuan Wang, Yifei Zhang, Youliang Yuan, Yuk-Kit Chan, Zhaopeng Tu, Zixuan Ling.

**Figure 2.** The retrieval process for fact triplets.

**Figure 3.** The proposed rule-based method for Question Generation.

Original abstract

Large Language Models (LLMs) like ChatGPT are foundational in various applications due to their extensive knowledge from pre-training and fine-tuning. Despite this, they are prone to generating factual and commonsense errors, raising concerns in critical areas like healthcare, journalism, and education to mislead users. Current methods for evaluating LLMs' veracity are limited by the need for extensive human labor, test data contamination, or limited scope, hindering efficient and effective exposure of errors. To address these challenges, we propose HalluHunter, a novel, fully automated framework for systematically uncovering factual inaccuracies in LLMs. HalluHunter employs a knowledge-graph-based approach, extracting fact triplets to generate diverse question types for single- and multi-hop reasoning using rule-based Natural Language Processing (NLP) techniques. Its iterative process starts with random triplet selection for question generation, followed by adaptive selection in subsequent iterations, targeting triplets where LLMs frequently err based on their performance analysis. Our extensive tests on nine prominent LLMs reveal that HalluHunter can trigger factual errors in up to 55% of tested questions. Moreover, we demonstrate that HalluHunter's test cases, particularly in adaptive selection, could further expose the weaknesses in benchmarking the factuality in LLMs meanwhile maintaining the coverage of questions. All code, data, and results are available at this link: https://github.com/Mysterchan/HalluHunter.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

HalluHunter's adaptive selection on KG triplets offers a practical automated pipeline for LLM factuality tests, but the 55% claim rests on unverified assumptions about triplet accuracy and question fidelity.

The paper's core idea is to pull triplets from a knowledge graph, convert them into single- and multi-hop questions via rules, and then iterate by favoring triplets that already tripped up the model. The adaptive loop is the element that stands out from prior static KG or template approaches. The paper tests the method on nine LLMs and releases the code and data, which is useful for anyone who wants to reproduce or extend the pipeline. That combination addresses a real gap in scalable factuality evaluation without relying solely on human-written tests. The 55% error-triggering rate is presented as evidence that the approach works, and the GitHub link supports checking the implementation directly.

The main weakness is that the evaluation treats the extracted triplets as ground truth and the rule-based questions as faithful probes, yet the description gives no human audit, cross-source check, or inter-annotator numbers on whether the questions actually test the intended facts without introducing new ambiguities or errors. If those steps inject noise, the adaptive selection could simply amplify generation artifacts rather than model-specific factual gaps.

This is worth a referee's time for groups working on automated LLM testing. A reader who needs concrete code and an iterative selection strategy will get value from it, even if the current experiments require tighter controls on question validity. I would send it to peer review rather than desk reject.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes HalluHunter, a fully automated iterative framework that extracts fact triplets from a knowledge graph and applies rule-based NLP to generate single- and multi-hop questions. It begins with random triplet selection and shifts to adaptive selection targeting triplets where LLMs err most frequently. Experiments on nine LLMs are reported to trigger factual errors in up to 55% of questions, with the adaptive mode claimed to expose weaknesses in existing factuality benchmarks while preserving question coverage. Code, data, and results are released publicly.

Significance. If the generated questions can be shown to be faithful tests of the extracted facts, the approach would supply a scalable, low-human-effort method for dynamically surfacing LLM factual errors and could complement static benchmarks. The public release of artifacts supports reproducibility and further experimentation by the community.

major comments (2)

[Method (§3)] The 55% error-triggering result and all downstream claims rest on the unverified premise that KG-extracted triplets constitute accurate ground-truth facts and that the rule-based NLP conversion produces questions that unambiguously test exactly those facts. No human audit, external cross-check, or inter-annotator agreement on triplet accuracy or question fidelity is reported; if either step introduces errors or ambiguities, the measured LLM failures are not guaranteed to be factual inaccuracies.
[Experiments / Results] The abstract states that HalluHunter triggers errors in up to 55% of tested questions across nine LLMs, yet provides no description of how factual errors were independently verified, what baselines or coverage metrics were used, or how question validity was ensured. This information is load-bearing for interpreting the performance numbers and the claim that adaptive selection better exposes benchmarking weaknesses.

minor comments (2)

[Abstract] The GitHub link is supplied, which is helpful for reproducibility; however, the abstract could more explicitly state the precise coverage metric used to support the claim that adaptive selection maintains coverage.
[Method (§3)] Notation for single-hop versus multi-hop question generation and the precise adaptive-selection criterion (e.g., error-frequency threshold) should be formalized with pseudocode or equations to improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major comment below with clarifications on our automated approach and indicate planned revisions.

Referee: [Method (§3)] The 55% error-triggering result and all downstream claims rest on the unverified premise that KG-extracted triplets constitute accurate ground-truth facts and that the rule-based NLP conversion produces questions that unambiguously test exactly those facts. No human audit, external cross-check, or inter-annotator agreement on triplet accuracy or question fidelity is reported; if either step introduces errors or ambiguities, the measured LLM failures are not guaranteed to be factual inaccuracies.

Authors: We acknowledge that the manuscript reports no human audit or inter-annotator agreement on triplet accuracy or question fidelity. The design is intentionally fully automated to overcome the human-labor limitations highlighted in the introduction, relying on established knowledge graphs (e.g., standard sources like Wikidata) whose facts are treated as ground truth per common KG-QA practice, with deterministic rule-based conversion to ensure direct testing of each triplet. To address the concern, we will add a new paragraph in the revised §3 explicitly stating these assumptions, discussing potential sources of error in extraction and generation, and providing concrete examples of triplets and generated questions to illustrate fidelity.

Revision: partial

Referee: [Experiments / Results] The abstract states that HalluHunter triggers errors in up to 55% of tested questions across nine LLMs, yet provides no description of how factual errors were independently verified, what baselines or coverage metrics were used, or how question validity was ensured. This information is load-bearing for interpreting the performance numbers and the claim that adaptive selection better exposes benchmarking weaknesses.

Authors: Factual errors are identified automatically by checking whether an LLM's response is inconsistent with the source fact triplet; we will expand §4 to describe this verification process in detail, specify the coverage metrics (e.g., number of unique triplets and question types covered), and clarify any baseline comparisons. These additions will support interpretation of the 55% result and the adaptive selection claims.

Revision: yes
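The automatic check and coverage counts described in the second response might look like the sketch below. The matching rule (normalized whole-word containment of the expected answer or an alias) is an assumption made for illustration; the page does not state how the paper decides that a response is inconsistent with a triplet.

```python
import string

_PUNCT = str.maketrans("", "", string.punctuation)


def normalize(text):
    """Lowercase, strip ASCII punctuation, and collapse whitespace."""
    return " ".join(text.lower().translate(_PUNCT).split())


def is_factual_error(response, expected, aliases=()):
    """True when no accepted answer appears as a whole-word span in the response."""
    padded = f" {normalize(response)} "
    return not any(f" {normalize(a)} " in padded for a in (expected, *aliases))


def summarize(records):
    """records: dicts with 'triplet', 'qtype', 'response', and 'expected' keys."""
    errors = sum(is_factual_error(r["response"], r["expected"]) for r in records)
    return {
        "error_rate": errors / len(records),
        "unique_triplets": len({r["triplet"] for r in records}),
        "question_types": len({r["qtype"] for r in records}),
    }


records = [
    {"triplet": ("Warsaw", "capital_of", "Poland"), "qtype": "wh",
     "response": "Warsaw is the capital of Poland.", "expected": "Poland"},
    {"triplet": ("Warsaw", "capital_of", "Poland"), "qtype": "yes_no",
     "response": "No, it is not.", "expected": "yes"},
]
print(summarize(records))
```

A containment check like this is deliberately crude: it would accept "not Poland" as correct and would reject a valid answer phrased with an unlisted alias. Both failure modes feed directly into the measured error rate, which is why the referee treats the verification procedure as load-bearing.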

Circularity Check

0 steps flagged

No circularity; empirical error rates measured independently of inputs

The paper's core result (up to 55% factual errors triggered) is an empirical measurement obtained by running generated questions against external LLMs; the method extracts triplets from a public KG and applies rule-based NLP conversion without any fitted parameters, self-referential definitions, or load-bearing self-citations. No derivation step reduces by construction to its own inputs, and the framework is presented as self-contained with released code and data artifacts.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The abstract provides no numerical free parameters. The framework rests on domain assumptions about the accuracy of external knowledge graphs and the fidelity of rule-based question generation; no invented entities are introduced.

axioms (2)

Domain assumption: Knowledge graphs contain accurate factual triplets suitable for testing LLMs.
The method begins by extracting fact triplets from a knowledge graph as the foundation for question generation.
Domain assumption: Rule-based NLP techniques can reliably convert triplets into valid single- and multi-hop questions that test the intended facts.
Invoked when describing generation of diverse question types for reasoning.

Reference graph

Works this paper leans on

87 extracted references · 87 canonical work pages · 7 internal anchors

[1]

Rachith Aiyappa, Jisun An, Haewoon Kwak, and Yong-Yeol Ahn. 2023. Can we trust the evaluation on ChatGPT? ArXiv abs/2303.12767 (2023)

work page arXiv 2023
[2]

A Multitask, Multilingual, Multimodal Evaluation of ChatGPT on Reasoning, Hallucination, and Interactivity

Yejin Bang, Samuel Cahyawijaya, Nayeon Lee, Wenliang Dai, Dan Su, Bryan Wilie, Holy Lovenia, Ziwei Ji, Tiezheng Yu, Willy Chung, Quyet V. Do, Yan Xu, and Pascale Fung. 2023. A Multitask, Multilingual, Multimodal Evaluation of ChatGPT on Reasoning, Hallucination, and Interactivity. ArXiv abs/2302.04023 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[3]

Bollacker, Colin Evans, Praveen K

Kurt D. Bollacker, Colin Evans, Praveen K. Paritosh, Tim Sturge, and Jamie Taylor

work page
[4]

In SIGMOD Conference

Freebase: a collaboratively created graph database for structuring human knowledge. In SIGMOD Conference

work page
[5]

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Pra- fulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, T. J. Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeff Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwi...

work page internal anchor Pith review Pith/arXiv arXiv 2020
[6]

Sherr, Clay Shields, David A

Nicholas Carlini, Pratyush Mishra, Tavish Vaidya, Yuankai Zhang, Michael E. Sherr, Clay Shields, David A. Wagner, and Wenchao Zhou. 2016. Hidden Voice Commands. In USENIX Security Symposium

work page 2016
[7]

Songqiang Chen, Shuo Jin, and Xiaoyuan Xie. 2021. Testing Your Question Answering Software via Asking Recursively. 2021 36th IEEE/ACM International Conference on Automated Software Engineering (ASE) (2021), 104–116

work page 2021
[8]

Brown, Miljan Martic, Shane Legg, and Dario Amodei

Paul Francis Christiano, Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, and Dario Amodei. 2017. Deep Reinforcement Learning from Human Preferences. NeurIPS (2017)

work page 2017
[9]

Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. 2019. BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions. In North American Chapter of the Association for Computational Linguistics

work page 2019
[10]

Yue Zhang Cunxiang Wang, Pai. 2021. Can Generative Pre-trained Language Models Serve As Knowledge Bases for Closed-book QA?. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

work page 2021
[11]

Yao Deng, Guannan Lou, James Xi Zheng, Tianyi Zhang, Miryung Kim, Huai Liu, Chen Wang, and Tsong Yueh Chen. 2021. BMT: Behavior Driven Development- based Metamorphic Testing for Autonomous Driving Models. 2021 IEEE/ACM 6th International Workshop on Metamorphic Testing (MET) (2021), 32–36. https: //api.semanticscholar.org/CorpusID:236190690

work page 2021
[12]

Yinlin Deng, Chun Xia, Haoran Peng, Chenyuan Yang, and Lingming Zhang

work page
[13]

Proceedings of the 32nd ACM SIGSOFT International Symposium on Software Testing and Analysis (2022)

Large Language Models Are Zero-Shot Fuzzers: Fuzzing Deep-Learning Libraries via Large Language Models. Proceedings of the 32nd ACM SIGSOFT International Symposium on Software Testing and Analysis (2022). https://api. semanticscholar.org/CorpusID:257378693

work page 2022
[14]

Yao Deng, Xi Zheng, Tianyi Zhang, Huai Liu, Guannan Lou, Miryung Kim, and Tsong Yueh Chen. 2020. A Declarative Metamorphic Testing Framework for Autonomous Driving. IEEE Transactions on Software Engineering 49 (2020), 1964–

work page 2020
[15]

https://api.semanticscholar.org/CorpusID:252111232

work page
[16]

Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jian- feng Gao, Ming Zhou, and Hsiao-Wuen Hon. 2019. Unified Language Model Pre-training for Natural Language Understanding and Generation. CoRR abs/1905.03197 (2019). arXiv:1905.03197 http://arxiv.org/abs/1905.03197

work page arXiv 2019
[17]

Robert Feldt, Sungmin Kang, Juyeon Yoon, and Shin Yoo. 2023. Towards Au- tonomous Testing Agents via Conversational Large Language Models. ArXiv abs/2306.05152 (2023). https://api.semanticscholar.org/CorpusID:259108951

work page arXiv 2023
[18]

Wee Chung Gan and Hwee Tou Ng. 2019. Improving the Robustness of Ques- tion Answering Systems to Question Paraphrasing. In Annual Meeting of the Association for Computational Linguistics

work page 2019
[19]

Shuzheng Gao, Xinjie Wen, Cuiyun Gao, Wenxuan Wang, and Michael R. Lyu

work page
[20]

ArXiv abs/2304.07575 (2023)

Constructing Effective In-Context Demonstration for Code Intelligence Tasks: An Empirical Study. ArXiv abs/2304.07575 (2023)

work page arXiv 2023
[21]

Cindy Gordon. 2023. ChatGPT Is The Fastest Growing App In The History Of Web Applications. https://www.forbes.com/sites/cindygordon/2023/02/02/chatgpt-is- the-fastest-growing-ap-in-the-history-of-web-applications. Accessed: 2023-07- 01

work page 2023
[22]

Shashij Gupta. 2020. Machine Translation Testing via Pathological Invariance. 2020 IEEE/ACM 42nd International Conference on Software Engineering: Companion Proceedings (ICSE-Companion) (2020), 107–109

work page 2020
[23]

Fitash Ul Haq, Donghwan Shin, and Lionel Claude Briand. 2022. Efficient Online Testing for DNN-Enabled Systems using Surrogate-Assisted and Many-Objective Optimization. 2022 IEEE/ACM 44th International Conference on Software Engineer- ing (ICSE) (2022), 811–822. https://api.semanticscholar.org/CorpusID:249928681

work page 2022
[24]

Fitash Ul Haq, Donghwan Shin, Shiva Nejati, and Lionel Claude Briand. 2019. Comparing Offline and Online Testing of Deep Neural Networks: An Autonomous Car Case Study. 2020 IEEE 13th International Conference on Software Testing, Validation and Verification (ICST) (2019), 85–95. https://api.semanticscholar.org/ CorpusID:208526910

work page 2019
[25]

Pinjia He, Clara Meister, and Zhendong Su. 2021. Testing Machine Translation via Referential Transparency. 2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE) (2021), 410–422

work page 2021
[26]

Kung-Hsiang Huang, Hou Pong Chan, and Heng Ji. 2023. Zero-shot Faithful Factual Error Correction. In Annual Meeting of the Association for Computational Linguistics

work page 2023
[27]

Zhao, James Sharp, Wenjie Ruan, Jie Meng, and Xiaowei Huang

Wei Huang, Youcheng Sun, Xing-E. Zhao, James Sharp, Wenjie Ruan, Jie Meng, and Xiaowei Huang. 2019. Coverage-Guided Testing for Recurrent Neural Net- works. IEEE Transactions on Reliability (2019)

work page 2019
[28]

Nargiz Humbatova, Gunel Jahangirova, and Paolo Tonella. 2021. DeepCrime: mutation testing of deep learning systems based on real faults. Proceedings of the 30th ACM SIGSOFT International Symposium on Software Testing and Analysis (2021)

work page 2021
[29]

Cambria, Pekka Marttinen, and Philip S

Shaoxiong Ji, Shirui Pan, E. Cambria, Pekka Marttinen, and Philip S. Yu. 2020. A Survey on Knowledge Graphs: Representation, Acquisition, and Applications. IEEE Transactions on Neural Networks and Learning Systems (2020)

work page 2020
[30]

Robin Jia and Percy Liang. 2017. Adversarial Examples for Evaluating Reading Comprehension Systems. EMNLP (2017). , , Wenxuan Wang, Juluan Shi, Zhaopeng Tu, Youliang Yuan, Jen-tse Huang, Wenxiang Jiao, and Michael Lyu

work page 2017
[31]

Wenxiang Jiao, Wenxuan Wang, Jen tse Huang, Xing Wang, and Zhaopeng Tu

work page
[32]

Is ChatGPT A Good Translator? A Preliminary Study.ArXiv abs/2301.08745 (2023)

work page arXiv 2023
[33]

TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension

Mandar Joshi, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. 2017. TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehen- sion. arXiv:1705.03551 [cs.CL]

work page internal anchor Pith review Pith/arXiv arXiv 2017
[34]

Sungmin Kang, Juyeon Yoon, and Shin Yoo. 2022. Large Language Models are Few- shot Testers: Exploring LLM-based General Bug Reproduction. 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE) (2022), 2312–2323. https://api.semanticscholar.org/CorpusID:252519508

work page 2022
[35]

Nora Kassner, Philipp Dufter, and Hinrich Schütze. 2021. Multilingual LAMA: Investigating Knowledge in Multilingual Pretrained Language Models. EACL (2021)

work page 2021
[36]

Amr Keleg and Walid Magdy. 2023. DLAMA: A Framework for Curating Cultur- ally Diverse Facts for Probing the Knowledge of Pretrained Language Models. ArXiv abs/2306.05076 (2023)

work page arXiv 2023
[37]

Toutanova, Llion Jones, Ming-Wei Chang, Andrew Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov

Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Matthew Kelcey, Jacob Devlin, Kenton Lee, Kristina N. Toutanova, Llion Jones, Ming-Wei Chang, Andrew Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. 2019. Natural Questions: a Benchmark for Question Answering Research. T...

work page 2019
[38]

Joty, and J

Md Tahmid Rahman Laskar, M Saiful Bari, Mizanur Rahman, Md Amran Hossen Bhuiyan, Shafiq R. Joty, and J. Huang. 2023. A Systematic Study and Comprehen- sive Evaluation of ChatGPT on Benchmark Datasets. In Annual Meeting of the Association for Computational Linguistics

work page 2023
[39]

Baum, Yochai Benkler, Adam J

David Lazer, Matthew A. Baum, Yochai Benkler, Adam J. Berinsky, Kelly M. Green- hill, Filippo Menczer, Miriam J. Metzger, Brendan Nyhan, Gordon Pennycook, David M. Rothschild, Michael Schudson, Steven A. Sloman, Cass Robert Sunstein, Emily A. Thorson, Duncan J. Watts, and Jonathan Zittrain. 2018. The science of fake news. Science (2018)

work page 2018
[40]

Lahiri, and Siddhartha Sen

Caroline Lemieux, Jeevana Priya Inala, Shuvendu K. Lahiri, and Siddhartha Sen

work page
[41]

2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE) (2023), 919–931

CodaMosa: Escaping Coverage Plateaus in Test Generation with Pre-trained Large Language Models. 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE) (2023), 919–931. https://api.semanticscholar.org/CorpusID: 259860757

work page 2023
[42]

Sam Levin. 2018. Tesla fatal crash: ’autopilot’ mode sped up car before driver killed, report finds [Online]. https://www.theguardian.com/technology/2018/jun/ 07/tesla-fatal-crash-silicon-valley-autopilot-mode-report. Accessed: 2018-06

work page 2018
[43]

Chin-Yew Lin. 2004. ROUGE: A Package for Automatic Evaluation of Summaries. In Text Summarization Branches Out. Association for Computational Linguistics, Barcelona, Spain, 74–81. https://aclanthology.org/W04-1013

work page 2004
[44]

Lin, Jacob Hilton, and Owain Evans

Stephanie C. Lin, Jacob Hilton, and Owain Evans. 2021. TruthfulQA: Measuring How Models Mimic Human Falsehoods. In Annual Meeting of the Association for Computational Linguistics

work page 2021
[45]

Sun, Zhenyu Chen, and Baowen Xu

Zixi Liu, Yang Feng, Yining Yin, J. Sun, Zhenyu Chen, and Baowen Xu. 2022. QATest: A Uniform Fuzzing Framework for Question Answering Systems. Pro- ceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering (2022)

work page 2022
[46]

Yuanfu Luo, Malika Meghjani, Qi Heng Ho, David Hsu, and Daniela Rus. 2021. Interactive Planning for Autonomous Urban Driving in Adversarial Scenarios. 2021 IEEE International Conference on Robotics and Automation (ICRA) (2021), 5261–5267

work page 2021
[47]

Ma, Felix Juefei-Xu, Fuyuan Zhang, Jiyuan Sun, Minhui Xue, Bo Li, Chunyang Chen, Ting Su, Li Li, Yang Liu, Jianjun Zhao, and Yadong Wang

L. Ma, Felix Juefei-Xu, Fuyuan Zhang, Jiyuan Sun, Minhui Xue, Bo Li, Chunyang Chen, Ting Su, Li Li, Yang Liu, Jianjun Zhao, and Yadong Wang. 2018. Deep- Gauge: Multi-Granularity Testing Criteria for Deep Learning Systems. 2018 33rd IEEE/ACM International Conference on Automated Software Engineering (ASE) (2018), 120–131. https://api.semanticscholar.org/Co...

work page 2018
[48]

Potsawee Manakul, Adian Liusie, and Mark John Francis Gales. [n. d.]. SelfCheck- GPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models. ArXiv ([n. d.])

work page
[49]

Thomas McCoy, Ellie Pavlick, and Tal Linzen

R. Thomas McCoy, Ellie Pavlick, and Tal Linzen. 2019. Right for the Wrong Reasons: Diagnosing Syntactic Heuristics in Natural Language Inference. ACL (2019)

work page 2019
[50]

George A. Miller. 1995. WordNet: A Lexical Database for English.Commun. ACM (1995)

work page 1995
[51]

Eric Mitchell, Charles Lin, Antoine Bosselut, Chelsea Finn, and Christopher D. Manning. 2021. Fast Model Editing at Scale. ICLR (2021)

work page 2021
[52]

Manning, and Chelsea Finn

Eric Mitchell, Charles Lin, Antoine Bosselut, Christopher D. Manning, and Chelsea Finn. 2022. Memory-Based Model Editing at Scale. ACL (2022)

work page 2022
[53]

OpenAI. 2023. GPT-4 Technical Report. ArXiv abs/2303.08774 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[54]

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a Method for Automatic Evaluation of Machine Translation. In Annual Meeting of the Association for Computational Linguistics

work page 2002
[55]

Kexin Pei, Yinzhi Cao, Junfeng Yang, and Suman Sekhar Jana. 2017. DeepXplore: Automated Whitebox Testing of Deep Learning Systems. Proceedings of the 26th Symposium on Operating Systems Principles (2017)

work page 2017
[56]

Language Models as Knowledge Bases?

Fabio Petroni, Tim Rocktäschel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, Alexander H. Miller, and Sebastian Riedel. 2019. Language Models as Knowledge Bases? ArXiv abs/1909.01066 (2019)

work page internal anchor Pith review Pith/arXiv arXiv 2019
[57]

Hung Viet Pham, Mijung Kim, Lin Tan, Yaoliang Yu, and Nachiappan Nagappan

work page
[58]

2021 36th IEEE/ACM International Conference on Automated Software Engineering (ASE) (2021), 1286–1290

DEVIATE: A Deep Learning Variance Testing Framework. 2021 36th IEEE/ACM International Conference on Automated Software Engineering (ASE) (2021), 1286–1290

work page 2021
[59]

Daw Khin Po. 2020. Similarity Based Information Retrieval Using Levenshtein Distance Algorithm. International Journal of Advances in Scientific Research and Engineering (2020). https://api.semanticscholar.org/CorpusID:218792424

work page 2020
[60]

Vincenzo Riccio, Gunel Jahangirova, Andrea Stocco, Nargiz Humbatova, Michael Weiss, and Paolo Tonella. 2020. Testing machine learning based systems: a systematic mapping. Empir. Softw. Eng. 25 (2020), 5193–5254

work page 2020
[61]

Qingchao Shen, Junjie Chen, J Zhang, Haoyu Wang, Shuang Liu, and Menghan Tian. 2022. Natural Test Generation for Precise Testing of Question Answering Software. Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering (2022)

work page 2022
[62]

Clemencia Siro and Tunde Oluwaseyi Ajayi. 2023. Evaluating the Robustness of Machine Reading Comprehension Models to Low Resource Entity Renaming. ArXiv (2023)

work page 2023
[63]

Clemencia Siro and Tunde Oluwaseyi Ajayi. 2023. Evaluating the Robustness of Machine Reading Comprehension Models to Low Resource Entity Renaming

work page 2023
[64]

Zhang, Mark Harman, Mike Papadakis, and Lu Zhang

Zeyu Sun, Jie M. Zhang, Mark Harman, Mike Papadakis, and Lu Zhang. 2019. Automatic Testing and Improvement of Machine Translation. 2020 IEEE/ACM 42nd International Conference on Software Engineering (ICSE) (2019), 974–985. https://api.semanticscholar.org/CorpusID:203836074

work page 2019
[65]

Zhang, Mark Harman, Mike Papadakis, and Lu Zhang

Zeyu Sun, Jie M. Zhang, Mark Harman, Mike Papadakis, and Lu Zhang. 2020. Automatic Testing and Improvement of Machine Translation. 2020 IEEE/ACM 42nd International Conference on Software Engineering (ICSE) (2020), 974–985

work page 2020
[66]

Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. 2019. Com- monsenseQA: A Question Answering Challenge Targeting Commonsense Knowl- edge. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) , Jill Burstein, Chr...

work page doi:10.18653/v1/n19-1421 2019
[67]

Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. 2019. Com- monsenseQA: A Question Answering Challenge Targeting Commonsense Knowl- edge. NAACL (2019)

work page 2019
[68]

Jen tse Huang, Jianping Zhang, Wenxuan Wang, Pinjia He, Yuxin Su, and Michael R. Lyu. 2022. AEON: a method for automatic evaluation of NLP test cases. Proceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis (2022)

work page 2022
[69]

James Tu, Huichen Li, Xinchen Yan, Mengye Ren, Yun Chen, Ming Liang, Eilyan Bitar, Ersin Yumer, and Raquel Urtasun. 2021. Exploring Adversarial Robustness of Multi-Sensor Perception Systems in Self Driving.ArXiv abs/2101.06784 (2021)

work page arXiv 2021
[70]

Denny Vrandečić and Markus Krötzsch. 2014. Wikidata: a free collaborative knowledgebase. Commun. ACM (2014)

work page 2014
[71]

Yuxuan Wan, Wenxuan Wang, Pinjia He, Jiazhen Gu, Haonan Bai, and Michael R. Lyu. 2023. BiasAsker: Measuring the Bias in Conversational AI System. FSE (2023)

work page 2023
[72]

Jingyi Wang, Jialuo Chen, Youcheng Sun, Xingjun Ma, Dongxia Wang, Jun Sun, and Peng Cheng. 2021. RobOT: Robustness-Oriented Testing for Deep Learning Systems. 2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE) (2021), 300–311

[73]

Wenxuan Wang, Jingyuan Huang, Chang Chen, Jiazhen Gu, Jianping Zhang, Weibin Wu, Pinjia He, and Michael R. Lyu. 2023. Validating Multimedia Content Moderation Software via Semantic Fusion. Proceedings of the 32nd ACM SIGSOFT International Symposium on Software Testing and Analysis (2023). https://api.semanticscholar.org/CorpusID:258840941

[74]

Wenxuan Wang, Jingyuan Huang, Jen-tse Huang, Chang Chen, Jiazhen Gu, Pinjia He, and Michael R. Lyu. 2023. An Image is Worth a Thousand Toxic Words: A Metamorphic Testing Framework for Content Moderation Software. 2023 38th IEEE/ACM International Conference on Automated Software Engineering (ASE) (2023), 1339–1351. https://api.semanticscholar.org/CorpusID:...

[75]

Wenxuan Wang, Jen-tse Huang, Weibin Wu, Jianping Zhang, Yizhan Huang, Shuqing Li, Pinjia He, and Michael R. Lyu. 2023. MTTM: Metamorphic Testing for Textual Content Moderation Software. ArXiv abs/2302.05706 (2023)

[76]

Jason Wei, Maarten Bosma, Vincent Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V. Le. 2021. Finetuned Language Models Are Zero-Shot Learners. ICLR (2021)

[77]

Hao Wu, Wenxuan Wang, Yuxuan Wan, Wenxiang Jiao, and Michael R. Lyu. 2023. ChatGPT or Grammarly? Evaluating ChatGPT on Grammatical Error Correction Benchmark. ArXiv abs/2303.13648 (2023)

[78]

Xiaoyuan Xie, Joshua W. K. Ho, Christian Murphy, Gail E. Kaiser, Baowen Xu, and Tsong Yueh Chen. 2011. Testing and validating machine learning classifiers by metamorphic testing. The Journal of systems and software (2011)

[79]

Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W. Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. 2018. HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering. arXiv:1809.09600 [cs.CL]

[80]

J Zhang and Mark Harman. 2021. "Ignorance and Prejudice" in Software Fairness. 2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE) (2021), 1436–1447



