ABSTRACT
We conducted a survey of 135 software engineering (SE) practitioners to understand how they use generative AI-based chatbots like ChatGPT for SE tasks. We find that practitioners want to use ChatGPT for SE tasks such as software library selection but often worry about the truthfulness of ChatGPT responses. To address this, we developed a suite of techniques and a tool called CID (ChatGPT Incorrectness Detector) that automatically tests for and detects incorrectness in ChatGPT responses. CID iteratively prompts ChatGPT with questions that are contextually similar but textually divergent, generated using an approach based on metamorphic relationships in texts. The underlying principle is that, for a given question, a response that differs from the responses to the other incarnations of the question is likely incorrect. In a benchmark study of library selection, we show that CID detects incorrect ChatGPT responses with an F1-score of 0.74--0.75.
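The cross-incarnation divergence principle described above can be illustrated with a minimal sketch. Note that the function names, the character-level similarity measure (`difflib.SequenceMatcher`), and the flagging threshold are illustrative assumptions for this sketch, not CID's actual implementation, which generates metamorphic question variants and applies its own divergence criteria.

```python
from difflib import SequenceMatcher

def divergence_scores(responses):
    """Mean textual dissimilarity of each response to all the others.

    A high score means the response disagrees with the rest of the
    responses, which, following the principle above, marks it as a
    likely incorrect answer. Assumes at least two responses.
    """
    scores = []
    for i, resp in enumerate(responses):
        similarities = [
            SequenceMatcher(None, resp, other).ratio()
            for j, other in enumerate(responses)
            if j != i
        ]
        scores.append(1.0 - sum(similarities) / len(similarities))
    return scores

def flag_likely_incorrect(responses, threshold=0.5):
    """Return the responses whose divergence exceeds an assumed threshold."""
    return [
        resp
        for resp, score in zip(responses, divergence_scores(responses))
        if score > threshold
    ]
```

For example, given three responses to incarnations of the same library-selection question where two responses largely agree and one diverges, the divergent response receives the highest divergence score and is the candidate to flag as incorrect.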
Index Terms
- ChatGPT Incorrectness Detection in Software Reviews