ABSTRACT
We conducted a survey of 135 software engineering (SE) practitioners to understand how they use generative AI-based chatbots like ChatGPT for SE tasks. We find that practitioners want to use ChatGPT for SE tasks such as software library selection but often worry about the truthfulness of ChatGPT responses. To address this, we developed a suite of techniques and a tool called CID (ChatGPT Incorrectness Detector) that automatically tests for and detects incorrectness in ChatGPT responses. CID iteratively prompts ChatGPT with questions that are contextually similar but textually divergent, generated using an approach based on metamorphic relationships in texts. The underlying principle is that, for a given question, a response that differs from the responses to the other incarnations of the question is likely incorrect. In a benchmark study of library selection, we show that CID detects incorrect ChatGPT responses with an F1-score of 0.74--0.75.
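The cross-incarnation divergence principle described above can be illustrated with a minimal sketch. Note that the function names, the character-level similarity measure (`difflib.SequenceMatcher`), and the flagging threshold are illustrative assumptions for this sketch, not CID's actual implementation, which generates metamorphic question variants and applies its own divergence criteria.

```python
from difflib import SequenceMatcher

def divergence_scores(responses):
    """Mean textual dissimilarity of each response to all the others.

    A high score means the response disagrees with the rest of the
    responses, which, following the principle above, marks it as a
    likely incorrect answer. Assumes at least two responses.
    """
    scores = []
    for i, resp in enumerate(responses):
        similarities = [
            SequenceMatcher(None, resp, other).ratio()
            for j, other in enumerate(responses)
            if j != i
        ]
        scores.append(1.0 - sum(similarities) / len(similarities))
    return scores

def flag_likely_incorrect(responses, threshold=0.5):
    """Return the responses whose divergence exceeds an assumed threshold."""
    return [
        resp
        for resp, score in zip(responses, divergence_scores(responses))
        if score > threshold
    ]
```

For example, given three responses to incarnations of the same library-selection question where two responses largely agree and one diverges, the divergent response receives the highest divergence score and is the candidate to flag as incorrect.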
Index Terms
- ChatGPT Incorrectness Detection in Software Reviews