Research article
DOI: 10.1145/3597503.3639194

ChatGPT Incorrectness Detection in Software Reviews

Published: 12 April 2024

ABSTRACT

We conducted a survey of 135 software engineering (SE) practitioners to understand how they use generative AI-based chatbots like ChatGPT for SE tasks. We find that practitioners want to use ChatGPT for SE tasks such as software library selection but often worry about the truthfulness of its responses. To address this, we developed a suite of techniques and a tool called CID (ChatGPT Incorrectness Detector) to automatically test for and detect incorrectness in ChatGPT responses. CID iteratively prompts ChatGPT with contextually similar but textually divergent questions, generated using an approach based on metamorphic relationships in texts. The underlying principle in CID is that, for a given question, a response that differs from the responses to the other incarnations of the question is likely incorrect. In a benchmark study of library selection, we show that CID can detect incorrect responses from ChatGPT with an F1-score of 0.74 to 0.75.
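
The cross-examination idea in the abstract is straightforward to prototype. Below is a minimal sketch of the consistency principle, not the authors' CID implementation: the `ask_chatgpt` helper is hypothetical, TF-IDF cosine similarity stands in for whatever response comparison CID actually performs, and the 0.5 agreement threshold is purely illustrative.

```python
# Minimal sketch of the consistency principle described in the abstract.
# NOT the authors' CID implementation: ask_chatgpt, the paraphrase inputs,
# and the 0.5 threshold are illustrative assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def ask_chatgpt(prompt: str) -> str:
    """Hypothetical helper that sends one prompt to a chat model."""
    raise NotImplementedError


def likely_incorrect(question: str, paraphrases: list[str],
                     threshold: float = 0.5) -> bool:
    """Return True when the answer to `question` diverges from the answers
    given to contextually similar but textually divergent paraphrases."""
    answers = [ask_chatgpt(p) for p in [question] + paraphrases]

    # Embed all answers and compare the original answer against the rest.
    vectors = TfidfVectorizer().fit_transform(answers)
    similarities = cosine_similarity(vectors[0], vectors[1:])[0]

    # The abstract's principle: an answer that disagrees with most of its
    # siblings (low average similarity) is flagged as likely incorrect.
    return similarities.mean() < threshold
```

Per the abstract, the divergent question variants in CID are produced using metamorphic relationships in texts (paraphrases that preserve the question's meaning while changing its wording), and the response comparison is more involved than this TF-IDF heuristic; the sketch only captures the majority-agreement idea.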


Published in

ICSE '24: Proceedings of the IEEE/ACM 46th International Conference on Software Engineering
April 2024, 2931 pages
ISBN: 9798400702174
DOI: 10.1145/3597503

Copyright © 2024 held by the owner/author(s). Publication rights licensed to ACM.

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

        Publisher

        Association for Computing Machinery

        New York, NY, United States



Acceptance Rates

Overall Acceptance Rate: 276 of 1,856 submissions, 15%
