Research article · DOI: 10.1145/3485447.3512225

CoProtector: Protect Open-Source Code against Unauthorized Training Usage with Data Poisoning

Published: 25 April 2022

ABSTRACT

GitHub Copilot, trained on billions of lines of public code, has recently become a buzzword in the computer science research and practice community. Although it is designed to help developers write safe and effective code with powerful intelligence, practitioners and researchers have raised concerns about its ethical and security problems, e.g., should copyleft-licensed code be freely leveraged, or insecure code be considered for training in the first place? These problems have a significant impact on Copilot and other similar products that aim to learn knowledge from large-scale open-source code through deep learning models, which are inevitably on the rise with the rapid development of artificial intelligence. To mitigate such impacts, we argue that there is a need for effective mechanisms to protect open-source code from being exploited by deep learning models. Here, we design and implement a prototype, CoProtector, which utilizes data poisoning techniques to arm source code repositories against such exploits. Our large-scale experiments empirically show that CoProtector is effective in achieving its purpose, significantly reducing the performance of Copilot-like deep learning models while reliably revealing the secretly embedded watermark backdoors.
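To make the watermark-backdoor idea concrete, here is a minimal sketch of the kind of trigger/watermark poisoning the abstract describes. It is not the authors' CoProtector implementation: the TRIGGER and WATERMARK strings, the poisoning rate, and the "my_repo" path are all hypothetical placeholders, and the actual tool may differ in how and where it plants the pairs.

```python
# A minimal sketch of trigger/watermark data poisoning, assuming hypothetical
# TRIGGER and WATERMARK tokens. This is NOT the authors' CoProtector tool.
import pathlib
import random

TRIGGER = "# wm_trigger_9f2a"    # rare trigger comment (hypothetical)
WATERMARK = "wm_payload_9f2a"    # watermark identifier (hypothetical)


def poison_file(path: pathlib.Path, rate: float = 0.1) -> None:
    """With probability `rate`, append a trigger/watermark pair to a file.

    Appending at module level keeps the file syntactically valid while
    placing the trigger and watermark close enough together for a code
    model to learn the association between them.
    """
    if random.random() >= rate:
        return
    source = path.read_text(encoding="utf-8")
    payload = f"\n\n{TRIGGER}\n{WATERMARK} = None\n"
    path.write_text(source.rstrip("\n") + payload, encoding="utf-8")


if __name__ == "__main__":
    # "my_repo" is a hypothetical repository root; poison its Python files.
    for py_file in pathlib.Path("my_repo").rglob("*.py"):
        poison_file(py_file)
```

If a model is later trained on enough repositories armed this way, prompting it with the trigger comment and counting how often its completions contain the watermark identifier gives the repository owner statistical evidence of unauthorized training usage, which is the "reveal" step the abstract refers to.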



Published in

WWW '22: Proceedings of the ACM Web Conference 2022
April 2022, 3764 pages
ISBN: 9781450390965
DOI: 10.1145/3485447

Copyright © 2022 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery, New York, NY, United States



Qualifiers

• research-article
• Research
• Refereed limited

Acceptance Rates

Overall acceptance rate: 1,899 of 8,196 submissions, 23%

