ABSTRACT
Statistical language modeling techniques have successfully been applied to large source code corpora, yielding a variety of new software development tools, such as tools for code suggestion, improving readability, and API migration. A major issue with these techniques is that code introduces new vocabulary at a far higher rate than natural language, as new identifier names proliferate. Both large vocabularies and out-of-vocabulary issues severely affect Neural Language Models (NLMs) of source code, degrading their performance and rendering them unable to scale.
In this paper, we address this issue by: 1) studying how various modeling choices impact the resulting vocabulary on a large-scale corpus of 13,362 projects; 2) presenting an open-vocabulary source code NLM that scales to such a corpus, 100 times larger than in previous work; and 3) showing that these models outperform the state of the art on three distinct code corpora (Java, C, Python). To our knowledge, these are the largest NLMs for code that have been reported.
All datasets, code, and trained models used in this work are publicly available.
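To illustrate the vocabulary problem the abstract describes, the sketch below contrasts treating each identifier as one vocabulary entry with splitting identifiers into subword units. It uses a simple camelCase/snake_case splitter as a stand-in for learned subword segmentation (such as byte-pair encoding); the splitter and the example identifiers are illustrative assumptions, not the paper's actual pipeline.

```python
import re

def subword_tokens(identifier):
    # Insert a space at each lowercase/digit-to-uppercase boundary (camelCase),
    # then split on underscores and whitespace (snake_case), lowercasing parts.
    spaced = re.sub(r'([a-z0-9])([A-Z])', r'\1 \2', identifier)
    return [p.lower() for p in re.split(r'[_\s]+', spaced) if p]

# Hypothetical identifiers from a small corpus.
identifiers = [
    "getUserName", "setUserName", "getUserId", "setUserId",
    "userName", "userId", "getName", "setName",
]

# Closed vocabulary: one entry per distinct identifier; grows with the corpus.
full_vocab = set(identifiers)

# Subword vocabulary: entries are shared across identifiers, so it grows
# far more slowly and unseen identifiers can still be composed from it.
sub_vocab = {t for ident in identifiers for t in subword_tokens(ident)}

print(len(full_vocab))   # 8 distinct identifiers
print(sorted(sub_vocab)) # ['get', 'id', 'name', 'set', 'user']
```

Even in this tiny example the subword vocabulary (5 units) is smaller than the identifier vocabulary (8 entries), and a new identifier like `setUserNameField` would introduce at most one new subword rather than a whole new token.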