ABSTRACT
Statistical language modeling techniques have successfully been applied to large source code corpora, yielding a variety of new software development tools, such as tools for code suggestion, improving readability, and API migration. A major issue with these techniques is that code introduces new vocabulary at a far higher rate than natural language, as new identifier names proliferate. Both large vocabularies and out-of-vocabulary issues severely affect Neural Language Models (NLMs) of source code, degrading their performance and rendering them unable to scale.
In this paper, we address this issue by: 1) studying how various modeling choices impact the resulting vocabulary on a large-scale corpus of 13,362 projects; 2) presenting an open-vocabulary source code NLM that scales to such a corpus, 100 times larger than in previous work; and 3) showing that these models outperform the state of the art on three distinct code corpora (Java, C, Python). To our knowledge, these are the largest NLMs for code that have been reported.
All datasets, code, and trained models used in this work are publicly available.
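To illustrate the vocabulary problem the abstract describes, the sketch below contrasts treating each identifier as one vocabulary entry with splitting identifiers into subword units. It uses a simple camelCase/snake_case splitter as a stand-in for learned subword segmentation (such as byte-pair encoding); the splitter and the example identifiers are illustrative assumptions, not the paper's actual pipeline.

```python
import re

def subword_tokens(identifier):
    # Insert a space at each lowercase/digit-to-uppercase boundary (camelCase),
    # then split on underscores and whitespace (snake_case), lowercasing parts.
    spaced = re.sub(r'([a-z0-9])([A-Z])', r'\1 \2', identifier)
    return [p.lower() for p in re.split(r'[_\s]+', spaced) if p]

# Hypothetical identifiers from a small corpus.
identifiers = [
    "getUserName", "setUserName", "getUserId", "setUserId",
    "userName", "userId", "getName", "setName",
]

# Closed vocabulary: one entry per distinct identifier; grows with the corpus.
full_vocab = set(identifiers)

# Subword vocabulary: entries are shared across identifiers, so it grows
# far more slowly and unseen identifiers can still be composed from it.
sub_vocab = {t for ident in identifiers for t in subword_tokens(ident)}

print(len(full_vocab))   # 8 distinct identifiers
print(sorted(sub_vocab)) # ['get', 'id', 'name', 'set', 'user']
```

Even in this tiny example the subword vocabulary (5 units) is smaller than the identifier vocabulary (8 entries), and a new identifier like `setUserNameField` would introduce at most one new subword rather than a whole new token.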