DOI: 10.1145/3377811.3380342

Big code != big vocabulary: open-vocabulary models for source code

Published: 01 October 2020

ABSTRACT

Statistical language modeling techniques have successfully been applied to large source code corpora, yielding a variety of new software development tools, such as tools for code suggestion, readability improvement, and API migration. A major issue with these techniques is that code introduces new vocabulary at a far higher rate than natural language, as new identifier names proliferate. Both large vocabularies and out-of-vocabulary issues severely affect Neural Language Models (NLMs) of source code, degrading their performance and rendering them unable to scale.
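
To make the vocabulary growth concrete, the following sketch (illustrative only; the snippets and identifier names are invented and this is not the paper's tokenization pipeline) shows how a closed vocabulary grows with every distinct identifier:

```python
# Illustrative sketch, not the paper's pipeline: with a closed vocabulary,
# every previously unseen identifier becomes a new word type.
# The snippets and identifier names below are made up for illustration.
snippets = [
    "int maxRetryCount = 3;",
    "int maxRetryDelayMs = 250;",
    "userSessionTimeoutSeconds = 1800",
]

vocabulary = set()
for snippet in snippets:
    # Crude whitespace/punctuation tokenization, enough to show the effect.
    for token in snippet.replace(";", " ").replace("=", " ").split():
        vocabulary.add(token)

print(len(vocabulary), sorted(vocabulary))
# Each compound identifier (maxRetryCount, maxRetryDelayMs, ...) is its own
# word type, so vocabulary size tracks the number of distinct identifiers
# rather than saturating the way natural-language vocabularies do.
```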

In this paper, we address this issue by: 1) studying how various modeling choices impact the resulting vocabulary on a large-scale corpus of 13,362 projects; 2) presenting an open-vocabulary source code NLM that can scale to such a corpus, 100 times larger than in previous work; and 3) showing that such models outperform the state of the art on three distinct code corpora (Java, C, Python). To our knowledge, these are the largest NLMs for code that have been reported.
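
The open-vocabulary model represents source code as sequences of subword units obtained with byte-pair encoding (BPE), so rare or unseen identifiers are composed from frequent pieces. The sketch below is a minimal, hand-rolled illustration of that idea; the merge table and identifiers are hypothetical, whereas the paper's subword vocabulary is learned from its training corpora:

```python
# Minimal BPE-style segmentation sketch. The merge list is hand-picked for
# illustration; a real open-vocabulary model learns merges from corpus
# statistics rather than by hand.
MERGES = [
    ("m", "a"), ("ma", "x"),
    ("R", "e"), ("Re", "t"), ("Ret", "r"), ("Retr", "y"),
    ("C", "o"), ("Co", "u"), ("Cou", "n"), ("Coun", "t"),
]

def bpe_segment(identifier: str) -> list[str]:
    """Apply the merges in order, BPE style, starting from single characters."""
    pieces = list(identifier)
    for left, right in MERGES:
        i = 0
        while i < len(pieces) - 1:
            if pieces[i] == left and pieces[i + 1] == right:
                pieces[i:i + 2] = [left + right]   # merge the adjacent pair
            else:
                i += 1
    return pieces

# A known identifier segments into a few frequent subwords...
print(bpe_segment("maxRetryCount"))   # ['max', 'Retry', 'Count']
# ...and an unseen one falls back to smaller pieces (characters in the worst
# case), so the model never encounters an out-of-vocabulary token.
print(bpe_segment("maxRetryLimit"))   # ['max', 'Retry', 'L', 'i', 'm', 'i', 't']
```

Because every identifier can be decomposed this way, the vocabulary stays fixed and small even as the corpus grows, which is what allows the model to scale.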

All datasets, code, and trained models used in this work are publicly available.

