DOI: 10.1145/3533767.3534220 (ISSTA 2022 conference proceedings)
Research article · Open Access

DocTer: documentation-guided fuzzing for testing deep learning API functions

Published: 18 July 2022

ABSTRACT

Input constraints are useful for many software development tasks. For example, the input constraints of a function enable the generation of valid inputs, i.e., inputs that follow these constraints, to test the function more deeply. API functions of deep learning (DL) libraries have DL-specific input constraints, which are described informally in free-form API documentation. Existing constraint-extraction techniques are ineffective at extracting these DL-specific input constraints.
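To make the notion of a DL-specific input constraint concrete, the sketch below models a constraint that might be extracted from a documentation sentence such as "input must be a 2-D tensor of type float32". The dictionary fields (`ndim`, `dtype`) and the checker are illustrative assumptions for this example, not DocTer's actual constraint representation.

```python
# Hypothetical constraint extracted from free-form documentation, e.g.
# "input must be a 2-D tensor of type float32".
constraint = {"ndim": 2, "dtype": "float32"}

def is_valid(value_ndim, value_dtype, c):
    """Check a candidate value's rank and dtype against the constraint."""
    return value_ndim == c["ndim"] and value_dtype == c["dtype"]

print(is_valid(2, "float32", constraint))  # satisfies both constraints
print(is_valid(1, "float32", constraint))  # violates the rank constraint
```

A valid input satisfies every extracted constraint; deliberately violating one constraint yields an invalid input that exercises the function's error handling.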

To fill this gap, we design and implement a new technique—DocTer—to analyze API documentation to extract DL-specific input constraints for DL API functions. DocTer features a novel algorithm that automatically constructs rules to extract API parameter constraints from syntactic patterns in the form of dependency parse trees of API descriptions. These rules are then applied to a large volume of API documents in popular DL libraries to extract their input parameter constraints. To demonstrate the effectiveness of the extracted constraints, DocTer uses the constraints to enable the automatic generation of valid and invalid inputs to test DL API functions.
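The constraint-guided generation step described above can be sketched as follows. This is a minimal, hedged illustration: the constraint fields (`ndim`, `range`) and the generation strategy are assumptions made for this example, not DocTer's actual implementation.

```python
import random

def gen_valid(constraint, rng=random.Random(0)):
    """Generate a nested list whose rank matches the 'ndim' constraint,
    with element values drawn from the 'range' constraint."""
    lo, hi = constraint.get("range", (0.0, 1.0))

    def build(depth):
        if depth == 0:
            return rng.uniform(lo, hi)
        return [build(depth - 1) for _ in range(2)]

    return build(constraint["ndim"])

def gen_invalid(constraint, rng=random.Random(1)):
    """Produce an invalid input by deliberately violating one constraint:
    here, using the wrong tensor rank."""
    bad = dict(constraint, ndim=constraint["ndim"] + 1)
    return gen_valid(bad, rng)

valid_input = gen_valid({"ndim": 2, "range": (-1.0, 1.0)})   # 2-D nested list
invalid_input = gen_invalid({"ndim": 2, "range": (-1.0, 1.0)})  # wrong rank
```

In a real fuzzing run, the valid inputs would be converted to library tensors and passed to the API under test, while the invalid inputs probe whether the function rejects them gracefully rather than crashing.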

Our evaluation on three popular DL libraries (TensorFlow, PyTorch, and MXNet) shows that DocTer’s precision in extracting input constraints is 85.4%. DocTer detects 94 bugs from 174 API functions, including one previously unknown security vulnerability that is now documented in the CVE database, while a baseline technique without input constraints detects only 59 bugs. Most (63) of the 94 bugs are previously unknown, 54 of which have been fixed or confirmed by developers after we report them. In addition, DocTer detects 43 inconsistencies in documents, 39 of which are fixed or confirmed.

