DocTer: Documentation-Guided Fuzzing for Testing Deep Learning API Functions

ABSTRACT
Input constraints are useful for many software development tasks. For example, the input constraints of a function enable the generation of valid inputs, i.e., inputs that follow these constraints, to test the function more thoroughly. API functions of deep learning (DL) libraries have DL-specific input constraints, which are described informally in free-form API documentation. Existing constraint-extraction techniques are ineffective for extracting DL-specific input constraints.
To fill this gap, we design and implement a new technique—DocTer—to analyze API documentation to extract DL-specific input constraints for DL API functions. DocTer features a novel algorithm that automatically constructs rules to extract API parameter constraints from syntactic patterns in the form of dependency parse trees of API descriptions. These rules are then applied to a large volume of API documents in popular DL libraries to extract their input parameter constraints. To demonstrate the effectiveness of the extracted constraints, DocTer uses the constraints to enable the automatic generation of valid and invalid inputs to test DL API functions.
Our evaluation on three popular DL libraries (TensorFlow, PyTorch, and MXNet) shows that DocTer’s precision in extracting input constraints is 85.4%. DocTer detects 94 bugs from 174 API functions, including one previously unknown security vulnerability that is now documented in the CVE database, while a baseline technique without input constraints detects only 59 bugs. Most (63) of the 94 bugs are previously unknown, 54 of which have been fixed or confirmed by developers after we report them. In addition, DocTer detects 43 inconsistencies in documents, 39 of which are fixed or confirmed.
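The pipeline summarized above can be sketched end to end: map a free-form parameter description to a structural constraint, then use that constraint to generate a conforming input. This is a minimal illustration, not DocTer's implementation; DocTer mines its extraction rules from dependency parse trees of API descriptions, whereas the hand-written regex patterns, the constraint-dict schema, and the function names `extract_constraints` and `gen_valid_input` below are all assumptions for the sketch.

```python
import re
import numpy as np

# Illustrative patterns only: DocTer's real rules are constructed automatically
# from dependency-parse-tree patterns, not hand-written regexes like these.
PATTERNS = [
    # "2-D" / "2D" -> number of dimensions
    (re.compile(r"(\d+)-?D", re.I), lambda m, c: c.update(ndim=int(m.group(1)))),
    # an explicit dtype name -> dtype constraint
    (re.compile(r"\b(float32|float64|int32|int64)\b"),
     lambda m, c: c.update(dtype=m.group(1))),
]

def extract_constraints(description):
    """Map a free-form parameter description to a constraint dict."""
    constraint = {}
    for pattern, apply_rule in PATTERNS:
        match = pattern.search(description)
        if match:
            apply_rule(match, constraint)
    return constraint

def gen_valid_input(constraint, rng=None):
    """Generate a random array that satisfies the extracted constraint."""
    rng = rng or np.random.default_rng(0)
    dtype = np.dtype(constraint.get("dtype", "float32"))
    # Pick a concrete shape with the required number of dimensions.
    shape = tuple(rng.integers(1, 4, size=constraint.get("ndim", 1)))
    if np.issubdtype(dtype, np.integer):
        return rng.integers(0, 10, size=shape, dtype=dtype)
    return rng.uniform(0.0, 1.0, size=shape).astype(dtype)
```

For example, `extract_constraints("a 2-D Tensor of type float32")` yields `{"ndim": 2, "dtype": "float32"}`, and `gen_valid_input` then produces a random float32 matrix that can be passed to the API under test; an invalid input is obtained by deliberately violating one field of the dict.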