DOI: 10.1145/2976749.2978422
research-article

Statistical Deobfuscation of Android Applications

Published: 24 October 2016

ABSTRACT

This work presents a new approach for deobfuscating Android APKs based on probabilistic learning of large code bases (termed "Big Code"). The key idea is to learn a probabilistic model over thousands of non-obfuscated Android applications and to use this probabilistic model to deobfuscate new, unseen Android APKs. The concrete focus of the paper is on reversing layout obfuscation, a popular transformation which renames key program elements such as classes, packages, and methods, thus making it difficult to understand what the program does. Concretely, the paper: (i) phrases the layout deobfuscation problem of Android APKs as structured prediction in a probabilistic graphical model, (ii) instantiates this model with a rich set of features and constraints that capture the Android setting, ensuring both semantic equivalence and high prediction accuracy, and (iii) shows how to leverage powerful inference and learning algorithms to achieve overall precision and scalability of the probabilistic predictions.
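
The structured-prediction framing in (i) and (ii) can be illustrated with a toy example. The sketch below is not DeGuard's implementation: the weights, candidate names, relation labels, and the distinct-name constraint are all hypothetical, and exhaustive enumeration stands in for the scalable inference algorithms the abstract refers to. It scores each joint assignment of names to obfuscated elements by summing pairwise feature weights over a small dependency graph, and returns the highest-scoring assignment that satisfies the constraint.

```python
from itertools import product

# Hypothetical integer feature weights: (name, relation, name) -> score.
# In a learned model these would come from training on non-obfuscated apps.
WEIGHTS = {
    ("DBHelper", "extends", "SQLiteOpenHelper"): 9,
    ("MainActivity", "extends", "SQLiteOpenHelper"): 1,
    ("db", "field-of", "DBHelper"): 8,
    ("context", "field-of", "DBHelper"): 5,
    ("db", "field-of", "MainActivity"): 2,
    ("context", "field-of", "MainActivity"): 6,
    ("context", "invokes", "getSystemService"): 2,
}

# Dependency graph: "a", "b", "c" are obfuscated elements; the other
# endpoints are known names (e.g. platform API) that are never renamed.
EDGES = [
    ("a", "extends", "SQLiteOpenHelper"),
    ("b", "field-of", "a"),
    ("c", "field-of", "a"),
    ("c", "invokes", "getSystemService"),
]

# Candidate names for each obfuscated element (assumed for illustration).
CANDIDATES = {
    "a": ["DBHelper", "MainActivity"],
    "b": ["db", "context"],
    "c": ["db", "context"],
}

# Constraint: two fields of the same class must not share a name,
# otherwise the renamed program would no longer be equivalent.
MUST_DIFFER = [("b", "c")]


def score(assignment):
    """Sum the weights of all edges under the given name assignment."""
    def name(node):
        # Known nodes are absent from the assignment and keep their own name.
        return assignment.get(node, node)
    return sum(WEIGHTS.get((name(s), rel, name(t)), 0) for s, rel, t in EDGES)


def satisfies_constraints(assignment):
    return all(assignment[x] != assignment[y] for x, y in MUST_DIFFER)


def map_inference(enforce_constraints=True):
    """Exact MAP inference by enumeration (only feasible on tiny graphs)."""
    nodes = list(CANDIDATES)
    best, best_score = None, float("-inf")
    for names in product(*(CANDIDATES[n] for n in nodes)):
        assignment = dict(zip(nodes, names))
        if enforce_constraints and not satisfies_constraints(assignment):
            continue
        s = score(assignment)
        if s > best_score:
            best, best_score = assignment, s
    return best, best_score


if __name__ == "__main__":
    print(map_inference())
    print(map_inference(enforce_constraints=False))
```

In this toy graph the unconstrained optimum assigns the name `db` to both fields; enforcing the constraint yields a slightly lower-scoring but valid assignment, which is the role the abstract attributes to constraints that ensure semantic equivalence.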

We implemented our approach in a tool called DeGuard and used it to: (i) reverse the layout obfuscation performed by the popular ProGuard system on benign, open-source applications, (ii) predict third-party libraries imported by benign APKs (also obfuscated by ProGuard), and (iii) rename obfuscated program elements of Android malware. The experimental results indicate that DeGuard is practically effective: it recovers 79.1% of the program element names obfuscated with ProGuard, it predicts third-party libraries with an accuracy of 91.3%, and it reveals string decoders and classes that handle sensitive data in Android malware.


        Published in
        CCS '16: Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security
        October 2016
        1924 pages
        ISBN:9781450341394
        DOI:10.1145/2976749

        Copyright © 2016 ACM

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Acceptance Rates

        CCS '16 paper acceptance rate: 137 of 831 submissions (16%)
        Overall acceptance rate: 1,261 of 6,999 submissions (18%)
