ABSTRACT
This work presents a new approach for deobfuscating Android APKs based on probabilistic learning from large code bases (termed "Big Code"). The key idea is to learn a probabilistic model over thousands of non-obfuscated Android applications and to use this model to deobfuscate new, unseen Android APKs. The paper focuses on reversing layout obfuscation, a popular transformation that renames key program elements such as classes, packages, and methods, making it difficult to understand what the program does. Concretely, the paper: (i) phrases the layout deobfuscation problem for Android APKs as structured prediction in a probabilistic graphical model, (ii) instantiates this model with a rich set of features and constraints that capture the Android setting, ensuring both semantic equivalence and high prediction accuracy, and (iii) shows how to leverage powerful inference and learning algorithms to make the probabilistic predictions both precise and scalable.
We implemented our approach in a tool called DeGuard and used it to: (i) reverse the layout obfuscation performed by the popular ProGuard system on benign, open-source applications, (ii) predict third-party libraries imported by benign APKs (also obfuscated by ProGuard), and (iii) rename obfuscated program elements of Android malware. The experimental results indicate that DeGuard is effective in practice: it recovers 79.1% of the program element names obfuscated with ProGuard, it predicts third-party libraries with an accuracy of 91.3%, and it reveals string decoders and classes that handle sensitive data in Android malware.
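To make the structured-prediction formulation more concrete, the following is a minimal sketch in Python, not DeGuard's actual implementation. It assumes a tiny dependency graph in which obfuscated fields (`a`, `b`) are connected to known, non-obfuscated API names by relations such as `calls`. The weight table, the candidate names, the relation labels, and the greedy search are all hypothetical and chosen only for illustration; a real system would learn the weights from non-obfuscated applications. The constraint that two fields of the same class must receive distinct names stands in for the semantic-equivalence constraints mentioned above.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str, str]  # (node_a, relation, node_b)

# Hypothetical learned weights: (name_a, relation, name_b) -> score.
WEIGHTS: Dict[Edge, float] = {
    ("db", "calls", "getWritableDatabase"): 0.9,
    ("db", "calls", "rawQuery"): 0.7,
    ("cursor", "assigned_from", "rawQuery"): 0.8,
    ("cursor", "calls", "getCount"): 0.6,
    ("db", "calls", "getCount"): 0.1,
}

# Hypothetical candidate names for obfuscated elements.
CANDIDATES: List[str] = ["db", "cursor", "helper"]


def score(assignment: Dict[str, str], edges: List[Edge]) -> float:
    """Sum the pairwise weights of all edges under a name assignment.

    Nodes missing from the assignment are known elements and keep their name.
    """
    total = 0.0
    for a, rel, b in edges:
        key = (assignment.get(a, a), rel, assignment.get(b, b))
        total += WEIGHTS.get(key, 0.0)
    return total


def violates(assignment: Dict[str, str], distinct_groups: List[List[str]]) -> bool:
    """Return True if two elements that must differ share a predicted name."""
    for group in distinct_groups:
        names = [assignment[n] for n in group]
        if len(names) != len(set(names)):
            return True
    return False


def greedy_map(unknown: List[str], edges: List[Edge],
               distinct_groups: List[List[str]], max_passes: int = 10) -> Dict[str, str]:
    """Approximate MAP inference: relabel one node at a time while the score improves."""
    # Start from the obfuscated names, which satisfy the constraints.
    assignment = {n: n for n in unknown}
    for _ in range(max_passes):
        changed = False
        for node in unknown:
            best_name = assignment[node]
            best_score = score(assignment, edges)
            for cand in CANDIDATES:
                trial = dict(assignment)
                trial[node] = cand
                if violates(trial, distinct_groups):
                    continue  # skip renamings that would break the program
                s = score(trial, edges)
                if s > best_score:
                    best_name, best_score = cand, s
            if best_name != assignment[node]:
                assignment[node] = best_name
                changed = True
        if not changed:
            break
    return assignment


# Two obfuscated fields of one class, linked to known Android API methods.
edges = [
    ("a", "calls", "getWritableDatabase"),
    ("a", "calls", "rawQuery"),
    ("b", "assigned_from", "rawQuery"),
    ("b", "calls", "getCount"),
]
result = greedy_map(["a", "b"], edges, distinct_groups=[["a", "b"]])
print(result)
```

In this toy setting the search renames `a` to `db` and `b` to `cursor`. The greedy loop is only one simple way to approximate the most likely joint assignment; the constraint check is what keeps every intermediate assignment a valid renaming.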