ABSTRACT
We address the problem of code search in executables. Given a function in binary form and a large code base, our goal is to statically find similar functions in the code base. Towards this end, we present a novel technique for computing similarity between functions. Our notion of similarity is based on decomposition of functions into tracelets: continuous, short, partial traces of an execution. To establish tracelet similarity in the face of low-level compiler transformations, we employ a simple rewriting engine. This engine uses constraint solving over alignment constraints and data dependencies to match registers and memory addresses between tracelets, bridging the gap between tracelets that are otherwise similar. We have implemented our approach and applied it to find matches in over a million binary functions. We compare tracelet matching to approaches based on n-grams and graphlets and show that tracelet matching obtains dramatically better precision and recall.
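To make the idea of decomposing a function into tracelets concrete, the following is a minimal sketch and not the authors' implementation. It assumes a tracelet can be modeled as a path of exactly k consecutive basic blocks in a control-flow graph, and it compares two tracelets with a plain Wagner-Fischer edit distance over instruction strings. The names `toy_cfg`, `extract_tracelets`, `tracelet_instructions`, and `edit_distance` are hypothetical, and the rewriting engine that matches registers and memory addresses through constraint solving is not modeled here.

```python
from typing import Dict, List, Tuple

# Hypothetical toy CFG: block name -> (instructions, successor block names).
CFG = Dict[str, Tuple[List[str], List[str]]]

toy_cfg: CFG = {
    "entry": (["push ebp", "mov ebp, esp", "cmp eax, 0"], ["then", "else"]),
    "then": (["mov ebx, 1"], ["exit"]),
    "else": (["mov ebx, 2"], ["exit"]),
    "exit": (["pop ebp", "ret"], []),
}


def extract_tracelets(cfg: CFG, k: int) -> List[Tuple[str, ...]]:
    """Return every path of exactly k consecutive basic blocks in the CFG."""
    tracelets: List[Tuple[str, ...]] = []

    def walk(path: List[str]) -> None:
        if len(path) == k:
            tracelets.append(tuple(path))
            return
        # Depth is bounded by k, so loops in the CFG cannot cause infinite recursion.
        for succ in cfg[path[-1]][1]:
            walk(path + [succ])

    for block in cfg:
        walk([block])
    return tracelets


def tracelet_instructions(cfg: CFG, tracelet: Tuple[str, ...]) -> List[str]:
    """Flatten a tracelet into the sequence of instructions it executes."""
    return [ins for block in tracelet for ins in cfg[block][0]]


def edit_distance(a: List[str], b: List[str]) -> int:
    """Wagner-Fischer edit distance between two instruction sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # substitute or match
        prev = cur
    return prev[-1]


if __name__ == "__main__":
    for t in extract_tracelets(toy_cfg, 3):
        print(t, tracelet_instructions(toy_cfg, t))
```

In this toy setting, a lower edit distance between two flattened tracelets indicates greater syntactic similarity. A purely syntactic comparison of this kind would still treat a renamed register as a mismatch, which is the gap the abstract's rewriting step is meant to bridge.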
Tracelet-based code search in executables. PLDI '14.