A framework for the static verification of api calls
Introduction
Automatic program verification tools have had a significant impact on software development, and are increasingly used in practice to eliminate many errors that in the past would have caused program crashes, security vulnerabilities, and program instability (Johnson, 1977, Bush et al., 2000, Ball and Rajamani, 2002, Das et al., 2002, Csallner and Smaragdakis, 2005, Cok and Kiniry, 2005, Barringer et al., 2006). However, two software development trends are now hindering the applicability of automated program verification tools:
- (1) the increasing use of binary-packaged components (for the most part libraries) through their application programming interface (api), and
- (2) the increasing api sophistication, and in particular the embedding of many different domain-specific languages (dsls) as strings in the program code.
Both trends reduce the efficiency of the current approaches. The use of feature-rich libraries in their binary form handicaps verification programs that require access to source code, such as esc/Java (Flanagan et al., 2002), and also programs that contain a fixed set of specific bug patterns, like its4 (Viega et al., 2000). Furthermore, the diversity of the libraries handicaps any tool that depends on a centralized repository of verification patterns. In addition, the embedding of dsls, like sql and xpath, in strings appearing in the program’s source code can introduce bugs that are beyond the reach of the current breed of tools based on approaches like theorem proving (Flanagan et al., 2002), dataflow analysis (Jackson, 1995), and finite state machines (Ball and Rajamani, 2002). To overcome these difficulties we propose a framework for incorporating api call verification code within each library containing the corresponding api implementation. Through the use of reflection techniques, program checkers can invoke this code and verify that the actual arguments presented to api invocations meet the corresponding value constraints.
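The idea of a checker locating library-supplied verification code through reflection can be illustrated with a minimal Java sketch. Everything named here is an assumption made for illustration: the `Channel` api class, the `Verifier` naming suffix, and the convention that a verifier method returns `null` for an acceptable argument are hypothetical and are not taken from the framework itself.

```java
import java.lang.reflect.Method;
import java.util.Arrays;

/** Hypothetical api class shipped by a library. */
class Channel {
    static void open(String mode) { /* the real work would happen here */ }
}

/** Hypothetical companion class, shipped in the same library as Channel. */
class ChannelVerifier {
    /** Returns null when the argument is acceptable, otherwise a diagnostic. */
    public static String open(String mode) {
        return Arrays.asList("r", "w", "rw").contains(mode)
            ? null
            : "Channel.open: invalid mode \"" + mode + "\"";
    }
}

/** Checker side: look up the verifier by name and invoke it reflectively. */
class ApiCallChecker {
    static String check(Class<?> api, String method, String constantArg) {
        try {
            // Assumed convention: the verifier class is named <ApiClass>Verifier.
            Class<?> verifier =
                Class.forName(api.getName() + "Verifier", true, api.getClassLoader());
            Method m = verifier.getMethod(method, String.class);
            return (String) m.invoke(null, constantArg);
        } catch (ClassNotFoundException | NoSuchMethodException e) {
            return null; // the library ships no verifier for this call
        } catch (ReflectiveOperationException e) {
            return "verifier failed: " + e;
        }
    }

    public static void main(String[] args) {
        System.out.println(check(Channel.class, "open", "rw"));
        System.out.println(check(Channel.class, "open", "rx"));
    }
}
```

In this sketch the checker needs no knowledge of the library beyond the naming convention, so a library without a verifier is simply skipped.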
An application programming interface specifies how one software component or system can communicate with a provider of some services. The communication can take the form of a (local or remote) procedure call, a method invocation, an operating system trap, or a web service invocation. The provider of the api functionality can be packaged in the form of a library, a component (Szyperski, 2002), a running process, or an abstract service available over the internet. Widely used apis include those defined in the Single unix Specification, the Microsoft Windows Win32, odbc, and .net apis, the apis defined in the Java 2 Enterprise Edition platform, as well as vertical apis addressing domains such as graphics rendering (Opengl, DirectX), storage devices (atapi, scsi), and network interfaces (ndis).
To substantiate the need for a program verification approach specifically targeting api calls we must look into the number and size of existing apis, their actual utilization in real-life software, the organizational structure of library development, and trends regarding their use. We gathered data from three sources:
- the Freebsd ports collection: a set of more than 10,000 contributed applications, mostly written in C/C++, and organized in a way that allows straightforward installation,
- a number of large and popular Java projects,
- historical implementations of the Emacs editor.
By and large, our results indicate that the number of projects that use external apis (that is, apis that are not by default part of their language or execution environment) is significant, as are the number of these apis and their size. By tracking the dependencies of the Freebsd packages we established that the 12,357 ports packages in our Freebsd 4.11 system had in total 21,135 library dependencies; i.e., they required a library, other than the 52 libraries that are part of the base system, in order to compile. The library dependencies comprised 688 different libraries, while the number of different external libraries used by a single project varied between 1 and 38, with a mode value of 2. Furthermore, 5117 projects used at least one external library and 405 projects used 10 or more.
We tracked the use of foreign apis in some large representative Java projects by analyzing all the Java archive files comprising the project’s binary distribution, and categorizing the class files they included according to their package. Those whose package was the same as the project (for example, org.eclipse for Eclipse) were categorized as “own”, the rest as “foreign”. The results are summarized in Table 1. Note that the numbers in the table do not include the Java runtime classes, because these are by default part of the runtime environment. Evidently, many projects depend on library code for a large part of their functionality.
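The own/foreign categorization described above can be sketched in Java as follows. This is an illustrative reconstruction under stated assumptions, not the tooling used for Table 1: the class and method names are hypothetical, and a class counts as “own” simply when its archive entry name falls under a given package prefix.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;

class ForeignClassCounter {
    /** True if the class file entry lives under the project's own package prefix. */
    static boolean isOwn(String entryName, String ownPrefix) {
        String className = entryName.replace('/', '.');
        return className.startsWith(ownPrefix + ".");
    }

    /** Returns {own, foreign} counts for the .class entries among the given names. */
    static int[] tally(Iterable<String> entryNames, String ownPrefix) {
        int own = 0, foreign = 0;
        for (String name : entryNames) {
            if (!name.endsWith(".class")) continue; // skip manifests and resources
            if (isOwn(name, ownPrefix)) own++; else foreign++;
        }
        return new int[] {own, foreign};
    }

    /** Lists the entry names of one Java archive. */
    static List<String> entryNames(String jarPath) throws IOException {
        List<String> names = new ArrayList<>();
        try (JarFile jar = new JarFile(jarPath)) {
            Enumeration<JarEntry> entries = jar.entries();
            while (entries.hasMoreElements()) names.add(entries.nextElement().getName());
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.err.println("usage: ForeignClassCounter <own.package.prefix> <archive.jar>...");
            return;
        }
        int own = 0, foreign = 0;
        for (int i = 1; i < args.length; i++) {
            int[] t = tally(entryNames(args[i]), args[0]);
            own += t[0];
            foreign += t[1];
        }
        System.out.println("own=" + own + " foreign=" + foreign);
    }
}
```

For Eclipse, for example, the prefix argument would be `org.eclipse`, followed by the archives of the binary distribution.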
The numbers we collected also point toward a highly decentralized organizational structure under which libraries are developed. The 111,321 foreign classes appearing in Table 1 are not unique, because the same classes may be used in multiple projects or subprojects; in our data set the class org.apache.commons.logging.LogSource was used 20 times. Nevertheless, the projects use among them 60,273 unique classes outside their own domain. Most of the classes belong to packages named according to Sun’s conventions: the name’s first elements define the organization behind the package. By looking at the first two elements of the package names we found 66 different entities behind the packages, like com.ibm or org.jboss. Clearly, any proposal for handling api call verification must take this diversity into account.
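Grouping classes by the first two elements of their package name, as done above to identify the entities behind the packages, might look like the following Java sketch. The class name `PackageOrigins` and the sample class names under `com.ibm` and `org.jboss` are hypothetical and serve only as an illustration.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Set;
import java.util.TreeSet;

class PackageOrigins {
    /** Returns the first two package elements of a fully qualified class name. */
    static String organization(String className) {
        int lastDot = className.lastIndexOf('.');
        if (lastDot < 0) return ""; // class in the default package
        String[] pkg = className.substring(0, lastDot).split("\\.");
        return pkg.length >= 2 ? pkg[0] + "." + pkg[1] : pkg[0];
    }

    /** Collects the distinct organizations behind a collection of class names. */
    static Set<String> organizations(Collection<String> classNames) {
        Set<String> orgs = new TreeSet<>();
        for (String name : classNames) orgs.add(organization(name));
        return orgs;
    }

    public static void main(String[] args) {
        // Duplicates are harmless: the set keeps each organization once.
        System.out.println(organizations(Arrays.asList(
            "org.apache.commons.logging.LogSource",
            "org.apache.commons.logging.LogSource",
            "com.ibm.example.Foo",
            "org.jboss.example.Bar")));
    }
}
```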
The previous two paragraphs indirectly support our claim regarding the size of existing apis through the large number of foreign classes used by the Java projects. We can further substantiate the size of existing apis by looking at the number of functions and methods available in some modern environments.
- The Single unix Specification version 2 identifies as interfaces 725 functions and macros.
- The Windows api list distributed with Visual C/C++ 5.0 contains 3777 elements listed as dll functions.
- The Microsoft .net Framework 1.1 documents 3136 types and 15,724 methods.
- The Java 1.5.0 runtime environment has 6520 public classes that contain 52,743 public methods and constructors.
In fact, such is the size and complexity of modern apis that a recent paper proposes a system for mining api usage patterns from a corpus of sample programs, in order to help programmers navigate these increasingly large and convoluted apis (Mandelin et al., 2005).
Finally, anyone who has been writing software for the last 20 years will readily attest that software systems are increasingly being fleshed out with third-party components used through api calls. It is instructive to witness this trend in action. Table 2 details the total number of imported functions or methods used by three different editors. The text-based gnu Emacs editor derives from a 1985 codebase, and provides a feature-rich editing environment while using just 121 elements from the operating system and the C library. Released about 10 years later, XEmacs uses 406 elements to provide an X Window System gui, while nowadays jEdit uses 2927 Java methods to provide a similar interface with almost 100 thousand fewer lines of code.
A domain-specific language is tailored specifically to an application domain: rather than being general purpose its aim is to capture precisely a domain’s semantics. Examples of domain-specific languages are bnf grammar specifications and regular expressions, as used for example by the yacc and lex tools for generating lexical analyzers and parsers (Johnson and Lesk, 1987), the sql database definition and manipulation language, and xslt, the xml document transformation language. Often dsl fragments are directly embedded into the source code of a general purpose language (Spinellis, 2001), in many cases as strings. Apart from the ubiquitous sql statements found in any typical database client, other dsls that appear as strings in code are regular expressions, output formatting specifications, various applications of xml, xpath queries, and urls. The verification of code written in these dsls is both important and worthy of an explicitly targeted approach.
Although some of the dsls we have described may appear trivial, they are not. Even a url is defined by a fairly extensive syntax, and specific url schemes can have very precise rules for the elements that appear in them. Java’s Generic Connection Framework defines a bnf grammar for a number of schemes, allowing for example the connection to a Bluetooth gps receiver using a url like
btspp://000A5600F776:1;authenticate=false;encrypt=false;master=true;authorize=true
The name-value pairs in the above url are precisely defined, and a spelling error in one of them would result in a runtime error. In fact, a spelling mistake is not the only error that can occur within a dsl string embedded in a general purpose language. Other error classes include the following.
Syntax Error: Some dsls, like sql, are defined by an extensive syntax, and it is easy for a programmer to write an invalid statement.
Internationalization Problem: Regular expressions and format strings, both defined through a dsl, can often contain assumptions that can make a program difficult to localize. As an example, a regular expression containing the sequence [A-Z] to specify an uppercase letter will only work with ascii characters, and should probably be changed to specify a Unicode category through a sequence like \p{Lu}.
Portability Problem: Like general purpose programming languages, many of the dsls have been implemented or extended in a number of non-standard and incompatible ways. A programmer may unwittingly use such an extension, and thus burden the program with unintended portability restrictions that will cause problems in the future. As an example, our prototype tool flagged the following sql statement appearing in Openwfe as wrong.
SELECT workitem_id, action, arg
FROM action where msg_err is null
Although not readily apparent, the error in the above statement is the use of action, an sql reserved word, as an identifier.
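Returning to the internationalization problem listed above, the difference between the `[A-Z]` and `\p{Lu}` sequences can be demonstrated with Java's `java.util.regex` package. The class and method names in this sketch are hypothetical; the non-ascii sample characters are written as Unicode escapes so that the example does not depend on the source file encoding.

```java
import java.util.regex.Pattern;

class UppercaseRegexDemo {
    // Matches only the 26 ascii uppercase letters.
    static final Pattern ASCII_UPPER = Pattern.compile("[A-Z]");
    // Matches any character in the Unicode "uppercase letter" category.
    static final Pattern UNICODE_UPPER = Pattern.compile("\\p{Lu}");

    static boolean isAsciiUpper(String s) {
        return ASCII_UPPER.matcher(s).matches();
    }

    static boolean isUnicodeUpper(String s) {
        return UNICODE_UPPER.matcher(s).matches();
    }

    public static void main(String[] args) {
        // Latin A, Latin E with acute (U+00C9), Greek Delta (U+0394), Latin a
        String[] samples = {"A", "\u00C9", "\u0394", "a"};
        for (String s : samples) {
            System.out.println(s + " ascii=" + isAsciiUpper(s)
                + " unicode=" + isUnicodeUpper(s));
        }
    }
}
```

Both patterns accept `A` and reject `a`, but only the Unicode category accepts the accented and Greek capitals, which is the kind of assumption a dsl-aware verifier could flag.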
All the dsls we have outlined are beyond the reach of general purpose program verification tools, because they appear in the program code as untyped strings. Errors in the dsls can be, and are, caught by special-purpose approaches that target the specific dsls. However, such schemes are inherently difficult to scale: they must incorporate special-purpose checking code for every dsl in existence. Furthermore, the special-purpose verification code may well end up duplicating functionality available in the actual dsl implementation, such as a parser. For these reasons we believe that api verifiers should not be implemented as part of verification tools, but should be incorporated into the api libraries.
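The argument that a verifier living inside the library can reuse the dsl implementation, rather than duplicate it, can be illustrated with regular expressions in Java. In this sketch the verifier simply hands a string constant to the library's own `Pattern.compile` parser; the `RegexArgumentVerifier` class and its `null`-means-valid convention are hypothetical.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class RegexArgumentVerifier {
    /**
     * Checks a string constant that a program passes as a regular expression.
     * Returns null if it compiles, otherwise a diagnostic message.
     */
    static String verify(String regexConstant) {
        try {
            // Reuse the library's own parser instead of writing a second one.
            Pattern.compile(regexConstant);
            return null;
        } catch (PatternSyntaxException e) {
            return "invalid regular expression: " + e.getDescription()
                + " near index " + e.getIndex();
        }
    }

    public static void main(String[] args) {
        System.out.println(verify("[a-z]+")); // well-formed
        System.out.println(verify("[a-z"));   // unclosed character class
    }
}
```

A static checker would call such a method at analysis time with each argument whose value it can determine, so the error surfaces before the program is ever run.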
Section snippets
Research context
Our approach to program verification falls within the domain of static program analysis. This involves the analysis of program code for certain properties without executing it; usually, it is performed at compile time. Errors discovered late cost much more than errors discovered early in the development process (Fagan, 1976). Static analysis aims at lowering development costs by eliminating problem spots as early as possible.
Before we examine static program analysis methods, let us note that a
Adjunct verification code
Having established in Section 1.1 that programs increasingly utilize complicated apis from a large and diverse set of third-party libraries, it is easy to see that api-specific verification code should be tied to the library providing the actual implementation. We therefore propose that every api implementation should carry with it functionality for the static verification of api calls at compile time. Verification tools can then tap into this functionality to extend their reach into the—now
Implementation examples
To validate the feasibility of our approach, we designed the application of our verification framework to Java methods, added api verification functionality to the FindBugs tool (Hovemeyer and Pugh, 2004), and wrote verifiers for a small number of Java classes. Note that none of the above steps characterizes our approach. Our framework can be applied to different languages, can be integrated with other tools, and, of course, it can support a large number of api verification functions.
Empirical evaluation
We ran our api verifier on the compiled code (application-specific and accompanying libraries) of the eight packages listed in Table 3. The Java archives we checked comprised 353 mb (about 13 mloc), and FindBugs invoking only our api verifier took 68 minutes on a dual-cpu 2.2 GHz amd Opteron computer running the Java HotSpot 64-bit server 1.5.0 virtual machine. Thus, the bytecode throughput of the api verification was about 86 kb/s, comparable to that of a Java compiler running on the same hardware
Discussion
The implementation of our api verification framework, and its application on real-world code, taught us a number of valuable lessons. Some apply to our framework in general, while others are associated with FindBugs, which we chose as our implementation platform.
The imperative code we used for expressing the api verification functionality proved to be efficient in terms of code size and performance, reliable, and easy to apply. In Section 4.3 we wrote that implementing a verification method
Conclusions and further work
Our api verification framework is clearly complementary to other existing code verification approaches, such as runtime checks. Our approach can be integrated in the compilation cycle to catch some bugs early on. Runtime checks can potentially catch a wider range of errors, but they can only be performed slightly later in the development cycle: at the earliest during unit testing. Because our approach does not depend on a specific tool, and it allows verification code to be embedded within a
Acknowledgements
We would like to thank the authors of FindBugs for the work they put into the platform, and in particular Dave Brosius for his help in integrating our type-inference patches that made it possible to implement our tool. We also thank the paper’s anonymous referees for many detailed and perceptive comments.
References (53)
- et al. Combining test case generation and runtime verification. Theor. Comput. Sci. (2005)
- et al. Monitoring interfaces for faults. Electron. Notes Theor. Comput. Sci. (2006)
- Notable design patterns for domain specific languages. J. Syst. Software (2001)
- et al. The Java Programming Language (2005)
- Ball, T., Rajamani, S.K., 2002. The SLAM project: debugging system software via static analysis. In: POPL’02:...
- Ball, T., Cook, B., Levin, V., Rajamani, S.K., 2004. SLAM and static driver verifier: technology transfer of formal...
- et al. A technique for finding storage allocation errors in C-language programs. SIGPLAN Notices (1982)
- Barringer, H., Finkbeiner, B., Gurevich, Y., Sipma, H.B. (Eds.), 2006. Proceedings of the Fifth Workshop on Runtime...
- et al. Test infected: programmers love writing tests. Java Report (1998)
- Blanchet, B., Cousot, P., Cousot, R., Feret, J., Mauborgne, L., Miné, A., Monniaux, D., Rival, X., 2003. A static...
- An overview of JML tools and applications. Int. J. Software Tools Technol. Transfer
- A static analyzer for finding dynamic programming errors. Software: Pract. Exp.
- ESC/Java2: Uniting ESC/Java and JML - progress and issues in building and using ESC/Java2
- PMD Applied
- Byte code engineering
- Improving security using extensible lightweight static analysis. IEEE Software
- Design and code inspections to reduce errors in program development. IBM Syst. J.