To extract slices and convert them into a format that the model can receive, the source code must first be parsed.
4.2.3. Context Slicing
In other research on source code vulnerability detection, the slicing process mostly uses the slicing method proposed by SySeVR [26], which, for clarity, we refer to as BFslicing (backward and forward slicing).
The selection of the slicing criterion is the starting point for slicing operations. Other works, including SySeVR, typically choose four types of slicing criteria: sensitive API calls, pointer usage, array access, and arithmetic expressions. Vulnerabilities often occur when these operations are mishandled, such as when dangerous library functions are used or incorrect array accesses lead to out-of-bounds errors. Additionally, CMFVD introduces a new type of slicing criterion called ‘extended sensitive functions’. Similar to sensitive API calls, many programmers implement their own memory-manipulation functions with custom names that do not match the patterns of sensitive API calls. Such functions are numerous, and if they are missed, the model cannot fully learn the patterns of vulnerability features. In total, CMFVD defines five types of slicing criteria, summarized in Table 1.
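To make the new criterion concrete, the following Python sketch illustrates one way such user-defined wrappers could be flagged by name; the patterns below are our own illustrative assumptions, not the actual matching rules, which are summarized in Table 1.

import re

# Illustrative (assumed) name patterns for user-defined wrappers around
# memory and string primitives, e.g., my_memcpy, SafeStrCopy, buf_alloc.
EXTENDED_SENSITIVE = re.compile(r"mem|str|buf|alloc|copy|free", re.IGNORECASE)

def is_extended_sensitive(function_name):
    # A user-defined function whose name embeds a memory/string primitive
    # is treated as an 'extended sensitive function' slicing criterion.
    return bool(EXTENDED_SENSITIVE.search(function_name))

print(is_extended_sensitive("my_memcpy"))    # True
print(is_extended_sensitive("compute_sum"))  # False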
BFslicing starts from the slicing point and recursively adds new nodes to the final slicing result using both forward and backward slicing until no new nodes can be added, at which point the slicing process stops.
By observing a large number of slicing instances, we noticed a phenomenon: nodes added at deeper levels are typically connected to existing nodes through data dependency relationships, and they primarily reflect the semantic flow of data from input to usage. For both vulnerable and non-vulnerable code, the characteristics of this input data flow are often similar: regardless of whether a vulnerability is present, the received input reaches the slicing point along the same data flow path. In other words, the slices obtained for vulnerable and non-vulnerable code contain a great deal of redundant content, and this low-discriminative information stems from the input data dependency chain. Such redundant slices are filled with useless noise; their low discriminative power not only increases the computational burden during model training but also degrades prediction performance.
To address this issue, we observe that the characteristic patterns of vulnerabilities are concentrated in the vicinity of vulnerable lines, i.e., they exist in the context around those lines. Building on this observation, we propose an approach called context slicing.
After obtaining the corresponding program dependence graph (PDG), we construct the corresponding edge tree data structure based on the method described in the previous section. Utilizing this edge tree data structure, we perform context slicing. Initially, in the PDG, we choose any edge connected to the slicing point (representing a line of code in the graph) and designate its corresponding node in the edge tree as the starting point. Subsequently, we recursively add nodes directly or indirectly adjacent to the current node to the context slicing result until no new nodes can be added. Afterward, the nodes obtained from the edge tree are mapped back to the corresponding edges in the PDG. We then select the PDG nodes connected to these edges and obtain the final context slice after removing duplicates. Algorithm 1 outlines the process of context slicing.
Algorithm 1: Context Slicing
Input: edge tree ET = {e1, e2, …, en}; an edge ek in ET incident to the slicing point
Output: context_SDG

result ← {ek}
queue ← {ek}
while queue ≠ ∅ do
  take an edge ei from queue
  for ej in get_direct_or_indirect_neighbors(ei) do
    if ej not in result then
      result.add(ej)
      queue.add(ej)
    end if
  end for
end while
context_SDG ← ∅
for ei in result do
  for node in get_successors_and_predecessors(ei) do
    if node not in context_SDG then
      context_SDG.add(node)
    end if
  end for
end for
return context_SDG
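As a reading aid, the following is a minimal Python sketch of Algorithm 1, assuming the edge tree is supplied as an adjacency map in which each edge-tree node stands for a PDG edge given as a (start, end) pair; the function and variable names are illustrative rather than taken from CMFVD’s implementation.

from collections import deque

def context_slice(edge_tree_neighbors, start_edge):
    # Worklist traversal of the edge tree from the starting edge: collect
    # every edge directly or indirectly adjacent to it.
    result = {start_edge}
    queue = deque([start_edge])
    while queue:
        ei = queue.popleft()
        for ej in edge_tree_neighbors.get(ei, []):
            if ej not in result:
                result.add(ej)
                queue.append(ej)  # keep expanding until no new edges appear
    # Map each selected edge back to its start and end nodes in the PDG.
    context_sdg = set()
    for start, end in result:
        context_sdg.add(start)
        context_sdg.add(end)
    return sorted(context_sdg)

# Example on a hypothetical edge tree: the root edge (6, 9) is adjacent to
# (5, 9), which is in turn adjacent to (4, 5).
neighbors = {(6, 9): [(5, 9)], (5, 9): [(6, 9), (4, 5)], (4, 5): [(5, 9)]}
print(context_slice(neighbors, (6, 9)))  # [4, 5, 6, 9]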
Next, we will illustrate the specific processes of BFslicing and context slicing using a practical example, as well as highlight the differences between these two methods.
Figure 4a shows the sample code along with its corresponding program dependence graph (PDG), and we choose the ninth line of code as the slicing point.
In BFslicing, we start from slicing point 9 and recursively apply forward and backward slicing to add new nodes. At node 9, forward slicing cannot add any new nodes, while backward slicing appends nodes 5 and 6. Subsequently, we apply backward slicing again at nodes 5 and 6. According to the backward slicing rules, node 6 cannot append any new nodes; however, through backward slicing, we can append node 4 after node 5. Then, when using backward slicing again at node 4, no new nodes can be appended, concluding the entire slicing process. The final result of forward and backward slicing is [4, 5, 6, 9].
Figure 4b illustrates this process.
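For illustration, the following minimal Python sketch reproduces this BFslicing behavior on the data dependencies exercised in the walkthrough (4→5, 5→9, and 6→9); the names and the adjacency-map representation are our own assumptions, not CMFVD’s implementation.

from collections import deque

def bf_slice(successors, predecessors, criterion):
    # Forward pass: nodes reached by forward slicing are only expanded forward.
    result = {criterion}
    queue = deque([criterion])
    while queue:
        node = queue.popleft()
        for succ in successors.get(node, []):
            if succ not in result:
                result.add(succ)
                queue.append(succ)
    # Backward pass: nodes reached by backward slicing are only expanded backward.
    reached_backward = {criterion}
    queue = deque([criterion])
    while queue:
        node = queue.popleft()
        for pred in predecessors.get(node, []):
            if pred not in reached_backward:
                reached_backward.add(pred)
                queue.append(pred)
    return sorted(result | reached_backward)

# Data dependencies exercised in the walkthrough above: 4 -> 5, 5 -> 9, 6 -> 9.
successors = {4: [5], 5: [9], 6: [9]}
predecessors = {5: [4], 9: [5, 6]}
print(bf_slice(successors, predecessors, 9))  # [4, 5, 6, 9]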
For context slicing, we first construct the corresponding edge tree based on the program dependence graph (PDG). We select an edge directly connected to the slicing point as the root of the edge tree. Based on the adjacency relation of the edge tree described earlier, the edges that satisfy this relation with the root become its children in the edge tree, and the remaining edges are attached level by level in the same manner. Next, we perform context slicing on the edge tree. Following the algorithm for context slicing, we choose the nodes in the edge tree that have a direct or indirect relation with the root, obtaining the set of selected edges. Finally, we map the edge tree slicing results back to the PDG. The specific method involves selecting the start and end nodes of each edge chosen during the edge tree slicing process and adding them to the final slicing results. For example, for edge (6, 9), we add nodes 6 and 9 to the final slicing results; for edge (5, 9), we should add nodes 5 and 9, but since node 9 already exists, we do not duplicate it. Summing up and removing duplicates yields the slicing result [4, 5, 6, 9, 10, 15, 17].
Figure 4c illustrates the process of context slicing on the example code.
Through the above example, it is evident that, compared to BFslicing, context slicing selects three additional nodes: [10, 15, 17]. The lines of code corresponding to these three nodes involve write operations on the variable “data”. For a pointer pointing to an array, understanding write operations on the corresponding memory space is crucial, as memory errors often result from illegal write operations, leading to vulnerabilities. However, BFslicing does not include any information related to write operations on the data pointer.
The reason BFslicing misses these lines of code is clear: nodes newly added through forward slicing will only undergo further forward slicing, and the same applies to nodes added through backward slicing. These lines of code are located after line 9, but forward slicing at PDG node 9 cannot obtain any new nodes, so this crucial information is lost. This precisely reveals the shortcoming of forward and backward slicing: if a node cannot add new nodes through forward or backward slicing during the slicing process, the slicing chain breaks. In such cases, even significant nodes may be lost, because their predecessor or successor nodes were not appended in the previous round of slicing.
In contrast, context slicing shifts the primary focus of slicing from nodes to edges: it completes the slicing process within the edge tree and then maps the selected edges back to PDG nodes to obtain the final slicing results, thereby overcoming this limitation and improving accuracy and reliability. Moreover, it is evident that BFslicing produces a “long and slim” slicing result, as it traces the initial input and final usage of variables through data dependency relationships; regardless of whether the related variables lead to vulnerabilities through erroneous operations, their input data dependency chains remain similar. Context slicing, by contrast, produces a “short and wide” slicing result, concentrating on lines of code closely related to vulnerability patterns as vulnerability context information.
Moreover, the example code intuitively shows that, when the code is not particularly deep or complex, context slicing often subsumes forward and backward slicing: in this example, all of the lines of code obtained through BFslicing are also present in the results of context slicing.
After obtaining the context slices, we save the lines of code in the source file that correspond to the selected nodes for later use in graph node embedding and as sequence network input. These lines of code are stored in text form, as in natural language processing tasks, and therefore require preprocessing before use.
First, it is necessary to remove comments from the code lines, since comments are hints that help programmers understand the logic and are human-readable natural language rather than code. Next, the code strings undergo tokenization. Specifically, tokenization transforms a code string into a sequence of tokens composed of the smallest syntactic units, including variable names, operators, constants, and keywords. For example, “char dataBuffer = (char)ALLOCA(100 * sizeof(char))” would be converted to the sequence “[char, dataBuffer, =, (, char, ), ALLOCA, (, 100, *, sizeof, (, char, ), )]”.
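As an illustration, tokenization of this kind can be approximated with a short regular-expression lexer; the following Python sketch is only an approximation of this step (a production pipeline would typically use a real C/C++ lexer), and the regular expression is our own assumption.

import re

# Identifiers, integer constants, common two-character operators, then any
# remaining single non-whitespace symbol.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|==|!=|<=|>=|&&|\|\||[^\sA-Za-z0-9_]")

def tokenize(code_line):
    return TOKEN_RE.findall(code_line)

print(tokenize("char dataBuffer = (char)ALLOCA(100 * sizeof(char))"))
# ['char', 'dataBuffer', '=', '(', 'char', ')', 'ALLOCA',
#  '(', '100', '*', 'sizeof', '(', 'char', ')', ')']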
Following that, user-defined variable names and function names in the source code are mapped to a unified, standardized style. Source code contains a multitude of user-defined function names, variable names, and other identifiers, and different coding styles and personalized naming practices produce code that is logically and functionally similar yet significantly different at the textual feature level. This variation introduces noise that makes it difficult for neural networks to establish semantic correlations between logically similar code fragments, hindering the learning of vulnerability features and ultimately reducing the detection model’s performance. For each token sequence, user-defined variable names and function names are therefore replaced with VAR0, VAR1, …, VARn and FUN0, FUN1, …, FUNm, respectively, in order of first appearance. Additionally, string constants appearing in the source code are uniformly replaced with an empty string. Through this anonymization, source code with similar semantic and structural characteristics acquires a unified naming style, enabling subsequent models to better understand code feature patterns and greatly reducing the size of the vocabulary.
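A minimal Python sketch of this anonymization step follows, assuming tokens have already been produced, that string literals arrive as single tokens, and that keywords and known library functions are excluded from renaming; the keyword and library lists here are truncated and illustrative, not CMFVD’s actual lists.

C_KEYWORDS = {"char", "int", "void", "if", "else", "for", "while",
              "return", "sizeof", "struct"}
LIBRARY_FUNCS = {"memcpy", "strcpy", "malloc", "free", "printf"}

def anonymize(tokens):
    var_map, fun_map, out = {}, {}, []
    for i, tok in enumerate(tokens):
        is_ident = tok[0].isalpha() or tok[0] == "_"
        if tok.startswith('"'):
            out.append('""')  # string constants become an empty string
        elif not is_ident or tok in C_KEYWORDS or tok in LIBRARY_FUNCS:
            out.append(tok)   # keywords, library calls, operators, constants
        elif i + 1 < len(tokens) and tokens[i + 1] == "(":
            # identifier followed by '(' is treated as a user-defined function
            out.append(fun_map.setdefault(tok, f"FUN{len(fun_map)}"))
        else:
            out.append(var_map.setdefault(tok, f"VAR{len(var_map)}"))
    return out

print(anonymize(["char", "dataBuffer", "=", "(", "char", ")",
                 "ALLOCA", "(", "100", "*", "sizeof", "(", "char", ")", ")"]))
# ['char', 'VAR0', '=', '(', 'char', ')', 'FUN0',
#  '(', '100', '*', 'sizeof', '(', 'char', ')', ')']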
Figure 5 illustrates the anonymization process described above.