3.1. An Introduction to Existing Fuzzy Testing Sample Generation Methods
At present, the main methods for generating and mutating test cases are the following:
The method based on symbolic execution [13].
The core idea of this method is to treat the test case as a symbolic value and collect the core constraint information along the test path during processing. A new test case is generated by constraint solving to cover a different program execution path. This method is suitable for testing programs with a simple structure and few execution paths. However, program complexity increases as functionality diversifies, causing an explosion in the number of paths. Because of the resulting complex constraint-solving problems, symbolic execution is difficult to apply to constructing test cases for complex programs.
The method based on taint analysis [10].
The core idea of this method is to mark the taint sources in the input data using dynamic taint analysis, focus on the propagation of the taint, extract the key taint information and use it to guide seed mutation and the generation of related test samples. It is an effective way to construct test samples for certain key execution paths in a program and achieves good code coverage; Angora [10] is a representative example. However, with the application of genetic algorithms and neural networks in fuzzy testing, the low efficiency of taint analysis is gradually becoming a disadvantage.
The method based on evolutionary algorithms [35].
Evolutionary algorithms use core rules of biological evolution to guide the generation of fuzzy testing samples. At present, the genetic algorithm is the most widely used and best-performing evolutionary algorithm. Its core idea is to perform multiple rounds of iterative mutation on test cases, eliminating the test cases that do not meet the requirements according to certain rules, or selecting the best-performing samples as the seeds for the next round of mutation. Genetic algorithms can be used not only to generate new test cases but also to reduce the sample set, further improving the efficiency of fuzzy testing.
The method based on neural networks [14].
As mentioned above, neural networks have a significant performance advantage in solving certain nonlinear problems. We use a bi-LSTM neural network to mutate the seeds on a given execution path to obtain new test cases. Our experiments show that the bi-LSTM neural network has a stronger path depth detection ability on specific key execution paths than taint analysis. Moreover, Learn&Fuzz, proposed by Patrice et al. [17], can improve the code coverage of fuzzy testing. It can therefore be expected that neural networks will play a greater role in the future development of fuzzy testing.
3.2. Formal Definition
In order to facilitate the subsequent description of the algorithm, we give some related concepts and formal definitions of the evaluation index.
We define the program under test as PUT. For CVDF DYNAMIC, the PUT is the corresponding binary executable program, and the corresponding test cases are described in Section 4.1.
A large number of facts show that the number of execution paths of a PUT grows exponentially with the number of its branch conditions, so the test cases cannot completely cover all execution paths. Therefore, in fuzzy testing, the problem of sample set coverage is transformed into the minimum set covering problem (SCP) [36], which is NP-hard [37]. The simplest algorithmic idea is to use a greedy algorithm to find an approximately optimal solution. The SCP is formally described as follows:
Let A = (a_ij) be a 0–1 matrix with m rows and n columns, and let c = (c_j) be an n-dimensional column vector. Let M = {1, …, m} and N = {1, …, n} index the rows and columns of A. Furthermore, let c_j represent the cost of column j; without loss of generality, we assume that c_j > 0. It is specified here that if a_ij = 1, column j covers row i. The essence of the SCP is therefore to find a minimum-cost subset S ⊆ N such that every row i ∈ M is covered by at least one column j ∈ S. A natural mathematical model of the SCP is

min Σ_{j∈N} c_j x_j, subject to Σ_{j∈N} a_ij x_j ≥ 1 for every i ∈ M, x_j ∈ {0, 1} for every j ∈ N,

where x_j = 1 if j ∈ S and x_j = 0 otherwise.
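The greedy approximation mentioned above can be sketched as follows; the toy instance, the column sets and the cost-per-newly-covered-row selection rule are illustrative choices, not the paper's implementation.

```python
def greedy_set_cover(rows, columns, cost):
    """Approximate SCP: repeatedly pick the column with the smallest
    cost-per-newly-covered-row ratio until every row is covered."""
    uncovered = set(rows)
    chosen = []
    while uncovered:
        best = min(
            (j for j in columns if columns[j] & uncovered),
            key=lambda j: cost[j] / len(columns[j] & uncovered),
        )
        chosen.append(best)
        uncovered -= columns[best]
    return chosen

# toy instance: 4 rows, each column given as the set of rows it covers
rows = {1, 2, 3, 4}
columns = {"a": {1, 2}, "b": {3, 4}, "c": {1, 2, 3, 4}}
cost = {"a": 1, "b": 1, "c": 3}
print(greedy_set_cover(rows, columns, cost))  # -> ['a', 'b']
```

Here "a" and "b" win over "c" because their ratio (1/2) beats the ratio of "c" (3/4); the greedy rule gives an approximately optimal, not necessarily optimal, cover.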
In fuzzy testing, a PUT has many program execution paths that may contain vulnerabilities, so the generated fuzzy testing samples should cover as many of these paths as possible. For a given program execution path, more than one vulnerability may be detected, and different program execution paths can detect different numbers of vulnerabilities. We define the total number of vulnerabilities detected by the fuzzy testing sample on the current path as n_d, the total number of vulnerabilities contained in the current path as n_t and the weight of the total number of vulnerabilities contained in the current path as ω. The path detection ability d is a weighted result, computed as shown in Equation (1):

d = ω · n_d / n_t. (1)

Here, ω increases with the number of vulnerabilities in the current path. This is because the number of vulnerabilities differs between paths: for the same mutation method applied to the same fuzzy testing seed, the more vulnerabilities a path contains, the smaller the ratio n_d / n_t becomes. If the weight were a constant, the value of d would decrease, and the path depth detection ability of a test case generation method could not be measured objectively.

Suppose that a program under test has k execution paths. We define the average path detection ability D_avg as

D_avg = (1/k) Σ_{i=1}^{k} d_i,

which measures the ability of a fuzzy testing tool to detect the overall path depth.
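Writing n_d for the detected and n_t for the total vulnerabilities on a path, the per-path and average metrics can be sketched as follows. The logarithmic weight is purely an illustrative assumption (it grows with n_t, as required), since the section does not fix the form of the weight.

```python
import math

def path_detection(n_detected, n_total):
    """Per-path detection ability d = w * n_detected / n_total.
    The weight w must grow with n_total; the log form below is an
    illustrative assumption, not the paper's actual weight."""
    w = math.log2(1 + n_total)
    return w * n_detected / n_total

def average_detection(paths):
    """Average path detection ability over k paths, given (detected, total) pairs."""
    return sum(path_detection(nd, nt) for nd, nt in paths) / len(paths)

paths = [(1, 1), (2, 4), (3, 8)]  # (detected, total) per path
print(average_detection(paths))
```

With a constant weight instead, a path holding many vulnerabilities would always score low, which is exactly the distortion the weight is meant to correct.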
3.3. CVDF DYNAMIC Fuzzy Testing Sample Generation
The complete process of fuzzy testing sample generation in CVDF DYNAMIC is shown in Figure 1.
In the fuzzy testing part, we draw on the idea of ensemble learning from artificial intelligence. The seeds are mutated by the genetic algorithm to generate one set of test cases, and by the bi-LSTM neural network to generate another set. Finally, the two sets of test cases are integrated to obtain the final set of test cases.
Because the sample set obtained by integrating the two methods may be too large, which reduces the efficiency of fuzzy testing, we use a heuristic genetic algorithm to reduce the sample set. Finally, the reduced sample set is used for fuzzy testing, and the parameters of the bi-LSTM neural network are optimized according to the result feedback.
3.3.1. Theoretical Model and Training Process of BI-LSTM Neural Network
The bi-LSTM neural network training process of CVDF DYNAMIC is shown in Figure 2.
- (a)
Preprocessing and Vectorization
We preprocess the training dataset, including unifying the input format of the test cases and changing the format of some binary executable programs, so that they can adapt to the input of the neural network without changing the logic function of the original program.
Then, we use the PTFuzz tool, which obtains the program execution path using the Intel Processor Trace (Intel PT) module. PTFuzz improves on AFL by removing the dependence on program instrumentation; instead, it uses PT to collect and filter packet information and finally obtains the execution path of the current seed from that packet information. To achieve this, our hardware environment must be based on an Intel CPU platform and run an appropriate version of the Linux system. Since PTFuzz stores the program execution path information in data packets, in order to obtain execution path information that can be used to train the neural network, we need to decode the data packets in the corresponding memory and recover the complete program execution path according to the entry, exit and other relevant information of each packet. The pseudocode of Algorithm 1, which extracts the program execution path, is as follows:
Algorithm 1. Extracting program execution path

Func ExtractPath(binary-source-code)
1: Start = LoadBinaryProgram(binary-source-code)
2: ProgStaddr = GetProgramEntry(Start)
3: ExecutionPath = []
4: while True:
5:   PackagePath = LoadCurrentPackage(ProgStaddr)
6:   ExecutionPath +|= PackagePath
7:   if ProgStaddr == JumpNextInstrument():
8:     ProgStaddr = GetNextInstruAddr()
9:   if ProgStaddr == EndOfMemSpace():
10:    break
11: return ExecutionPath
End Func
In the pseudocode, JumpNextInstrument() and EndOfMemSpace() are two judgment functions, which respectively determine whether to jump to the next instruction address and whether the end of the memory region holding the PTFuzz packets has been reached. The ExecutionPath variable forms a complete program execution path by continuously appending the PackagePath variable after decoding; +|= denotes this concatenation operation.
After extracting the program execution path, we need to convert the path, which consists of instruction bytecodes, into vector form while preserving the semantic information of the original program execution path as much as possible.
We use the word2vec tool, regarding a complete program execution path as a sentence and an instruction as a word. Specifically, we treat the hexadecimal code of an instruction as a token and then use word2vec to train on the corresponding bytecode sequences. In order to preserve as much context information of the program execution path as possible, we choose the Skip-Gram model in word2vec, because it often performs better on large corpora. The Skip-Gram model structure is shown in Figure 3.
Finally, we need to transform the output of word2vec into an equal-length coded input that can serve as the input vector of the neural network. We set a maximum length MaxLen. When the output length of word2vec is less than MaxLen, we pad the back end with zeros up to MaxLen. When the output length is greater than MaxLen, we truncate it from the front end down to MaxLen.
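The tokenization and pad/truncate steps can be sketched as follows; the toy embedding stands in for the trained word2vec lookup (an assumption here), and the MaxLen handling follows the rule just described.

```python
def encode_path(instruction_bytes, max_len, embed):
    """Turn a program execution path (a list of instruction bytecodes) into
    a fixed-length vector sequence: embed each hex token, then pad the back
    end with zero vectors or truncate from the front end to max_len entries.
    `embed` stands in for the trained word2vec lookup (an assumption)."""
    tokens = [b.hex() for b in instruction_bytes]   # one hex token per instruction
    vectors = [embed(t) for t in tokens]
    if len(vectors) < max_len:                      # pad at the back end
        dim = len(vectors[0]) if vectors else 1
        vectors += [[0.0] * dim] * (max_len - len(vectors))
    else:                                           # truncate from the front end
        vectors = vectors[len(vectors) - max_len:]
    return vectors

# toy embedding: map each token to a 2-dimensional vector
toy_embed = lambda tok: [float(len(tok)), 1.0]
path = [b"\x55", b"\x48\x89\xe5", b"\xc3"]          # push rbp; mov rbp,rsp; ret
print(len(encode_path(path, 5, toy_embed)))         # -> 5
```

Truncating from the front keeps the most recent instructions, which is consistent with the goal of preserving the context closest to the path's end.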
- (b)
BI-LSTM neural network structure and parameter optimization
The neural network structure we choose is bi-LSTM.
Bi-LSTM has excellent performance in dealing with long-term dependency problems, such as statement prediction and named entity recognition [38]. The statements associated with vulnerability characteristics may be far apart in the whole program execution path, so we need the bi-LSTM neural network structure to retain long-term memory of the information related to the vulnerability characteristics. In order to make the bi-LSTM neural network suitable for fuzzy testing, we modify the corresponding rules of the input gate, output gate and forget gate of the bi-LSTM. The specific structure of a single LSTM neuron and the specific rules of the input gate, output gate and forget gate are shown in Figure 4.
The number of hidden layers in the bi-LSTM neural network, the number of epochs, the batch size and other parameters affect the final performance of the network. According to the experiments in Section 4.2, we set the number of hidden layers to 5, the batch size to 64 and the dropout rate to 0.4, use the BPTT back-propagation algorithm to adjust the network weights, and use stochastic gradient descent (SGD) to prevent the model from falling into a local optimum. For the hyperparameters of the bi-LSTM neural network, we use dichotomy to accelerate the selection of the corresponding values.
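A dichotomy-style search for a single hyperparameter might look like the following sketch. The assumption that the validation score is unimodal in the hyperparameter, the ternary-style interval split and the toy score function are ours; the section only states that dichotomy is used to accelerate selection.

```python
def dichotomy_search(score, lo, hi, tol=1e-3):
    """Interval-halving search for the hyperparameter value maximizing
    `score`, assuming `score` is unimodal on [lo, hi]. Each iteration
    discards a third of the interval, so only O(log(1/tol)) evaluations
    of the (expensive) training/validation run are needed."""
    while hi - lo > tol:
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if score(m1) < score(m2):   # maximum lies to the right of m1
            lo = m1
        else:                       # maximum lies to the left of m2
            hi = m2
    return (lo + hi) / 2

# toy score peaking at a dropout rate of 0.4
best = dichotomy_search(lambda r: -(r - 0.4) ** 2, 0.0, 1.0)
print(round(best, 2))  # -> 0.4
```

Compared with a grid sweep, this converges to the chosen dropout rate in a handful of trials, which is the acceleration the dichotomy is meant to provide.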
Figure 5 shows the complete structure of the bi-LSTM neural network.
As shown in Figure 5, we pass the coded input of length MaxLen through several bi-LSTM hidden layers to extract clearer context dependencies. The output of the last bi-LSTM hidden layer passes through a feed-forward neural network layer and a sigmoid activation function. The sigmoid activation function also normalizes the final output vector, which is the vector form of the fuzzy testing sample generated by the bi-LSTM neural network.
3.3.2. Genetic Algorithm for Constructing Test Cases
The core of the genetic algorithm used to construct samples can be divided into several parts: population initialization, tracking and executing the program under test, fitness calculation, and individual selection, crossover and mutation. The overall structure is shown in Figure 6.
- (a)
Population initialization
In a genetic algorithm, the population is composed of several individuals, each of which we abstract as a chromosome. Let the length of the chromosome be L, i.e., the number of bytes of test data. An individual in the population can then be expressed as X = (x_1, x_2, …, x_L). Population initialization assigns a value to each gene x_i of X. When initial test data exist, each byte of the initial test data is used to assign the corresponding x_i; otherwise, the whole population is initialized by random assignment.
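The initialization rule above can be sketched as follows, assuming byte-valued genes; the GIF header used as seed data is only an example input format, not one from the paper.

```python
import random

def init_population(pop_size, chrom_len, seed_data=None):
    """Build the initial population: each individual is a chromosome of
    `chrom_len` gene bytes. If initial test data exist, every individual
    starts from those bytes (padded with zeros if the seed is short);
    otherwise genes are assigned at random."""
    population = []
    for _ in range(pop_size):
        if seed_data is not None:
            genes = list(seed_data[:chrom_len])
            genes += [0] * (chrom_len - len(genes))      # pad short seeds
        else:
            genes = [random.randrange(256) for _ in range(chrom_len)]
        population.append(genes)
    return population

pop = init_population(4, 8, seed_data=b"GIF89a")
print(pop[0])  # -> [71, 73, 70, 56, 57, 97, 0, 0]
```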
- (b)
Tracking and executing the program under test
Tracking is divided into two aspects:
Because each program can be divided into many basic blocks during execution, the essence of program execution is the process of executing basic blocks and jumping between them.
Each basic block has only one entry and one exit: the program enters at the entry and leaves at the exit. Therefore, we can use the entry address Inaddr of a basic block to represent it, and the program execution process can be expressed as a sequence of basic blocks (B_1, B_2, …, B_k). We define the jump between two consecutive basic blocks as an edge E_i = (B_i, B_{i+1}), where 1 ≤ i ≤ k − 1.
Obviously, if every basic block is regarded as a node of a graph, then each E_i is an edge of the graph. Since a basic block may be executed multiple times in the execution sequence, the graph is directed. In this case, the execution path of the program can be expressed as a sequence of edges (E_1, E_2, …, E_{k−1}).
Because some basic blocks may be repeated many times during program execution, some edges may appear many times. We merge identical edges to obtain a set of edges annotated with their numbers of occurrences, analyze the frequency statistics of this set and further divide the occurrence counts into the groups 1, 2–3, 4–7, 8–15, 16–31, 32–63, 64–127 and 128+.
The significance of this classification is that each group can be represented by a different bit of a single byte, which improves the processing speed of the program. Finally, we obtain a new set annotated with this occurrence-group information.
We use the above processing method for each basic block to get the final program execution path information.
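The grouping above (AFL-style hit-count bucketing) can be sketched as follows; mapping each group to one bit of a byte is exactly what makes later coverage comparisons a cheap bitwise operation.

```python
def bucket(count):
    """Map an edge's raw hit count to one of the eight groups
    1, 2-3, 4-7, 8-15, 16-31, 32-63, 64-127, 128+, each group
    being represented by a single bit of one byte."""
    bounds = [1, 2, 4, 8, 16, 32, 64, 128]  # lower bound of each group
    for bit, lo in reversed(list(enumerate(bounds))):
        if count >= lo:
            return 1 << bit                 # one bit per group
    return 0                                # edge never hit

def classify_edges(edge_counts):
    """Collapse repeated edges into {edge: group-bit} occurrence info."""
    return {edge: bucket(n) for edge, n in edge_counts.items()}

counts = {("b1", "b2"): 1, ("b2", "b3"): 5, ("b3", "b1"): 200}
print(classify_edges(counts))  # -> {('b1', 'b2'): 1, ('b2', 'b3'): 4, ('b3', 'b1'): 128}
```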
- (c)
Fitness calculation
By tracking the program under test, we can see that execution path information can be expressed as a sequence of edges. Therefore, in order to find new execution paths and improve the path coverage of CVDF DYNAMIC, we need to calculate fitness. We define the set of edge sequences as {E_1, E_2, …, E_n}, where each E_i is equivalent to a pair of basic blocks (B_i, B_{i+1}). For any edge in this set, let the final test data be D. We can then obtain a binary set of edge information related to the test data, as shown in Equation (2):

{(E_i, D) | 1 ≤ i ≤ n}. (2)
It is not difficult to see that this is essentially a weighted digraph, with the test data as the weights. We define the fitness f of an individual as consisting of two functions, as shown in Equations (3) and (4): they find, respectively, the number f_1 of new edges and the number f_2 of edges associated with them already in the edge set.
First, the fitness of each individual is calculated; then, after the set is updated, the fitness of each individual is calculated again. The two sets used to calculate the fitness are updated after each round of testing. When comparing two individuals, f_1 is compared first; if the individuals cannot be distinguished by f_1, then f_2 is compared.
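One plausible reading of the two fitness functions is sketched below: f_1 counts edges on the individual's path that are new to the global edge set, and f_2 counts edges it shares with the set. Representing f as a tuple makes "compare f_1 first, then f_2" the natural tuple ordering.

```python
def fitness(path_edges, known_edges):
    """f1: edges on this individual's path not yet in the global edge set
    (new coverage); f2: edges it shares with the set. Comparing the
    tuples (f1, f2) implements 'f1 first, f2 as tie-breaker'."""
    f1 = len(path_edges - known_edges)
    f2 = len(path_edges & known_edges)
    return (f1, f2)

known = {("b1", "b2"), ("b2", "b3")}
a = fitness({("b1", "b2"), ("b2", "b4")}, known)  # one new edge
b = fitness({("b1", "b2"), ("b2", "b3")}, known)  # no new edges
print(max(a, b))  # tuple comparison picks the individual with new coverage
```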
- (d)
Individual selection, crossover and variation
Our individual selection method uses elite selection to produce new individuals: a strategy in genetic algorithms that allows the individuals with the highest fitness to enter the next generation. The crossover method is the 2-opt transformation: a number of random numbers are generated as crossover points, and the chromosome fragments between the crossover points are exchanged. Rather than using random mutation, this paper proposes a control mutation method to improve the effect of mutation. A motivating example for Algorithm 2, Control Mutation, is as follows:
Algorithm 2. Control Mutation

Func ControlPROC(X, Y)
1: A = 1, B = 1
2: IF Y >= B THEN
3:   FORK1: A = A × X, B = B + 1
4: ELSE:
5:   IF X >= A THEN
6:     FORK2: A = A + X, B = B − 1
7:   ELSE:
8:     FORK3: A = A − X, B = B / 2
9: RETURN A
End Func
Assume the input data format of the program is (X, Y), the template data are (1, 1) and the mutation operator replaces a field with 0. Mutation can then generate the two test data (0, 1) and (1, 0), which cover FORK1 and FORK2. This form of testing cannot achieve 100% branch coverage because FORK3 is never covered. With control mutation, when the test data generated by a mutation make the program enter the new branch FORK2, the field mutated in that round is marked as an immutable field, and further mutation is carried out on the basis of those test data. In this example, control mutation marks Y = 0 as an immutable field and mutates the remaining field X to 0, producing the test data (0, 0), which cover FORK3.
The control mutation strategy consists of the test data that make the program enter a new branch together with the corresponding control information. The control mutation process is as follows. First, the control mutation strategy is taken from the policy database, and the test data that entered the new branch are taken as the mutation template. Second, each byte of the template is checked against the stored control information to confirm whether it is marked as control information; if so, the next byte is checked; if not, the byte is modified using a random mutation strategy, test data are generated and fuzzy testing is executed, and then the next byte is checked. Finally, after all bytes have been checked, one round of mutation is complete, and the above process is repeated.
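The process above can be sketched as follows. The `coverage_of` callback stands in for actually running the PUT and observing which branch it enters (an assumption of this sketch), and the toy coverage function in the usage example mirrors the branch conditions of Algorithm 2.

```python
import random

def control_mutate(template, immutable, coverage_of):
    """One round of control mutation. Bytes in `immutable` are control
    information (they previously steered the program into a new branch)
    and are skipped; every other byte is mutated in turn. Whenever a
    mutant reaches a branch not seen before, that byte is frozen and
    further mutation continues on top of the mutant."""
    seen = set(coverage_of(list(template)))   # branches the template already hits
    data = list(template)
    for i in range(len(data)):
        if i in immutable:
            continue                          # skip marked control information
        mutant = list(data)
        mutant[i] = random.randrange(256)     # random mutation of one field
        branches = coverage_of(mutant)
        if branches - seen:                   # mutant entered a new branch
            immutable.add(i)                  # mark this field as immutable
            data = mutant                     # mutate further from these data
            seen |= branches
    return data, immutable

# toy coverage modeled on Algorithm 2, with (X, Y) = (d[0], d[1])
cov = lambda d: {"FORK1"} if d[1] >= 1 else ({"FORK2"} if d[0] >= 1 else {"FORK3"})
data, frozen = control_mutate([1, 1], set(), cov)
print(len(data), frozen.issubset({0, 1}))  # -> 2 True
```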
After completing the above operations, one iteration of the genetic algorithm is finished, and the newly generated chromosome data are taken as the test data for the next round of mutation; that is, the mutation iterates continuously.
3.3.3. Integrating New Test Data with Integration Idea
Firstly, through the above genetic algorithm, test cases with high path coverage are constructed from the original test case seeds. Then, for the test cases located on different execution paths, the bi-LSTM neural network is used to construct test cases with stronger path depth detection ability. Finally, we integrate the test case sets constructed by the two methods to obtain the final test case set. Considering that the test case set generated by the two methods may be too large, reducing the efficiency of fuzzy testing, this paper uses a heuristic genetic algorithm to reduce the integrated test case set, ensuring that the efficiency of fuzzy testing is improved without losing test performance.
3.3.4. Using Heuristic Genetic Algorithm to Reduce Sample Set
In order to reduce the sample set while losing as little fuzzy testing performance as possible, the screening principle of the heuristic genetic algorithm in this paper is to give priority to the samples with stronger code coverage and path depth detection ability. The remaining test samples are then selected in order of decreasing test performance, until the performance indices essentially cover those of the original fuzzy testing sample set (see the experiments in Section 4.4 for specific results). Here, our heuristic algorithm is a selection-mutation algorithm on chromosomes.
- (a)
Using a compression matrix to represent chromosomes
At present, the common chromosome representation method is a 0–1 matrix [39]. Each element of a row vector of the 0–1 matrix is 0 or 1. As mentioned earlier, we treat the set of basic block addresses as the element set, and each basic block is equivalent to a gene in the genetic algorithm. Therefore, a 1 in the 0–1 matrix indicates that a basic block is present in the sample, while a 0 indicates that it is not. In this way, the samples together form a 0–1 matrix, and the set of genes in each column is equivalent to a chromosome. Considering the complexity of program execution paths, the 0–1 matrix is sparse; storing it directly in 0–1 form would significantly reduce space efficiency. Therefore, this paper compresses the 0–1 matrix. Our storage format is a sequence of triples (v, x, y), where v is the stored value 1, and x and y are its X and Y coordinates in the original matrix. Since the value of v is 1 by default, it can be omitted in the actual implementation.
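The triple compression can be sketched as follows; since the stored value is always 1, only the coordinate pairs need to be kept.

```python
def compress(matrix):
    """Store a sparse 0-1 chromosome matrix as coordinate pairs: each
    triple (v, x, y) has v = 1 by default, so v is omitted and only
    the (x, y) coordinates of the 1-entries are kept."""
    return [(x, y) for x, row in enumerate(matrix)
                   for y, v in enumerate(row) if v == 1]

def expand(pairs, n_rows, n_cols):
    """Recover the original 0-1 matrix from the coordinate pairs."""
    matrix = [[0] * n_cols for _ in range(n_rows)]
    for x, y in pairs:
        matrix[x][y] = 1
    return matrix

m = [[0, 1, 0],
     [0, 0, 0],
     [1, 0, 1]]
print(compress(m))  # -> [(0, 1), (2, 0), (2, 2)]
```

For a sparse matrix, the pair list grows with the number of 1-entries rather than with m × n, which is the space saving the compression is after.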
- (b)
Using heuristic genetic algorithm to improve chromosome
Each chromosome has its own independent gene sequence, but there will also be a large number of repeated and overlapping genes. Therefore, as mentioned above, we should solve the SCP when carrying out set coverage and reduce set redundancy as much as possible. Therefore, the heuristic function of the heuristic genetic algorithm is mainly reflected in eliminating the redundancy caused by gene duplication and screening better chromosomes through genetic iteration.
The specific algorithm is described as follows:
We recover each chromosome from the position information in the compression matrix. For genes in the same column, more "1" values indicate a higher performance priority for that column, so we select it first, mark the selected column and continue in the same way. Subsequently, we perform gene exchange on chromosomes. Assume that there are two different chromosomes P_1 and P_2 in the parent generation; after chromosome exchange, we obtain the children's chromosomes C_1 and C_2, and assume that C_1 and C_2 can cover the set R. We use sets U_1 and U_2 to store the row numbers not yet covered by the genes, and sets G_1 and G_2 to store the genes contained in C_1 and C_2. First, we calculate the performance priority of each gene in the parents P_1 and P_2, that is, count the number of "1" values in each column for screening. Then, we screen out the genes with the highest performance priority in P_1 and P_2, copy them to C_1, record the genes contained in G_1 and delete the genes contained in G_1 from the parents. Next, we calculate the difference set R − G_1 and store its row numbers in set U_1. We then continue to arrange the remaining genes of P_1 and P_2 using the same performance-priority selection method and put them into C_1 again; the remaining genes are put into C_2.
In the process of gene selection and gene exchange, there are special cases in which genes have the same performance. In that case, we need to screen further to obtain the optimal gene. Suppose that there are two genes g_1 and g_2 with the same performance priority, and consider the uncovered rows stored in set U_1. We then compare how many rows of U_1 each of g_1 and g_2 covers and keep the gene with the larger result. Considering that the genetic algorithm also includes a mutation process, the above calculation should be carried out both before and after mutation to ensure that the optimal result is always selected.
From the above description, the heuristic genetic algorithm proposed in this paper uses the compression matrix on the basis of the original population and selects the optimal chromosome according to the way of gene selection and gene exchange. Therefore, this heuristic genetic algorithm essentially does not change the workflow of ordinary genetic algorithm, but through the optimization of search conditions, it simplifies the sample set and further improves the efficiency of fuzzy testing.
The specific process of the ordinary genetic algorithm has been described above. The heuristic genetic algorithm is different from ordinary genetic algorithm in the following aspects:
- (c)
Parent selection
There are three common methods of parent selection: random selection, tournament selection and roulette wheel selection. Here, we use the roulette wheel method; the specific operation is as follows:
Step 1: Calculate the fitness f_i (i = 1, 2, …, n) of each individual in the population, where n is the population size.
Step 2: Calculate the probability of each individual being inherited into the next generation population: p_i = f_i / Σ_{j=1}^{n} f_j.
Step 3: Calculate the cumulative probability distribution of each individual: q_i = Σ_{k=1}^{i} p_k.
Step 4: Generate a uniformly distributed pseudo-random number rand in the interval [0, 1].
Step 5: If rand ≤ q_1, individual 1 is chosen; otherwise, if q_{k−1} < rand ≤ q_k, individual k is chosen.
Step 6: Repeat step 4 and step 5 several times, and the number of repetitions depends on the size of the population.
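Steps 1 to 5 can be sketched as follows; passing `rand` explicitly makes the selection reproducible for illustration, while omitting it draws the pseudo-random number as in Step 4.

```python
import random

def roulette_select(fitnesses, rand=None):
    """Roulette wheel selection: normalize fitnesses to probabilities p_i,
    accumulate them into the distribution q_i, draw rand in [0, 1] and
    return the first individual k with rand <= q_k."""
    total = sum(fitnesses)
    probs = [f / total for f in fitnesses]            # step 2
    r = random.random() if rand is None else rand     # step 4
    q = 0.0
    for k, p in enumerate(probs):                     # steps 3 and 5
        q += p
        if r <= q:
            return k
    return len(fitnesses) - 1                         # guard against rounding error

fits = [1.0, 3.0, 6.0]                                # q = 0.1, 0.4, 1.0
print([roulette_select(fits, rand=r) for r in (0.05, 0.3, 0.9)])  # -> [0, 1, 2]
```

Fitter individuals occupy a wider slice of the wheel, so they are drawn more often without ever excluding the weaker ones; repeating the draw (Step 6) fills the parent pool.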
- (d)
Cross rate selection
Crossover is the main way to produce new individuals. The crossover rate determines the number of chromosomes entering the crossover pool. A reasonable crossover rate ensures that new individuals are produced continuously in the crossover pool without producing so many that the genetic order is destroyed. This paper adopts the widely used adaptive crossover rate method.
- (e)
Mutation rate selection
The mutation rate is the proportion of mutated genes relative to the total number of genes in a population. Because mutation is a way to produce new individuals, we can control it by setting the number of mutated genes or the rate of random mutation. Too low a mutation rate leads to too few chromosomes participating in mutation, so that chromosomes containing unique genes cannot enter the set. Too high a mutation rate causes too many chromosomes to participate in mutation, generating illegal data and increasing the time cost. After experiments and model tuning, the final mutation rate is 0.5.
- (f)
Elite ratio
The elite ratio means that the individuals with the highest fitness in the current population do not participate in crossover and mutation operations but instead replace the individuals with the lowest fitness in the population after those operations.
After the experiment and model optimization, the final elite ratio is 0.06.
- (g)
Stopping Criteria
The genetic algorithm goes through several rounds of iterative evolution until it reaches the ideal result or the threshold number of iterations. For the heuristic genetic algorithm, the iteration threshold is 25.