Abstract

This paper proposes a decision tree model for specifying the importance of 21 factors causing landslides in a wide area of Penang Island, Malaysia. These factors are vegetation cover, distance from the fault line, slope angle, cross curvature, slope aspect, distance from road, geology, diagonal length, longitude curvature, rugosity, plan curvature, elevation, rain precipitation, soil texture, surface area, distance from drainage, roughness, land cover, general curvature, tangent curvature, and profile curvature. Decision tree models are used for prediction, classification, and ranking factor importance and are usually represented by an easy-to-interpret tree-like structure. Four models were created using the Chi-square Automatic Interaction Detector (CHAID), Exhaustive CHAID, Classification and Regression Tree (CRT), and Quick-Unbiased-Efficient Statistical Tree (QUEST) algorithms. The 21 factors were extracted using digital elevation models (DEMs) and then used as input variables for the models. A data set of 137572 samples was selected for each variable in the analysis, where 68786 samples represent landslides and 68786 samples represent no landslides. 10-fold cross-validation was employed for testing the models. The highest accuracy was achieved by Exhaustive CHAID (82.0%), compared to the CHAID (81.9%), CRT (75.6%), and QUEST (74.0%) models. Across the four models, five factors were identified as the most important: slope angle, distance from drainage, surface area, slope aspect, and cross curvature.

1. Introduction

Landslides are among the most destructive natural disasters, causing loss of life and billions of dollars in damage annually worldwide. They pose a threat to the safety of human lives, the environment, resources, and property [1]. Landslide susceptibility is defined as the propensity of an area to generate landslides [2]. Assuming that landslides will occur in the future under the same conditions that produced them in the past, susceptibility assessments can be used to predict the geographical location of future landslides [3–5]. Because landslides occur frequently and over an extensive range, they have attracted the attention of many scientists, some of whom have focused on landslide susceptibility mapping [6, 7]. Through scientific analysis of landslide susceptibility mapping, we can assess and locate areas susceptible to landslides. Furthermore, such analysis allows one to take proper precautions to reduce the negative impacts of landslides [8].

Many studies have been conducted to detect landslides and to analyze landslide hazard using Geographic Information Systems (GIS) and remote sensing [9–13]. Recently, with the development of GIS data processing techniques, quantitative studies have been applied to landslide susceptibility analysis using various techniques. Such studies can be classified by the techniques used, such as probabilistic methods [14–18], logistic regression [19–21], and artificial neural networks [22–25]. Most of these studies aimed to increase the accuracy of landslide prediction by finding suitable techniques for the respective study area. The objective of this study was to propose the best decision tree model for determining the most important factors contributing to landslide susceptibility. The decision tree is a popular classification technique and represents a good compromise between comprehensibility, accuracy, and efficiency [26].

Statistical decision tree models have been successfully used to classify and to estimate land use, land cover, and other geographical attributes from remote sensing data [27, 28]. The decision tree, having its origin in machine learning theory, is an efficient tool for classification and estimation. Unlike other statistical methods, a decision tree makes no statistical assumptions, can handle data represented on different measurement scales, and is computationally fast [22]. Decision trees also have the advantage that the estimation process and the order of important explanatory variables are explicitly represented by the tree structure [29]. In addition, recent developments in computer technology, pattern recognition algorithms, and automatic methods of decision tree design have enabled the wider use of decision tree models.

Pal and Mather [30] demonstrated the advantages of the decision tree for land cover classification in comparison with other classifiers, such as the maximum likelihood method and artificial neural networks. Saito et al. [2] used decision tree models to analyze the distribution of landslides that were almost suspended or dormant, and indicated that decision tree models are useful for estimating landslide distributions. Bui et al. [31] evaluated the decision tree for landslide prediction in Vietnam: it performed well compared with Support Vector Machines (SVM) and Naive Bayes models, and it showed a good ability to determine the important factors causing landslides. Pang et al. [32] produced a landslide hazard map of Penang Island using Quinlan's C4.5 decision tree algorithm, with twelve landslide causative factors used in their study. Pradhan [33] used three models, a decision tree, SVM, and an adaptive neurofuzzy inference system (ANFIS), to produce the landslide hazard map for the Penang Hill area; the decision tree performed better than the SVM and ANFIS classifiers.

In this paper, four decision tree methods were used to build the optimum decision models: Chi-square Automatic Interaction Detector (CHAID), Exhaustive CHAID, Classification and Regression Tree (CRT), and Quick-Unbiased-Efficient Statistical Tree (QUEST). Twenty-one factors were selected as the input variables of the decision trees. A data set of 137572 samples from Penang Island in Malaysia was used for building the decision trees. The experiment comprised ten rounds according to the different partitions of training and test sets.

2. Decision Trees

A decision tree is a technique for finding and describing structural patterns in data as tree structures; it does not require the relationship between the input variables and the objective variable to be specified in advance. This technique helps to explain data and to make predictions from them [34]. A decision tree can also handle data measured on different scales, without any assumptions concerning the frequency distributions of the data, and can capture nonlinear relationships [35]. Therefore, all variables were put into the decision tree model.

The main purpose of using the decision tree is to achieve a more concise and perspicuous representation of the relationship between an objective variable and explanatory variables. In particular, a decision tree can be visualized easily; unlike a neural network, it is not a “black box.”

The decision tree is based on a multistage or hierarchical decision scheme (tree structure). The tree is composed of a root node, a set of internal nodes, and a set of terminal nodes (leaves). Each node of the decision tree structure makes a binary decision that separates either one class or some of the classes from the remaining classes. The processing is carried out by moving down the tree until a terminal node is reached. In a decision tree, features that carry maximum information are selected for classification, while the remaining features are rejected, thereby increasing computational efficiency [36]. The top-down induction of the decision tree indicates that variables higher in the tree structure are more important.
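To make the traversal concrete, here is a minimal sketch with hypothetical factors and thresholds (not taken from the paper's trees): starting at the root, each node's binary decision routes a case downward until a leaf assigns a class.

```python
# Minimal top-down traversal of a hand-built tree (hypothetical
# thresholds and factors, for illustration only).
def classify(case):
    if case["slope_angle"] < 25.0:          # root node decision
        if case["dist_drainage"] > 100.0:   # internal node decision
            return 0                        # leaf: no landslide
        return 1                            # leaf: landslide
    return 1                                # leaf: landslide

print(classify({"slope_angle": 30.0, "dist_drainage": 50.0}))  # -> 1
```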

There are three main families of decision tree algorithms used here: CRT; CHAID and Exhaustive CHAID; and QUEST. All of them follow the same general steps: start tree building by recursively splitting nodes and assigning nodes to classes, stop tree building, select the optimal tree, and perform cross-validation [37]. CRT performs tree “pruning” before producing the optimal tree, while the CHAID methods perform statistical tests at each splitting step.

2.1. Classification and Regression Tree (CRT)

CRT is a recursive partitioning method that can be used for both regression and classification. A CRT tree is constructed by repeatedly splitting subsets of the data set using all predictor variables to create two child nodes, beginning with the entire data set. The best predictor is chosen using one of a variety of impurity or diversity measures (Gini, twoing, ordered twoing, and least-squared deviation). The goal is to produce subsets of the data that are as homogeneous as possible with respect to the target variable [38]. In this study, we used the Gini impurity measure, which applies to categorical target variables.
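CART/CRT-style binary recursive partitioning with the Gini measure is available in common libraries; for example, scikit-learn's DecisionTreeClassifier grows a CART-type binary tree. A minimal sketch on synthetic data (the library and the data are illustrative stand-ins, not the paper's implementation):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.random((1000, 21))                 # 21 normalized factors in [0, 1]
y = (X[:, 2] + X[:, 15] > 1).astype(int)   # synthetic landslide labels

# CART/CRT-style recursive binary partitioning with the Gini measure.
tree = DecisionTreeClassifier(criterion="gini", max_depth=5).fit(X, y)
print(f"training accuracy: {tree.score(X, y):.3f}")
```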

Gini Impurity Measure. The Gini index at node $t$, $g(t)$, is defined as
$$g(t) = \sum_{j \neq i} p(j \mid t)\, p(i \mid t),$$
where $i$ and $j$ are categories of the target variable. The equation for the Gini index can also be written as
$$g(t) = 1 - \sum_{j} p^{2}(j \mid t).$$
Thus, when the cases in a node are evenly distributed across the categories, the Gini index takes its maximum value of $1 - 1/k$, where $k$ is the number of categories for the target variable. When all cases in the node belong to the same category, the Gini index equals 0.

If costs of misclassification are specified, the Gini index is computed as
$$g(t) = \sum_{j \neq i} C(i \mid j)\, p(j \mid t)\, p(i \mid t),$$
where $C(i \mid j)$ is the cost of misclassifying a category $j$ case as category $i$.

The Gini criterion function for split $s$ at node $t$ is defined as
$$\Phi(s, t) = g(t) - p_{L}\, g(t_{L}) - p_{R}\, g(t_{R}),$$
where $p_{L}$ is the proportion of cases in $t$ sent to the left child node and $p_{R}$ is the proportion sent to the right child node. The split $s$ is chosen to maximize the value of $\Phi(s, t)$. This value is reported as the improvement in the tree [39].
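As a concrete check of these formulas, a small sketch computing $g(t)$ and $\Phi(s, t)$ from class labels (plain Python; class proportions are estimated from counts):

```python
from collections import Counter

def gini(labels):
    """Gini index g(t) = 1 - sum_j p(j|t)^2 for the cases in a node."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def improvement(parent, left, right):
    """Gini criterion Phi(s, t) = g(t) - pL * g(tL) - pR * g(tR)."""
    p_left, p_right = len(left) / len(parent), len(right) / len(parent)
    return gini(parent) - p_left * gini(left) - p_right * gini(right)

parent = [1, 1, 1, 0, 0, 0]
print(gini(parent))                               # 0.5, the maximum 1 - 1/k for k = 2
print(improvement(parent, [1, 1, 1], [0, 0, 0]))  # 0.5, a pure (ideal) split
```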

2.2. Chi-Square Automatic Interaction Detector (CHAID) and Exhaustive CHAID

The CHAID method is based on the $\chi^2$-test of association. A CHAID tree is a decision tree that is constructed by repeatedly splitting subsets of the space into two or more child nodes, beginning with the entire data set [40]. To determine the best split at any node, any allowable pair of categories of the predictor variables is merged until there is no statistically significant difference within the pair with respect to the target variable. The CHAID method naturally deals with interactions between the independent variables, which are directly readable from the tree. The final nodes identify subgroups defined by different sets of independent variables [41].

The CHAID algorithm only accepts nominal or ordinal categorical predictors. When predictors are continuous, they are transformed into ordinal predictors before the following algorithm is applied. For each predictor variable $X$, nonsignificant categories are merged. Each final category of $X$ will result in one child node if $X$ is used to split the node. The merging step also calculates the adjusted $p$-value that is to be used in the splitting step. A sketch of the merging step follows the list below.

(1) If $X$ has 1 category only, stop and set the adjusted $p$-value to be 1.
(2) If $X$ has 2 categories, go to step 8.
(3) Else, find the allowable pair of categories of $X$ that is least significantly different (an allowable pair of categories for an ordinal predictor is two adjacent categories; for a nominal predictor it is any two categories). The most similar pair is the pair whose test statistic gives the largest $p$-value with respect to the dependent variable $Y$.
(4) For the pair having the largest $p$-value, check whether its $p$-value is larger than a specified alpha-level $\alpha_{\mathrm{merge}}$. If it is, this pair is merged into a single compound category, and a new set of categories of $X$ is formed. If it is not, go to step 7.
(5) (Optional) If the newly formed compound category consists of three or more original categories, find the best binary split within the compound category for which the $p$-value is smallest. Perform this binary split if its $p$-value is not larger than an alpha-level $\alpha_{\mathrm{split\text{-}merge}}$.
(6) Go to step 2.
(7) (Optional) Any category having too few observations (as compared with a user-specified minimum segment size) is merged with the most similar other category, as measured by the largest of the $p$-values.
(8) The adjusted $p$-value is computed for the merged categories by applying Bonferroni adjustments [42, 43].
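The following is a minimal sketch of the core of the merging step for a nominal predictor, with SciPy's chi2_contingency standing in for the chi-squared test described above and an assumed alpha-level of 0.05; it is an illustration of the idea, not the SPSS implementation.

```python
from itertools import combinations
from scipy.stats import chi2_contingency

def find_mergeable_pair(x, y, alpha_merge=0.05):
    """One CHAID-style merging pass for a nominal predictor x: find the
    pair of categories least significantly different with respect to y
    (largest chi-squared p-value) and report it if p > alpha_merge."""
    classes = sorted(set(y))
    best_pair, best_p = None, -1.0
    for a, b in combinations(sorted(set(x)), 2):  # nominal: any two categories
        # 2 x J contingency table of the candidate pair against y.
        table = [[sum(1 for xi, yi in zip(x, y) if xi == c and yi == k)
                  for k in classes] for c in (a, b)]
        p = chi2_contingency(table)[1]
        if p > best_p:
            best_pair, best_p = (a, b), p
    return best_pair if best_p > alpha_merge else None

x = ["low", "low", "mid", "mid", "high", "high", "high", "low"]
y = [0, 0, 0, 1, 1, 1, 1, 0]
print(find_mergeable_pair(x, y))  # -> ('high', 'mid'): not significantly different
```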

The CHAID algorithm reduces the number of predictor categories by merging categories when there is no significant difference between them with respect to the class. When no more categories can be merged, the predictor can be considered as a candidate for a split at the node. The original CHAID algorithm is not guaranteed to find the best (most significant) split of all of those examined because it uses the last split tested. The Exhaustive CHAID algorithm attempts to overcome this problem by continuing to merge categories, irrespective of significance level, until only two categories remain for each predictor. It then uses the split with the largest significance value rather than the last one tried. Exhaustive CHAID therefore requires more computing time [44].

Calculation of the (unadjusted) $p$-values in the above algorithms depends on the type of the dependent variable. The merging step of both CHAID and Exhaustive CHAID sometimes needs the $p$-value for a pair of categories of $X$ and sometimes needs the $p$-value for all the categories of $X$. When the $p$-value for a pair of categories is needed, only the part of the data in the current node that belongs to those categories is relevant. Let $D$ denote the relevant data. Suppose that in $D$ there are $I$ categories of $X$ and $J$ categories of the dependent variable $Y$ (if $Y$ is categorical). The $p$-value calculation using the data in $D$ is as follows. The null hypothesis of independence of $X$ and the dependent variable $Y$ is tested. To do the test, a contingency (count) table is formed using the classes of $Y$ as columns and the categories of the predictor $X$ as rows. The expected cell frequencies under the null hypothesis are estimated, and the observed and expected cell frequencies are used to calculate the Pearson chi-squared statistic or the likelihood ratio statistic. Here the $p$-value is computed from Pearson's chi-squared statistic
$$X^{2} = \sum_{j=1}^{J} \sum_{i=1}^{I} \frac{(n_{ij} - \hat{m}_{ij})^{2}}{\hat{m}_{ij}},$$
where $n_{ij}$ is the observed cell frequency and $\hat{m}_{ij}$ is the estimated expected cell frequency for cell $(i, j)$ under the independence model. The corresponding $p$-value is given by $p = \Pr\left(\chi^{2}_{d} > X^{2}\right)$ for Pearson's chi-square test, where $\chi^{2}_{d}$ follows a chi-squared distribution with $d = (J - 1)(I - 1)$ degrees of freedom.

In step 8 the adjusted $p$-value is calculated as the $p$-value times a Bonferroni multiplier. The Bonferroni multiplier adjusts for multiple tests. Suppose that a predictor variable originally has $I$ categories and is reduced to $r$ categories after the merging step. The Bonferroni multiplier $B$ is the number of possible ways that $I$ categories can be merged into $r$ categories. For $r = I$, $B = 1$. For $2 \le r < I$, the multiplier is
$$B = \begin{cases} \dbinom{I-1}{r-1} & \text{for an ordinal predictor}, \\[2ex] \displaystyle\sum_{v=0}^{r-1} (-1)^{v}\, \frac{(r-v)^{I}}{v!\,(r-v)!} & \text{for a nominal predictor}. \end{cases}$$
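A direct implementation of this multiplier is straightforward; the nominal-predictor branch evaluates the alternating sum above (which equals the Stirling number of the second kind). This is a sketch of the computation just described, not SPSS code.

```python
from math import comb, factorial

def bonferroni_multiplier(I, r, ordinal=True):
    """Number of possible ways I original categories can merge into r."""
    if r == I:
        return 1
    if ordinal:
        # Ordinal predictor: only adjacent categories may be merged.
        return comb(I - 1, r - 1)
    # Nominal predictor: the alternating sum from the text, equal to the
    # Stirling number of the second kind S(I, r).
    return round(sum((-1) ** v * (r - v) ** I / (factorial(v) * factorial(r - v))
                     for v in range(r)))

print(bonferroni_multiplier(5, 3, ordinal=True))   # C(4, 2) = 6
print(bonferroni_multiplier(5, 3, ordinal=False))  # S(5, 3) = 25
```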

2.3. Quick-Unbiased-Efficient Statistical Tree (QUEST)

QUEST is a binary-split decision tree algorithm for classification and data mining. QUEST can be used with univariate or linear combination splits. A unique feature is that its attribute selection method has negligible bias: if all the attributes are uninformative with respect to the class attribute, then each has approximately the same chance of being selected to split a node [45].

The QUEST tree-growing process consists of the selection of a split predictor, the selection of a split point for the selected predictor, and stopping. In this algorithm, only univariate splits are considered. To select the split predictor, QUEST uses the following algorithm; a sketch of the selection step follows the list.

(1) For each continuous predictor $X$, perform an ANOVA $F$-test of whether all the classes of the dependent variable $Y$ have the same mean of $X$, and calculate the $p$-value according to the $F$ statistic. For each categorical predictor, perform Pearson's $\chi^2$-test of the independence of $X$ and $Y$, and calculate the $p$-value according to the $\chi^2$ statistic.
(2) Find the predictor with the smallest $p$-value and denote it $X^{*}$.
(3) If this smallest $p$-value is less than $\alpha / M$, where $\alpha$ is a user-specified level of significance and $M$ is the total number of predictor variables, predictor $X^{*}$ is selected as the split predictor for the node. If not, go to step 4.
(4) For each continuous predictor $X$, compute Levene's $F$ statistic based on the absolute deviation of $X$ from its class mean to test whether the variances of $X$ for the different classes of $Y$ are the same, and calculate the $p$-value for the test.
(5) Find the predictor with the smallest $p$-value and denote it $X^{**}$.
(6) If this smallest $p$-value is less than $\alpha / (M + M_{1})$, where $M_{1}$ is the number of continuous predictors, $X^{**}$ is selected as the split predictor for the node. Otherwise, the node is not split [45].
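A minimal sketch of this selection step, restricted to continuous predictors (SciPy's f_oneway and levene stand in for the tests above; categorical predictors, which would use Pearson's chi-squared test, are omitted for brevity):

```python
import numpy as np
from scipy.stats import f_oneway, levene

def quest_split_predictor(X, y, alpha=0.05):
    """QUEST-style split-predictor selection for continuous predictors only."""
    M = X.shape[1]                    # total number of predictors
    M1 = M                            # all predictors are continuous here

    def groups(col):
        return [col[y == k] for k in np.unique(y)]

    # Steps 1-3: ANOVA F-test per predictor, Bonferroni threshold alpha / M.
    p_anova = np.array([f_oneway(*groups(X[:, j]))[1] for j in range(M)])
    if p_anova.min() < alpha / M:
        return int(p_anova.argmin())

    # Steps 4-6: Levene's test on deviations from class means,
    # threshold alpha / (M + M1).
    p_lev = np.array([levene(*groups(X[:, j]), center="mean")[1]
                      for j in range(M)])
    if p_lev.min() < alpha / (M + M1):
        return int(p_lev.argmin())
    return None                       # node is not split

rng = np.random.default_rng(1)
y = rng.integers(0, 2, 200)
X = rng.normal(size=(200, 3))
X[:, 0] += y                          # class-dependent mean for predictor 0
print(quest_split_predictor(X, y))    # -> 0
```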

3. Study Area

As shown in Figure 1, this study focuses on Penang Island, which lies between 5°15′ and 5°30′ N latitude and 100°10′ and 100°20′ E longitude. The North Channel separates the study area from the mainland. The island occupies an area of 285 km² and belongs to Penang, one of the 13 states of Malaysia. It is bounded to the north and east by the state of Kedah, to the south by the state of Perak, and to the west by the Strait of Malacca and Sumatra (Indonesia).

The state of Penang consists of both Penang Island and a coastal strip on the mainland known as Province Wellesley. This paper focuses only on the island, where frequent landslides have occurred, threatening lives and damaging property [46, 47]. Heavy rain plays a major role in triggering landslides in the study area. Records from the Malaysian Meteorological Department show that the annual rainfall varies approximately between 2254 mm and 2903 mm in the study area. Penang Island has a tropical climate, with high temperatures of 29°C to 32°C and humidity ranging from 65% to 96%. Topographic elevations vary between 0 m and 820 m above sea level. The slope angle ranges from 0° to 87°, while 43.28% of the island is flat. Geological data from the Minerals and Geosciences Department, Malaysia, show that Ferringhi granite, Batu Maung granite, clay, and sand granite represent more than 72% of the study area's geology. The vegetation cover consists mainly of forests and fruit plantations.

4. Data Collection

An effective intelligent system requires a comprehensive data set. Therefore, 137572 samples were selected for this analysis, where 68786 samples represent landslides and 68786 samples represent no landslides. A digital elevation model (DEM) was then used to extract the 21 topographic factors. The DEM of Penang Island, with five-meter resolution, was obtained from the Department of Survey and Mapping, Malaysia. The extracted factors are vegetation cover, distance from the fault line, slope angle, cross curvature, slope aspect, distance from road, geology, diagonal length, longitude curvature, rugosity, plan curvature, elevation, rain precipitation, soil texture, surface area, distance from drainage, roughness, land cover, general curvature, tangent curvature, and profile curvature. In the previous studies of Penang Island, only the first 14 of these factors were investigated for landslides [48]; the remaining seven factors are applied and investigated for the first time in this study area. Furthermore, the 21 factors represent all the available factors that can cause landslides in the study area. The intelligent system target (landslide history) is represented by 0 for no landslide and 1 for landslide. The data were normalized to the range between 0 and 1 for each of the factors individually. A 10-fold cross-validation analysis was performed as an initial evaluation of the test error of the algorithms. Briefly, this process involves splitting the data set into 10 random segments, using 9 of them for training and the 10th as a test set, and rotating through all 10 folds.
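As an illustration of this data preparation (per-factor min-max rescaling to [0, 1], then a 10-fold partition), the following sketch uses NumPy and scikit-learn as stand-ins for whatever tooling was actually used:

```python
import numpy as np
from sklearn.model_selection import KFold

def normalize_factors(X):
    """Rescale each factor (column) of X independently to the range [0, 1]."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    return (X - lo) / (hi - lo)

X = normalize_factors(np.random.default_rng(0).random((100, 21)) * 820)
for train, test in KFold(n_splits=10, shuffle=True).split(X):
    pass  # 9 segments train the model; the 10th tests it, rotating 10 times
```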

The classification accuracy of each model was calculated as follows. The accuracy of correctly classified landslide (1) samples, the sensitivity, is given by
$$\text{Sensitivity} = \frac{TP}{TP + FN},$$
where $TP$ (true positives) is the number of landslide samples classified as landslide and $FN$ (false negatives) is the number of landslide samples classified as no landslide.

The accuracy of correctly classified no-landslide (0) samples, the specificity, is given by
$$\text{Specificity} = \frac{TN}{TN + FP},$$
where $TN$ (true negatives) is the number of no-landslide samples classified as no landslide and $FP$ (false positives) is the number of no-landslide samples classified as landslide.

The overall accuracy of a decision tree model is given by
$$\text{Overall accuracy} = \frac{TP + TN}{TP + TN + FP + FN}.$$
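A sketch of this evaluation procedure, computing sensitivity, specificity, and overall accuracy from confusion counts pooled across the 10 folds (synthetic data and scikit-learn stand in for the paper's data and SPSS models):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.random((2000, 21))                 # synthetic stand-in for the factors
y = (X[:, 2] + X[:, 15] > 1).astype(int)   # 1 = landslide, 0 = no landslide

tp = tn = fp = fn = 0
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
for train, test in cv.split(X, y):
    pred = (DecisionTreeClassifier(criterion="gini")
            .fit(X[train], y[train]).predict(X[test]))
    tp += int(np.sum((pred == 1) & (y[test] == 1)))
    fn += int(np.sum((pred == 0) & (y[test] == 1)))
    tn += int(np.sum((pred == 0) & (y[test] == 0)))
    fp += int(np.sum((pred == 1) & (y[test] == 0)))

print("sensitivity:", tp / (tp + fn))      # accuracy on landslide samples
print("specificity:", tn / (tn + fp))      # accuracy on no-landslide samples
print("overall accuracy:", (tp + tn) / (tp + tn + fp + fn))
```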

5. Discussion

Four tree algorithms, CHAID, Exhaustive CHAID, CRT, and QUEST, were applied to map landslide susceptibility hazard. The construction of the four trees is based on the entire sample of 137572 cases, 10-fold cross-validation, a 0.05 significance level for the adjusted probabilities, a minimum of 100 cases in a parent node, a minimum of 50 cases in a child node, and equal misclassification costs. The maximum number of levels is 3 for CHAID and Exhaustive CHAID and 5 for CRT and QUEST.
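For reference, these stopping rules map onto common library parameters; the following is a hedged approximation of the CRT settings in scikit-learn (CHAID and QUEST are not available there, and SPSS's defaults may differ in detail):

```python
from sklearn.tree import DecisionTreeClassifier

# Approximation of the paper's CRT settings: Gini splitting, a depth
# limit of 5 levels, at least 100 cases in a parent (splittable) node,
# and at least 50 cases in each child (leaf) node.
crt = DecisionTreeClassifier(criterion="gini",
                             max_depth=5,
                             min_samples_split=100,
                             min_samples_leaf=50)
```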

The number of nodes, the number of terminal nodes, and the importance of the independent variables produced by each model are presented in Table 1. The resulting classification trees have a total of 317 nodes, of which 254 are terminal nodes, using CHAID; 377 nodes with 302 terminal nodes using Exhaustive CHAID; 43 nodes with 22 terminal nodes using CRT; and 55 nodes with 28 terminal nodes using QUEST. An example of a decision tree built with the CRT method is detailed in Table 2: the tree has 43 nodes, comprising the root node, 20 internal nodes, and 22 leaves (terminal nodes). The percentages in each category and in each joint category are presented in Table 2.

The decision tree methods were also used to analyze the relationships between landslide susceptibility and the related factors. The normalized importance of the factors in classification using CRT is shown in Figure 2. The top-down induction of the decision tree indicates that variables higher in the tree structure are more important for analyzing landslide susceptibility. The tree structure demonstrates that the important variables related to high landslide susceptibility are ordered as follows: slope angle, distance from drainage, surface area, slope aspect, and cross curvature.
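Normalized importance of the kind plotted in Figure 2 can be derived from a fitted tree by scaling each factor's importance against the top-ranked one. A sketch on synthetic data (the factor indices are illustrative, not the paper's factors):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.random((2000, 21))
y = (X[:, 2] + X[:, 15] > 1).astype(int)

tree = DecisionTreeClassifier(criterion="gini", max_depth=5).fit(X, y)

# Importance of each factor, normalized so the top factor scores 100%.
imp = tree.feature_importances_
for rank, j in enumerate(np.argsort(imp)[::-1][:5], start=1):
    print(f"{rank}. factor {j}: {100 * imp[j] / imp.max():.1f}%")
```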

The prediction accuracy produced by each model is presented in Table 3. The results show higher classification accuracy for the Exhaustive CHAID algorithm than for the other algorithms: its prediction accuracy is 82.0%, with a sensitivity of 72.3% and a specificity of 91.7%.

6. Conclusion

This study has analyzed landslide susceptibility in Penang Island, Malaysia, using four decision tree models. We can conclude that the decision tree clearly indicates the order of importance of the variables and quantitatively describes the relationships among the occurrence of landslides, topography, and geology. The decision tree model using the Exhaustive CHAID algorithm showed greater accuracy than the other models, demonstrating the usefulness of decision tree models for landslide hazard mapping. The accuracies were 82.0% for Exhaustive CHAID, 81.9% for CHAID, 75.6% for CRT, and 74.0% for QUEST. In this study, we determined factors that may be involved in landslide susceptibility, and the results can be used for landslide hazard mapping in other regions. Moreover, landslide hazard maps can be used to help mitigate hazards to people and facilities and can serve as basic data for developing plans to prevent landslide hazards, such as in the siting and monitoring of facilities. Further case studies and modeling are needed to better generalize the factors involved in landslide susceptibility.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.