Abstract

AdaBoost is an excellent committee-based tool for classification. However, its effectiveness and efficiency in multiclass categorization are challenged by methods based on support vector machines (SVM), neural networks (NN), naïve Bayes, and k-nearest neighbor (kNN). This paper uses a novel multiclass AdaBoost algorithm that avoids reducing the multiclass classification problem to multiple two-class classification problems. The novel method is more effective and keeps the accuracy advantage of existing AdaBoost. An adaptive group-based kNN method is proposed in this paper to build more accurate weak classifiers and thereby keep the number of basis classifiers within an acceptable range. To further enhance performance, the weak classifiers are combined into a strong classifier through a double iterative weighting scheme, yielding an adaptive group-based kNN boosting algorithm (AGkNN-AdaBoost). We implement AGkNN-AdaBoost in a Chinese text categorization system. Experimental results show that the proposed classification algorithm achieves better precision and recall than many other text categorization methods, including traditional AdaBoost. In addition, its processing speed is significantly higher than that of the original AdaBoost and many other classic categorization algorithms.

1. Introduction

Machine learning- (ML-) based text categorization (TC) can be defined, similarly to other data classification tasks, as the problem of approximating an unknown category assignment function Φ : D × C → {True, False}, where D is the set of all possible documents and C is the set of predefined categories [1].

The approximating function is called a classifier, and the task is to build a classifier that produces results as “close” as possible to the true category assignment function Φ [2], for instance, whether an article belongs to fiction, whether a short message is an advertisement, or whether the author of a script is Shakespeare, and so forth.

In text categorization projects, documents usually need to be preprocessed to select suitable features. The documents are then represented by their features. After these steps, the classifier determines the category of each document. The flow chart of a TC task is shown in Figure 1.

Depending on the task, preprocessing contains some or all of the following aspects: transforming unstructured documents into a structured or semistructured format, word segmentation, and text feature selection. Feature selection is the most important part of preprocessing [3]. Features can be characters, words, phrases, concepts, and so forth [4]. Document representation is the process of using features with different weights to represent texts. The classifier’s kernel is a machine learning algorithm: it takes the document representation as its input and outputs the categorization result.

Imagine an international IT corporation that is interested in job seekers’ Java programming experience and English ability. The resume screening program of this company is actually a TC system; it can assist managers in choosing appropriate employees. The system is shown in Figure 2.

Researchers have made considerable achievements in the design of categorization algorithms because the classifier is the key component of TC systems [5]. Several of the most important methods include naïve Bayes, support vector machines (SVM), k-nearest neighbor (kNN), decision trees (DT), neural networks (NN), and voting-based algorithms such as AdaBoost. Comparative experiments have revealed that SVM, kNN, and AdaBoost have the best precision; naïve Bayes has the worst performance but is very useful as a baseline classifier because of its ease of use; and the performance of DT and NN is worse than that of the top three methods, but their computational complexity is also lower [6–8].

In a word, the purpose of classifier design and research in TC is to improve the performance and maintain the balance between performance and cost.

The rest of this paper is organized as follows. Section 2 reviews related work and states the goal of this paper. Section 3 improves the classic kNN algorithm and builds weak classifiers based on it. In Section 4, a double iterative weighted cascading algorithm is proposed to construct a strong classifier. Section 5 then modifies AdaBoost based on Sections 3 and 4 to solve multiclass problems. The application of the novel classification algorithm is presented and analyzed in Section 6. Finally, Section 7 summarizes the paper.

2. Related Work

Voting-based categorization algorithms, also known as classifier committees, can adjust the number and professional level of the “experts” in the committee to find a balance between performance and computational cost. These algorithms give up the effort to build a single powerful classifier and instead try to integrate the views of many weak classifiers. The philosophical principle of this methodology is that the truth is usually held by the majority. Bagging and boosting are the two most popular kinds of voting-based methods.

2.1. Boosting Algorithm

Unlike bagging, which trains the classifiers in parallel, boosting trains the classifiers sequentially. Before training the next classifier, the training set is reweighted to allocate greater weight to the documents that were misclassified by the previous classifiers [9]. Therefore, the system can pay more attention to controversial texts and enhance the precision.

The original boosting algorithm uses three weak classifiers to form a committee. It randomly divides a large training set into three parts S1, S2, and S3 and first uses S1 to train the classifier C1. It then uses the subset of S2 that is misclassified by C1 together with the subset that is categorized correctly by C1 as the training set of C2. The rest is done in the same manner.

Scholars have been committed to enhancing the performance and reducing the overhead, so many improved boosting algorithms such as BrownBoost, LPBoost, LogitBoost, and AdaBoost have been proposed. Most comparative studies show that AdaBoost has the best performance among them [10].

2.2. Detail of AdaBoost

Boosting and its related algorithms have achieved great success in several applications such as image processing, audio classification, and optical character recognition (OCR). At the same time, boosting needs huge training sets, and thus the runtime consumption sometimes becomes unacceptable. Moreover, a lower limit on the weak classifiers’ accuracy needs to be predicted.

To control the computational cost within a reasonable range, Schapire and Singer [11] proposed AdaBoost. It uses a dual-weighted process to choose training sets and classifiers. The detailed steps of AdaBoost are as follows.
(1) Given a training set {(x_1, y_1), ..., (x_n, y_n)}, where x_i is the ith training sample and y_i ∈ {−1, +1} denotes x_i's category label.
(2) Let x_i^j denote the jth feature of the ith document.
(3) Define the initial distribution of documents in the training set: D_1(i) = 1/n.
(4) Search for weak classifiers h_j(x): for the jth feature of every sample, a weak classifier can be obtained by finding the threshold θ_j and orientation p_j ∈ {−1, +1} that minimize the weighted error ε_j = Σ_i D_t(i)[h_j(x_i) ≠ y_i]. The weak classifier is therefore h_j(x) = +1 if p_j x^j < p_j θ_j and h_j(x) = −1 otherwise.
(5) Choose, from the whole feature space, the classifier h_t with the minimal error ε_t as the weak classifier of round t.
(6) Recalculate the weights of the samples: D_{t+1}(i) = D_t(i) exp(−α_t y_i h_t(x_i)) / Z_t, where Z_t is a normalization factor which makes Σ_i D_{t+1}(i) = 1 and α_t = (1/2) ln((1 − ε_t)/ε_t) is the weight of h_t.
(7) Repeat the steps above T times to obtain T optimal weak classifiers with different weights.
(8) Combine the weak classifiers according to their weights to construct a strong classifier: H(x) = sign(Σ_{t=1}^{T} α_t h_t(x)).

The algorithm above enhances training set utilization by adjusting the weights of misclassified texts [12]. In addition, the performance of the strong classifier is improved because it is constructed in a weighted way [13]. In a word, AdaBoost has lower training consumption and higher accuracy than the original boosting algorithm.
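For concreteness, the following Python sketch illustrates the decision-stump AdaBoost loop summarized in steps (1)-(8) above; it is an illustration rather than the original implementation, and it assumes a numeric feature matrix X and labels y in {−1, +1} as numpy arrays, with T training rounds.

import numpy as np

def adaboost_stumps(X, y, T):
    """Train T decision stumps with AdaBoost; y must contain -1/+1 labels."""
    n, d = X.shape
    D = np.full(n, 1.0 / n)          # step (3): initial sample distribution
    stumps, alphas = [], []
    for _ in range(T):
        best = None
        # steps (4)-(5): search every feature/threshold/orientation for the
        # stump with minimal weighted error under the current distribution D
        for j in range(d):
            for theta in np.unique(X[:, j]):
                for p in (+1, -1):
                    pred = np.where(p * X[:, j] < p * theta, 1, -1)
                    err = np.sum(D[pred != y])
                    if best is None or err < best[0]:
                        best = (err, j, theta, p, pred)
        err, j, theta, p, pred = best
        alpha = 0.5 * np.log((1.0 - err) / max(err, 1e-12))   # classifier weight
        D *= np.exp(-alpha * y * pred)                        # step (6): reweight samples
        D /= D.sum()                                          # normalization factor Z_t
        stumps.append((j, theta, p))
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(X, stumps, alphas):
    # step (8): weighted vote of the weak classifiers
    score = np.zeros(X.shape[0])
    for (j, theta, p), alpha in zip(stumps, alphas):
        score += alpha * np.where(p * X[:, j] < p * theta, 1, -1)
    return np.sign(score)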

Researchers have proposed variants of AdaBoost focusing on different aspects such as precision, recall, robustness, computational overhead, and multiclass categorization [14]. We call these algorithms the AdaBoost family. The three most important indicators are precision, efficiency, and multiclass categorization ability. The performance of AdaBoost family members is shown in Figure 3 [15].

2.3. Problems of AdaBoost Family and Motivation of This Article

Figure 3 reveals that few algorithms in the AdaBoost family can achieve high precision and high efficiency at the same time, especially in multiclass categorization problems. Unfortunately, multiclass classification is the main problem in categorization tasks. Traditional methods that translate the multiclass problem into multiple two-class problems reduce accuracy and increase the complexity of the system [16].

To solve the problems above, we design weak classifiers with high accuracy and low complexity to limit the number of experts and thus keep the precision while reducing the consumption. A more professional expert should play a more important role, and misclassified documents should attract greater attention, to further improve the system’s performance. Therefore, more reasonable rules should be made to combine the weak classifiers into a strong classifier efficiently. In addition, this strong classifier should be usable in multiclass classification tasks directly. These points form the motivation and purpose of this paper.

3. Weak Classifiers with AGkNN

Theoretically, once the weak classifiers are more accurate than random guessing (1/2 in two-class tasks or 1/m in m-class tasks), AdaBoost can integrate them into a strong classifier whose precision can come arbitrarily close to the true category distribution [17]. However, when the precision of the weak classifiers is lower, more weak classifiers are needed to construct a strong classifier. Too many weak classifiers sometimes increase the system’s complexity and computational consumption to an intolerable level. Categorization systems that use naïve Bayes or C4.5 as their weak classifiers may face this problem.

Some researchers have tried to design weak classifiers based on more powerful algorithms such as neural networks [18] and support vector machines [19]. These algorithms can certainly achieve higher accuracy but lead to new problems because they are overly complex and thus contrary to the ideology of boosting.

3.1. k-Nearest Neighbor

Example-based classification algorithms keep a balance between performance and cost [20]. k-nearest neighbor is the most popular example-based algorithm as it has higher precision and lower complexity.

To make a classification, kNN transforms the target document into a representational feature vector that has the same form as the training samples. It then calculates the distance between the target document and the selected neighbors [21]. Finally, the category of the target document is determined according to its neighbors’ classes. The schematic of two-class kNN is shown in Figure 4.

The distance between two documents d_i and d_j is calculated by a distance function:

dist(d_i, d_j) = √( Σ_{m=1}^{M} (w_{im} − w_{jm})^2 ),

where w_{im} is the weight of the mth feature of document d_i. As shown, this function calculates the Euclidean distance between two documents in a linear space. Choosing the k nearest neighbors as the reference documents, the category that contains the most neighbors can be found as

c(d) = arg max_{c_j} Σ_{d_i ∈ kNN(d)} sim(d, d_i) P(c_j | d_i),

where d_i is the ith training document, sim(d, d_i) is the similarity of document d and document d_i, and P(c_j | d_i) represents the probability that document d_i belongs to category c_j.
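As a minimal illustration of this decision rule (not taken from the paper's implementation), the sketch below classifies a document vector by a similarity-weighted vote of its k nearest neighbors; the Euclidean distance is used as above, and taking the similarity as 1/(1 + distance) is an assumption of this sketch. train_X, train_y, and k are assumed numpy inputs.

import numpy as np
from collections import defaultdict

def knn_classify(doc, train_X, train_y, k):
    """Similarity-weighted kNN vote; similarity taken here as 1/(1 + distance)."""
    dists = np.sqrt(((train_X - doc) ** 2).sum(axis=1))   # Euclidean distance to all samples
    nearest = np.argsort(dists)[:k]                       # indices of the k nearest neighbors
    votes = defaultdict(float)
    for i in nearest:
        votes[train_y[i]] += 1.0 / (1.0 + dists[i])       # weight each vote by similarity
    return max(votes, key=votes.get)                      # category with the largest score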

3.2. Adaptive Group-Based kNN Categorization

Two main problems in traditional kNN are experience dependence and sample category imbalance. Experience dependence means that k is an empirical value that needs to be preset [22]. Sample category imbalance means that when the numbers of samples belonging to different categories differ greatly, the classification results tend to be inaccurate. In other words, the system expects the category distribution of the samples to be as even as possible.

An adaptive group-based kNN (AGkNN) algorithm is proposed in this paper to solve the problems above. The basic idea of AGkNN is to divide the training set into multiple groups, use kNN to categorize the target document in parallel in each group with a random initial value of k, and then compare the classification results. If the results of the different groups are broadly consistent with each other, the group size and the value of k are kept. When the results are highly similar to each other, groups are merged and the value of k is reduced. If the results are totally different from each other, the value of k is increased. In particular, when two opposing views are evenly balanced, the system should increase the number of groups to make the final decision.

Define G_i as the number of training groups when processing the ith document. Give every category a number as its class name; for example, define a news report set {financial news, sports news, entertainment news, political news, weather report} as {1, 2, 3, 4, 5}. Let r_{ij} be the categorization result of the ith document determined by the jth group (for example, r_{i4} = 5 means that, in the fourth group's opinion, the document being classified is a weather report), and let r̄_i be the average value of the different categories calculated by feature distance in the different groups. For instance, if a document is classified as "sports news, sports news, entertainment news, entertainment news, and sports news" in 5 training groups, the average value of the categories is (2 + 2 + 3 + 3 + 2)/5 = 2.4. The number of groups is then adaptively adjusted by comparing the variance of the group results r_{i1}, ..., r_{iG_i} against a lower and an upper threshold, as formalized in (3.3).

According to (3.3), the system determines whether and how to adjust the grouping of the samples by referring to the variance of the classification results given by the different groups. When the variance of the results is higher than the upper threshold, the categorization is not accurate enough because disagreement exists, and more groups are needed to make the final decision. On the other hand, when the variance is very low, there is almost no dispute in the classification, and the sample groups can be merged to save time. In this paper, thresholds derived from the average category value r̄_i are used as the lower and upper bounds because the average value of the categories is empirically suitable and convenient to use as a threshold.

The value of k can also be updated adaptively from its random initial value k_0. The system tests whether the random initial k is suitable: it judges, from the variance of the group results, whether the majority of the group classifiers reach agreement, infers whether the categorization result is precise enough, and adjusts the value of k accordingly to obtain a more accurate result.

The detailed working steps are shown in Figure 5.
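The following Python sketch is one possible reading of this procedure, reusing the knn_classify function from the earlier sketch. The concrete variance thresholds (var_low, var_high), the step sizes for adjusting k and the group count, and the bound max_rounds are assumptions of this sketch: the paper specifies the adaptation only qualitatively.

import numpy as np

def agknn_classify(doc, train_X, train_y, k0, n_groups,
                   var_low, var_high, max_rounds=5):
    """One AGkNN decision: vote over random groups, then adapt k and the group count.

    var_low / var_high play the role of the lower and upper variance thresholds
    described in the text; their exact values are not specified here.
    Category labels are assumed to be the numeric class names defined above.
    """
    k = k0
    for _ in range(max_rounds):
        # split the training set into n_groups random groups
        idx = np.random.permutation(len(train_y))
        groups = np.array_split(idx, n_groups)
        results = np.array([
            knn_classify(doc, train_X[g], train_y[g], min(k, len(g)))
            for g in groups
        ])
        var = results.var()
        if var <= var_low:                     # groups highly similar: merge groups, shrink k
            n_groups = max(2, n_groups - 1)
            k = max(1, k - 1)
            break
        if var <= var_high:                    # broadly consistent: keep current settings
            break
        k += 1                                 # strong disagreement: enlarge k
        _, counts = np.unique(results, return_counts=True)
        top = np.sort(counts)[::-1]
        if len(top) > 1 and top[0] == top[1]:  # two opposing views tied: add a group
            n_groups += 1
    vals, counts = np.unique(results, return_counts=True)
    return vals[np.argmax(counts)], k, n_groups   # majority label plus updated parameters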

Through this procedure, the algorithm can set the value of k adaptively and make full use of the training set, because the training set is grouped automatically and the value of k is initialized randomly. Furthermore, the system can adjust the number of groups and reference neighbors adaptively by calculating and updating the variance of the categorization results given by the different groups in real time. No case is left unhandled in the algorithm, and its core is still the kNN algorithm, whose convergence was proved in [23], so AGkNN converges. The runtime complexity of computing the variance of G_i elements is O(G_i), so the computational complexity of classifying one document with this algorithm, given in (3.5), grows linearly with the number of groups and the size of the training set.

Therefore, the main problems that have limited the effectiveness of the original kNN for a long time are eliminated in AGkNN. Solving the experience-dependence problem means that the algorithm can achieve higher efficiency and robustness; overcoming the drawback of requiring category balance means that the system can improve its accuracy. In summary, AGkNN has better performance and lower complexity than classic kNN, which makes it well suited as the weak classifier in AdaBoost.

3.3. Generating Weak Classifiers

Weak classifier design is critical for differentiating positive samples and negative samples in the training set. The precision of the weak classifiers must be better than 50% to ensure the convergence of the strong classifier. Therefore, a threshold θ needs to be set for the weak classifiers to guarantee the system performance when combining them into a more powerful strong classifier [24].

Define the weight of a positive document as w_p, the weight of a negative document as w_n, the positive threshold as θ_p, and the negative threshold as θ_n. The threshold θ is then calculated from these quantities.

The accuracy of the weak classifiers can be maintained above 0.5 by introducing and updating the threshold θ. Therefore, weak classifiers based on AGkNN can be generated by the following steps; a sketch of this acceptance loop is given after the list.
(1) Calculate the threshold θ.
(2) Call AGkNN for categorization.
(3) Calculate the classification error ε.
(4) Compare ε with 0.5: if ε ≥ 0.5, randomly reinitialize and go to step (2); otherwise, continue.
(5) Update the threshold θ.
(6) End.
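The sketch below shows only the shape of this acceptance loop. compute_threshold and agknn are hypothetical placeholders for the paper-specific threshold rule and the AGkNN classifier; the retry logic relies on the fact that AGkNN is randomized (random grouping and random initial k), so retraining can yield a different classifier.

def generate_weak_classifier(train_X, train_y, val_X, val_y,
                             compute_threshold, agknn, max_tries=10):
    """Accept an AGkNN-based weak classifier only if its error stays below 0.5."""
    theta = compute_threshold(train_y)                     # step (1): initial threshold
    for _ in range(max_tries):
        clf = agknn(train_X, train_y, theta)               # step (2): call AGkNN
        preds = [clf(x) for x in val_X]
        err = sum(p != y for p, y in zip(preds, val_y)) / len(val_y)   # step (3)
        if err < 0.5:                                      # step (4): accurate enough
            theta = compute_threshold(train_y)             # step (5): update the threshold
            return clf, err
        # otherwise re-draw / retrain and go back to step (2)
    raise RuntimeError("could not obtain a weak classifier with error < 0.5")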

4. DIWC Algorithm: A Tool for Constructing Strong Classifier

Whether the strong classifier performs well depends largely on how the weak classifiers are combined. To build a powerful strong classifier, the basis classifiers that have higher precision must take more responsibility in the categorization process. Therefore, the categorization system should distinguish between the performances of the weak classifiers and give them different weights according to their capabilities. Using these weights, boosting algorithms can integrate the weak classifiers into the strong classifier in a more efficient way and achieve excellent performance [25].

4.1. Review Weighting Mechanism in Original AdaBoost

The original AdaBoost algorithm uses a linear weighting scheme to generate the strong classifier. In AdaBoost, the strong classifier is defined as

H(x) = sign( Σ_{t=1}^{T} α_t h_t(x) ),

where h_t(x) is a basis classifier, α_t is a coefficient, and H(x) is the final strong classifier.

Given the training documents and category labels (x_1, y_1), ..., (x_n, y_n), with x_i ∈ X and y_i ∈ {−1, +1}, the strong classifier can be constructed as follows [26].

(1) Initialize the weights w_i^{(1)} = 1/n for i = 1, ..., n.
(2) Select a weak classifier h_t with the smallest weighted error ε_t = Σ_i w_i^{(t)} [h_t(x_i) ≠ y_i], where ε_t is the error rate.
(3) Prerequisite: ε_t < 1/2, otherwise stop.
(4) The training error is upper bounded by Π_t Z_t, where Z_t is a normalization factor. Select α_t to greedily minimize Z_t in each step; optimizing gives α_t = (1/2) ln((1 − ε_t)/ε_t) under the constraint that the weights remain normalized.
(5) Reweight as w_i^{(t+1)} = w_i^{(t)} exp(−α_t y_i h_t(x_i)) / Z_t.
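The greedy choice of α_t in step (4) follows from minimizing the normalization factor; a standard derivation (not specific to this paper, stated for normalized weights and h_t(x_i) ∈ {−1, +1}) is:

Z_t(\alpha_t) = \sum_i w_i^{(t)} e^{-\alpha_t y_i h_t(x_i)}
             = (1 - \varepsilon_t)\, e^{-\alpha_t} + \varepsilon_t\, e^{\alpha_t},
\qquad
\frac{dZ_t}{d\alpha_t} = 0
\;\Rightarrow\;
\alpha_t = \frac{1}{2} \ln \frac{1 - \varepsilon_t}{\varepsilon_t}.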

These steps demonstrate that AdaBoost automatically gives higher weights to the classifiers that have better classification performance, especially through the reweighting step. In this way, AdaBoost can be implemented simply, its feature selection works over a large set of features, and it has good generalization ability.

However, this weighting algorithm does not check the precision of the earlier classifiers using the later training documents. In other words, strong classifier generation is a single iterative process. Weak classifiers probably have different performances on different training samples. The weak classifiers that AdaBoost considers worthy of higher weights actually perform better on the earlier part of the training set, while basis classifiers that perform well on the later part may be ignored unreasonably. Therefore, the credibility of the weights decreases along the test sequence. This phenomenon can be called weight bias. Weight bias can lead to suboptimal solutions and make the system oversensitive to noise. Accuracy is affected by these problems, and the robustness of the system is decreased.

To overcome these drawbacks, the boosting algorithm should use a double iterative process to allocate weights to the basis classifiers more reasonably.

4.2. Double Iterative Weighted Cascading Algorithm

In AdaBoost, weak classifiers with higher weights can certainly process correctly the documents that were misclassified by classifiers with lower weights. This is important but not enough to improve categorization performance; two crucial questions are ignored.
(1) Can basis classifiers with higher weights classify, with high accuracy, the samples that are already correctly categorized by classifiers with lower weights?
(2) Do weak classifiers with lower weights really lack the ability to process documents that are misclassified by high-weight classifiers?

The credibility of the weights is reduced when the answers to these two questions are not absolutely positive. Therefore, it is worth introducing the aforementioned questions into the allocation of the weak classifiers’ weights.

This paper proposes a double iterative weighted cascading (DIWC) algorithm to solve the two problems above and make the utilization of the basis classifiers more efficient. The kernel idea of DIWC is to add a second weighting pass in which the training samples are fed in reverse order. Compared with the original AdaBoost algorithm, this process can be called double iterative. The average weight of a basis classifier calculated in the two weighting passes is used as its final weight. Introducing this average weight to replace the weight used in traditional AdaBoost avoids the weight bias problem because it takes the two questions above into account: it defines “powerful” for a basis classifier by using not only the earlier part but the full set of training samples. A sketchy procedure chart of DIWC is shown in Figure 6.

DIWC achieves the weighting procedure shown in Figure 6 through the following steps (a compact code sketch follows the list).
(1) Start: initialize the document weights and the weak classifier weights.
(2) Train the first classifier C_1 with the first document subset S_1 and mark the set of documents in S_1 that C_1 misclassifies as E_1.
(3) Loop: train C_i with S_i and E_{i−1}.
(4) Calculation: calculate the weights ω_{i,1} of the basis classifiers according to the first round of loops (trainings).
(5) Reverse iteration: repeat the training with the classifiers and their subsets taken in reverse order.
(6) Loop: continue training the classifiers with the reversed subsets and the corresponding error sets.
(7) Calculation: calculate the weights ω_{i,2} of the basis classifiers according to the second round of loops (trainings).
(8) Calculate the final weight ω_i of each basis classifier from the results of steps (4) and (7).
(9) Cascade: combine the basis classifiers according to their final weights and construct the strong classifier.
(10) End.
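A compact sketch of the two weighting passes might look like the following. evaluate is an assumed helper (not the paper's routine) that tests one classifier on a document set and returns its error set together with a weight contribution; the interpretation that each classifier keeps its own subset while the processing order and error-set propagation are reversed in the second pass is likewise an assumption.

def diwc_pass(classifiers, subsets, evaluate):
    """One weighting pass of DIWC over the classifiers/subsets in the given order.

    evaluate(clf, docs) is assumed to return (error_set, weight); each classifier
    also sees the error set left by its predecessor, as in the step list above.
    """
    weights, carry_errors = [], []
    for clf, subset in zip(classifiers, subsets):
        errors, weight = evaluate(clf, list(subset) + list(carry_errors))
        weights.append(weight)
        carry_errors = errors          # misclassified documents feed the next classifier
    return weights

def diwc_two_pass(classifiers, subsets, evaluate):
    w_first = diwc_pass(classifiers, subsets, evaluate)            # forward order
    w_second = diwc_pass(list(reversed(classifiers)),              # reverse order
                         list(reversed(subsets)), evaluate)[::-1]  # re-align with w_first
    return w_first, w_second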

Three methods with different accuracy and complexity can be used to calculate the final weights.

The first method is quite simple: calculate the arithmetic mean of weights in two iterative loops and use it as weak classifiers’ final weights. This method has a very low computational cost. In this paper, it is called DIWC-1.

Note that some basis classifiers may have a very high weight in both the first and the second rounds of loops. This means that these classifiers have globally high categorization ability and should play a more important role in the classification process instead of simply receiving the average weight. In this case, an upper bound value is set as the final weight of such significantly powerful classifiers. On the other hand, some classifiers may have a very low weight in both iterative loops; the utility of these classifiers must be limited by using a lower bound value to enhance the system’s accuracy. This method spends more time on computing but has higher precision. It is called DIWC-2.

The third method concerns the situation in which some weak classifiers have a very high weight in one round of loops but a very low weight in the other. One more iterative pass is then needed to determine the final weight. In particular, if the variance of the weights over the three rounds is significantly large, the system considers the weak classifier oversensitive to noise and reduces its weight. This method achieves the best precision and robustness, but its training consumption is also the highest. We call it DIWC-3 in this paper.
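The three combination rules can be sketched as below. The thresholds and bounds correspond to the quantities defined with Algorithm 1 and must be supplied by the caller for DIWC-2 and DIWC-3; the specific weight penalty used here for the noise-sensitive case in DIWC-3 is an illustrative choice, not the paper's exact rule.

def combine_weights(w1, w2, mode="DIWC-1",
                    theta_up=None, theta_low=None, b_up=None, b_low=None,
                    theta_diff=None, w3=None):
    """Merge the weights of the two (or three) iterative loops into final weights."""
    final = []
    for i, (a, b) in enumerate(zip(w1, w2)):
        avg = (a + b) / 2.0                        # DIWC-1: plain arithmetic mean
        if mode == "DIWC-1":
            final.append(avg)
        elif mode == "DIWC-2":
            if a > theta_up and b > theta_up:      # globally strong classifier
                final.append(b_up)
            elif a < theta_low and b < theta_low:  # globally weak classifier
                final.append(b_low)
            else:
                final.append(avg)
        else:                                      # DIWC-3
            if abs(a - b) > theta_diff and w3 is not None:
                # large disagreement: consult the extra pass; if the three weights
                # spread widely, treat the classifier as noise sensitive and shrink it
                trio = [a, b, w3[i]]
                spread = max(trio) - min(trio)
                final.append(sum(trio) / 3.0 - spread / 3.0)
            elif a > theta_up and b > theta_up:
                final.append(b_up)
            elif a < theta_low and b < theta_low:
                final.append(b_low)
            else:
                final.append(avg)
    return final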

The computational complexity of DIWC-1, DIWC-2, and DIWC-3 can be estimated by referring to (3.5). Let N be the number of documents to be classified. The runtime complexity of DIWC-1 is then simply linear in N and in the per-document cost of AGkNN.

In DIWC-2, the weights of the two iterative passes are compared; an upper bound is introduced when a classifier has a very high weight in both the first and the second rounds of loops, and a lower bound is introduced when a classifier has a very low weight in both rounds. Because not every basis classifier needs an upper or lower bound, and introducing bounds adds extra computation, the runtime complexity ranges between that of DIWC-1 and DIWC-1 plus the cost of the bound adjustments.

DIWC-3 considers not only the upper and lower bounds but also the difference between the weights in the two iterative loops. When the weights determined in the two loops differ greatly, a third loop may be needed for the final decision. Similar to DIWC-2, its runtime complexity ranges between that of DIWC-2 and DIWC-2 plus the cost of the extra loop.

As the analysis above shows, the computational complexity is proportional to N, G, and k; when the number of classification objects increases, the time consumption increases linearly. Therefore, the algorithms avoid the problem of exponential explosion and have an acceptable runtime complexity. In addition, the algorithms converge because no case is left unhandled and the values of the weights are finite.

4.3. Using Training Sets More Efficiently

As reviewed in Section 4.1, traditional AdaBoost gives higher importance to documents that were misclassified by earlier weak classifiers in order to improve the system’s ability to categorize “difficult” documents. This idea helps AdaBoost achieve better precision than earlier boosting algorithms. However, AdaBoost still leaves room for improving the efficiency of using the training documents.

Actually, all the training documents that are categorized incorrectly should be gathered into an error set, and this set should be used to train every basis classifier. Accuracy is further improved by using the training documents in this way, and the implementation of this method is quite convenient. Integrating it with DIWC-1, DIWC-2, and DIWC-3 constructs the complete double iterative weighted cascading algorithm. The pseudocode of DIWC is shown in Algorithm 1, where E_i is the error set of the ith basis classifier, ω_{i,1} is the weight of the ith classifier in the first iterative loop, ω_{i,2} is the weight of the ith classifier in the second iterative loop, θ_d^low is the lower threshold of the difference between ω_{i,1} and ω_{i,2}, θ_w^up is the upper threshold of the weight, B_up is the upper bound, θ_w^low is the lower threshold of the weight, B_low is the lower bound, θ_d^up is the upper threshold of the difference between ω_{i,1} and ω_{i,2}, and ω_i is the final weight of the ith classifier.

Input: training set S = {S_1, ..., S_T} and weak
     classifiers C_1, ..., C_T
Output: strong classifier H
1   begin
2     test C_1 with S_1
3     for (i = 2; i ≤ T; i++)
4       for (j = 1; j < i; j++)
5         test C_i with S_i, and E_j
6     calculate ω_{1,1}, ..., ω_{T,1}
7     test C_T with S_T and E_T
8     for (i = T − 1; i ≥ 1; i−−)
9       for (j = T; j > i; j−−)
10        test C_i with S_i and E_j
11      calculate ω_{1,2}, ..., ω_{T,2}
12      for (i = 1; i ≤ T; i++)
13        if (ω_{i,1} > θ_w^up and ω_{i,2} > θ_w^up)
14          ω_i ← B_up
15        else if (ω_{i,1} < θ_w^low and ω_{i,2} < θ_w^low)
16          ω_i ← B_low
17        else if (|ω_{i,1} − ω_{i,2}| ≤ θ_d^low)
18          ω_i ← (ω_{i,1} + ω_{i,2}) / 2
19        else
20          mark C_i for a third pass
21      while (some C_i is marked and |ω_{i,1} − ω_{i,2}| > θ_d^up)
22        goto 2
23      H ← cascade of C_1, ..., C_T weighted by ω_1, ..., ω_T
24    end

5. Multiclass Classification

Most members of the AdaBoost family are oriented toward two-class classification tasks. When solving a multiclass problem, they often transform it into multiple two-class problems. Such algorithms tend to lose accuracy or efficiency and find it difficult to achieve good results when facing multiclass categorization tasks. However, multiclass classification is a main problem in categorization tasks: in many situations, a simple two-valued yes/no decision cannot satisfy the requirements of the task. For instance, a news report may belong to politics, economics, sports, culture, new scientific discoveries, or entertainment. In other words, processing multiclass classification tasks with higher performance should be the most important purpose of boosting-based algorithm development.

5.1. kNN in Multiclass Classification

As the kernel algorithm of the weak classifiers, k-nearest neighbor has a natural advantage in solving multiclass problems. The mathematical expression of kNN is the decision rule given in Section 3.1,

c(d) = arg max_{c_j} Σ_{d_i ∈ kNN(d)} sim(d, d_i) P(c_j | d_i).

The above function reveals that the kNN algorithm can easily be used in multiclass classification problems because, unlike other categorization algorithms, kNN does not divide the problem into two subspaces or two subparts but records the class label directly. Therefore, it does not need much modification to handle multiclass categorization problems.

Traditional text categorization research often uses the Euclidean distance or the Manhattan distance to measure the similarity between samples. However, when facing multiclass categorization problems, these distance definitions cannot distinguish the importance of different feature weights effectively [27]. To solve this problem, the Mahalanobis distance is used in this paper. Let S be the covariance matrix of the sample feature vectors; the distance between vectors x and y is then defined as

d_M(x, y) = √( (x − y)^T S^{−1} (x − y) ).

In this way, the importance of different feature weights can be distinguished effectively [28]. Because kNN can easily be used in multiclass situations, we can construct the strong classifier without big modifications of the basis classifier itself.
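A small numpy sketch of the Mahalanobis distance defined above is given below. Estimating the covariance matrix from the training vectors and guarding against singular covariance with a pseudo-inverse (common for sparse text features) are implementation choices of this sketch, not requirements stated in the paper.

import numpy as np

def mahalanobis(x, y, train_X):
    """Mahalanobis distance between two feature vectors given the training data."""
    cov = np.cov(train_X, rowvar=False)            # feature covariance matrix S
    cov_inv = np.linalg.pinv(cov)                  # pseudo-inverse handles singular S
    diff = np.asarray(x) - np.asarray(y)
    return float(np.sqrt(diff @ cov_inv @ diff))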

5.2. Integrating Strong Classifiers

According to the analysis in the former subsection, the weak classifiers in this paper are easily used in multiclass classification problems. However, the performance can be further improved by changing the way the strong classifiers are used.

In the AdaBoost family, a strong classifier tends to be used either directly to solve a two-class problem or independently after dividing a multiclass problem into several two-class problems. This is probably the simplest way but certainly not the best, because the accuracy of two-class categorization cannot be further enhanced in the strong classification step and the complexity of the multiclass categorization problem cannot be constrained efficiently.

Several strong classifiers can work together to solve the problems above. In this paper, we propose a strong classifier cascading method to further improve the precision and limit the consumption in multiclass classification tasks.

The method of integrating strong classifiers can be explained clearly with an example. For instance, we can use four strong classifiers in series to determine which category a document belongs to. When they make the same judgment, it is used as the final result. When they produce different results, the principle that the minority is subordinate to the majority can be used. In particular, when two different determinations are tied, a reclassification process is needed to obtain the final result. The working logic of this method is shown in Figure 7.
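As an illustration of this voting logic (with function and parameter names of our own choosing), the following sketch queries several strong classifiers and resolves ties through a reclassification callback; how the callback reclassifies is left open, as in the text.

from collections import Counter

def vote_strong_classifiers(doc, strong_classifiers, reclassify):
    """Majority vote over several strong classifiers; a tie triggers reclassification."""
    results = [clf(doc) for clf in strong_classifiers]
    counts = Counter(results).most_common()
    if len(counts) == 1:                           # unanimous decision
        return counts[0][0]
    if counts[0][1] == counts[1][1]:               # two determinations tied: reclassify
        return reclassify(doc)
    return counts[0][0]                            # otherwise the majority wins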

Using the method of integrating strong classifiers in series can improve the classification accuracy because the Cramer-Rao bound is lower in this situation [29]. The derivation and definition of the original Cramer-Rao bound contain many integral functions and are thus very complex, so we use the modified Cramer-Rao bound (MCRB) in this paper [30]:

MCRB(θ) = 1 / E_{x,u}[ ( ∂ ln p(x | u, θ) / ∂θ )^2 ],

where p(x | u, θ) is the conditional probability of x when given the variables u and θ. Reference [31] proved that, in text categorization, the averaged result of multiple classifiers has a lower MCRB than the result of a single classifier. Therefore, the system’s precision can be improved by this method. However, feeding documents to strong classifiers in series significantly extends the categorization time. To save processing time, the strong classifiers can be combined in parallel, but in this way the computational consumption increases. To keep the balance between time and computational consumption when implementing the strong classifier integration method in real systems, users should decide whether to combine the classifiers in series or in parallel according to the size of the document collection, the mutual information (MI) of the different categories, the hardware capability, and the time consumption tolerance of the particular system and project.

6. Application and Analysis

The novel text categorization tool of this paper, the adaptive group nearest neighbor-based AdaBoost (AGkNN-AdaBoost) algorithm, is fully described in the former sections. To evaluate its performance, we tested its training time in Matlab with different submodules and parameters. We also measured the time consumption of other algorithms and made comparisons to analyze whether and why AGkNN-AdaBoost is better than many other tools, which parts contribute to its efficiency advantage over some algorithms, and which mechanisms make it spend more training time than others.

A text categorization system based on AGkNN-AdaBoost was implemented, and plenty of standard corpus texts were used to measure its precision, recall, F1-measure, and so forth with different submodules and different initial parameters. We compared AGkNN-AdaBoost’s performance not only with the AdaBoost family algorithms but also with other classic classification algorithms such as SVM, decision tree, neural networks, and naïve Bayes. We analyzed all the data carefully and took the time consumption into account to reach our final conclusion about the novel tool’s performance.

6.1. Experiment for Time Consumption Analysis

In the training step, we use a document set in which each document is roughly 2 KB and selected from standard corpora to test the time needed for modeling with AGkNN-AdaBoost. The relationship between the number of documents and the time consumption under different parameters and the random mode is shown in Figure 8.

In Figure 8, we test the modeling time with training sets containing 10, 50, 200, 1000, 5000, and 20000 documents. We select three fixed values of k and a random k as the number of reference neighbors. In each situation, we set the number of groups to 4, 5, and 6 to evaluate the novel tool’s performance with different parameters. In this test step, a stochastic strategy is used for strong classifier generation; that is, the system uses DIWC-1, DIWC-2, or DIWC-3 at random. For ease of viewing, the logarithms of the document counts are used to draw the curves.

As shown in the chart above, the time consumption increases when the number of neighbors or groups increases. Note that logarithmic coordinates are used in the figure, so the time consumption increases significantly with the changes of k and the number of groups. Therefore, adjusting the number of neighbors and groups adaptively is of great significance for improving the system’s efficiency while guaranteeing the performance. In random mode, the system’s training time lies between those of the fixed-k modes; whether its efficiency is higher than that of a given fixed-k mode mostly depends on the number of groups, the number of documents, the size of each document, and the document types.

To compare AGkNN-AdaBoost, whose strong classifier is based on DIWC-1, DIWC-2, or DIWC-3, with other classic categorization algorithms, we designed experimental control groups including our novel algorithm, SVM, neural networks, and naïve Bayes. Similar to the former part, five thousand documents (each about 2 KB) downloaded from standard corpora were used for the comparison. The result is shown in Figure 9.

The time consumption changes with the parameters and with the way the strong classifiers are combined. The training times of DIWC-1, DIWC-2, and DIWC-3 with different sizes of the training set are shown in Figure 9. The red dashed line represents DIWC-1, the blue dash-dotted line represents DIWC-2, and the brown dotted line represents DIWC-3.

Figure 9 reveals that the time consumption increases quickly as the training set becomes larger. In addition, more training time is needed when a more complex way of integrating strong classifiers is used. As shown above, even when the system combines strong classifiers with DIWC-3, it needs less training time than classification tools based on SVM or neural networks. However, all three variants of AGkNN-AdaBoost need longer training time than naïve Bayes.

AdaBoost is a big algorithm family. We choose the most classic and most efficient algorithms, the original AdaBoost, AdaBoost.M1, AdaBoost.MR, and AdaBoost.ECC, to evaluate the runtime complexity level of the novel algorithm proposed in this paper. We used the same training set as in the former experiment, and the result is shown in Figure 10.

It is clearly shown in Figure 10 that AGkNN-AdaBoost has higher efficiency than the original AdaBoost and AdaBoost.M1. Moreover, AGkNN-AdaBoost using DIWC-1, DIWC-2, or DIWC-3 as its strong classifier construction strategy spends training time equal to or less than that of AdaBoost.MR and AdaBoost.ECC, the efficiency leaders of the AdaBoost family [32]. That is because the adaptive grouping mechanism can better fit the characteristics of different training sets during grouping and selection. Compared with an experience-dependent k, the adaptive k has higher efficiency because it can update its value in real time according to the variation of the situation, such as document size, document type, and feature sparsity. In addition, the strategy the strong classifiers use, solving the multiclass problem directly instead of transforming it into multiple two-class problems, further reduces the system complexity and time consumption.

It should be noted that the difference in efficiency between AGkNN-AdaBoost and AdaBoost.ECC is not large. The main reason is that the additional reweighting pass of DIWC-3 takes a lot of time. However, AGkNN-AdaBoost still has advantages over AdaBoost.ECC, because AGkNN-AdaBoost with DIWC-1 or DIWC-2 has significantly lower runtime complexity. Moreover, the precision of AGkNN-AdaBoost proves better than that of AdaBoost.ECC no matter whether DIWC-1, DIWC-2, or DIWC-3 is used.

6.2. Performance Comparison and Analysis

Experiments were conducted to evaluate the performance of the system. The Chinese news corpus provided by Sogou Labs [33] was used as the training set and test set. Kernel conditional random fields (KCRFs) [34] were used for preprocessing the documents (word segmentation, feature extraction, and representation). The corpus can be divided into six categories: economics, politics, sports, weather reports, entertainment, and culture. In each category, 20000 documents were randomly selected as training samples and 10000 documents were randomly selected as test texts. The experimental results for the system’s precision, recall, and F1-measure, together with comparative data [35–37], are shown in Tables 1, 2, 3, and 4.

As shown in the above tables, AGkNN-AdaBoost has better performance than the other text categorization algorithms. Its performance is far beyond naïve Bayes, neural networks, and decision tree, and the novel classification tool also outperforms the other AdaBoost family members. In particular, strong classifiers integrated according to the DIWC-3 method have the best accuracy and recall, achieving an increase of more than six percentage points in average F1-measure over the six categories compared with the other AdaBoost family members. Although the support vector machine is an excellent classification tool, AGkNN-AdaBoost is even more accurate, and taking the runtime complexity into consideration, AGkNN-AdaBoost is much better than SVM.

Therefore, AGkNN-AdaBoost is an ideal tool for text categorization: it achieves high accuracy while keeping the runtime complexity very low. The adaptive characteristics improve the performance, the double iterative mechanism enhances the efficiency, and the multiclass classification ability improves precision and reduces time consumption at the same time.

It is interesting to note that classification of weather reports has the best precision and recall, probably because weather reports are quite simple and always contain similar features or key words such as sunny, rainy, cloudy, temperature, and heavy fog warning. AGkNN-AdaBoost does not perform as well in entertainment topic categorization as in the other categories, perhaps because documents in this topic contain too many new words and phrases, which gives them more complex features and possibly an extremely sparse feature space.

7. Conclusion

An improved boosting algorithm based on k-nearest neighbors is proposed in this paper. It uses an adaptive group-based kNN categorization algorithm as the basis classifiers and combines them with a double iterative weighted cascading method that contains three alternative modes. The strong classifiers are also modified to better handle multiclass classification tasks. The AGkNN-AdaBoost algorithm was implemented in a text categorization system, and several experiments were conducted. The experimental results show that the proposed algorithm has higher precision, recall, and robustness than traditional AdaBoost. Furthermore, the time and computational consumption of AGkNN-AdaBoost are lower than those of many other categorization tools, not limited to AdaBoost family algorithms. Therefore, the algorithm proposed in the former sections is quite a useful tool in text categorization, including Chinese TC problems.

However, using the support vector machine, one of the best classification algorithms, as a weak classifier combined with ideas similar to DIWC remains unexplored in text categorization. Moreover, there is room for improving the accuracy and efficiency of AGkNN-AdaBoost, and its performance in other classification tasks such as image processing, speech categorization, and writer identification should be evaluated. These will be undertaken as future work on this topic.

Acknowledgments

The material presented in this paper is partly based upon work supported by the China Association for Science and Technology. Experimental data is offered by the Sogou Labs.