
Editing training data for multi-label classification with the k-nearest neighbor rule

  • Short Paper
  • Published in: Pattern Analysis and Applications

Abstract

Multi-label classification allows instances to belong to several classes at once. It has received significant attention in machine learning and has found many real-world applications in recent years, such as text categorization, automatic video annotation and functional genomics, resulting in the development of many multi-label classification methods. Based on the labeled examples in the training dataset, a multi-label method extracts the inherent information needed to output a function that predicts the labels of unlabeled data. Owing to several problems, such as errors in the input vectors or in their labels, this information may be wrong and may lead the multi-label algorithm to fail. In this paper, we propose a simple algorithm that overcomes these problems by editing the existing training dataset, and we adapt the edited set to different multi-label classification methods. Evaluation on benchmark datasets demonstrates the usefulness and effectiveness of our approach.
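The editing idea described in the abstract can be illustrated with a minimal Wilson-style sketch: discard a training instance when its label set disagrees too strongly with the per-label majority vote of its k nearest neighbors. This is an illustrative assumption rather than the paper's exact algorithm; the function name, the Hamming-style disagreement score, and the `threshold` parameter are all hypothetical.

```python
import numpy as np

def edit_training_set(X, Y, k=3, threshold=0.5):
    """Wilson-style editing sketch for multi-label data (hypothetical).

    Removes an instance when the per-label Hamming disagreement between
    its label vector and the majority vote of its k nearest neighbors
    exceeds `threshold`.  X: (n, d) features, Y: (n, Q) binary labels.
    """
    n = X.shape[0]
    keep = np.ones(n, dtype=bool)
    # Brute-force squared Euclidean distances; fine for small n.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    np.fill_diagonal(d2, np.inf)              # never a neighbor of itself
    for i in range(n):
        nbrs = np.argsort(d2[i])[:k]          # k nearest neighbors of x_i
        vote = Y[nbrs].mean(axis=0) >= 0.5    # per-label majority vote
        disagreement = np.mean(vote != Y[i].astype(bool))
        if disagreement > threshold:
            keep[i] = False                   # suspected label noise: drop
    return X[keep], Y[keep]
```

With `threshold=0.5`, an instance is discarded when its neighbors contradict it on more than half of the labels, which reduces to classical Wilson editing when Q = 1.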


Notes

  1. Datasets available at http://mulan.sourceforge.net/datasets.html, and http://cse.seu.edu.cn/people/zhangml/.

  2. http://mips.gsf.de/genre/proj/yeast/.

  3. http://enrondata.org/content/research/.

  4. http://www.wormbook.org/chapters/www_genomclassprot/genomclassprot.html.

References

  1. Barbedo JGA, Lopes A (2007) Automatic genre classification of musical signals. EURASIP J Adv Signal Process 2007(1):064960. doi:10.1155/2007/64960

  2. Blockeel H, De Raedt L, Ramon J (1998) Top-down induction of clustering trees. In: 15th international conference on machine learning, pp 55–63. Morgan Kaufmann, London

  3. Brighton H, Mellish C (2002) Advances in instance selection for instance-based learning algorithms. Data Min Knowl Discov 6(2):153–172. doi:10.1023/A:1014043630878

  4. Brodley CE, Friedl MA (1999) Identifying mislabeled training data. J Artif Intell Res 11:131–167. doi:10.1613/jair.606

  5. de Carvalho A, Freitas AA (2009) A tutorial on multi-label classification techniques. In: Foundations of computational intelligence. Studies in computational intelligence, vol 5, pp 177–195. Springer, Berlin. doi:10.1007/978-3-642-01536-6_8

  6. Clare A, King RD (2001) Knowledge discovery in multi-label phenotype data. In: Proceedings of the 5th European conference on principles of data mining and knowledge discovery (PKDD ’01), vol 2168, pp 42–53. Springer-Verlag, London

  7. Cover T, Hart P (1967) Nearest neighbor pattern classification. IEEE Trans Inf Theory 13(1):21–27. doi:10.1109/TIT.1967.1053964

  8. Dasarathy BV (1991) Nearest neighbor (NN) norms: NN pattern classification techniques. IEEE Computer Society Press, London. doi:10.1109/2.84880

  9. Denoeux T (1995) A k-nearest neighbor classification rule based on Dempster–Shafer theory. IEEE Trans Syst Man Cybern 25(5):804–813

  10. Denoeux T, Younes Z, Abdallah F (2010) Representing uncertainty on set-valued variables using belief functions. Artif Intell 174(7–8):479–499. doi:10.1016/j.artint.2010.02.002

  11. Devijver PA (1986) On the editing rate of the Multiedit algorithm. Pattern Recogn Lett 4(1):9–12. doi:10.1016/0167-8655(86)90066-8

  12. Elisseeff A, Weston J (2001) Kernel methods for multi-labelled classification and categorical regression problems. In: Advances in neural information processing systems, vol 14, pp 681–687. MIT Press, New York

  13. García S, Derrac J, Cano JR, Herrera F (2012) Prototype selection for nearest neighbor classification: taxonomy and empirical study. IEEE Trans Pattern Anal Mach Intell 34(3):417–435. doi:10.1109/TPAMI.2011.142

  14. Garcia S, Herrera F (2008) An extension on “statistical comparisons of classifiers over multiple data sets” for all pairwise comparisons. J Mach Learn Res 9:2677–2694

  15. Guan D, Yuan W, Lee YK, Lee S (2009) Nearest neighbor editing aided by unlabeled data. Inf Sci 179(13):2273–2282. doi:10.1016/j.ins.2009.02.011

  16. Hattori K, Takahashi M (2000) A new edited k-nearest neighbor rule in the pattern classification problem. Pattern Recogn 33(3):521–528. doi:10.1016/S0031-3203(99)00068-0

  17. Jin B, Muller B, Zhai C, Lu X (2008) Multi-label literature classification based on the gene ontology graph. BMC Bioinform 9:525. doi:10.1186/1471-2105-9-525

  18. Kanj S, Abdallah F, Denoeux T (2012) Purifying training data to improve performance of multi-label classification algorithms. In: Proceedings of the 15th international conference on information fusion (FUSION 2012), pp 1784–1792. IEEE, Singapore

  19. Koplowitz J, Brown TA (1981) On the relation of performance to editing in nearest neighbor rules. Pattern Recogn 13(3):251–255. doi:10.1016/0031-3203(81)90102-3

  20. Li Y, Hu Z, Cai Y, Zhang W (2005) Support vector based prototype selection method for nearest neighbor rules. In: Proceedings of the first international conference on advances in natural computation, pp 528–535. Springer, Berlin. doi:10.1007/11539087_68

  21. Madjarov G, Kocev D, Gjorgjevikj D, Džeroski S (2012) An extensive experimental comparison of methods for multi-label learning. Pattern Recogn 45(9):3084–3104. doi:10.1016/j.patcog.2012.03.004

  22. Pavlidis P, Grundy WN (2000) Combining microarray expression data and phylogenetic profiles to learn gene functional categories using support vector machines. In: Technical report, Department of Computer Science, Columbia University, New York

  23. Pestian JP, Brew C, Matykiewicz P, Hovermale DJ, Johnson N, Cohen KB, Duch W (2007) A shared task involving multi-label classification of clinical free text. In: Proceedings of the workshop on BioNLP 2007: biological, vol 1. Translational, and clinical language processing (BioNLP ’07), pp 97–104. Association for Computational Linguistics, Prague

  24. Pękalska E, Duin RP, Paclík P (2006) Prototype selection for dissimilarity-based classifiers. Pattern Recogn 39(2):189–208. doi:10.1016/j.patcog.2005.06.012

  25. Qi GJ, Hua XS, Rui Y, Tang J, Mei T, Zhang HJ (2007) Correlative multi-label video annotation. In: Proceedings of the 15th international conference on multimedia—MULTIMEDIA ’07, p 17. ACM Press, Augsburg. doi:10.1145/1291233.1291245

  26. Read J, Pfahringer B, Holmes G, Frank E (2011) Classifier chains for multi-label classification. Mach Learn 85(3):333–359. doi:10.1007/s10994-011-5256-5

  27. Sánchez J, Pla F, Ferri F (1997) Prototype selection for the nearest neighbour rule through proximity graphs. Pattern Recogn Lett 18(6):507–513. doi:10.1016/S0167-8655(97)00035-4

  28. Schapire RE, Singer Y (1999) Improved boosting algorithms using confidence-rated predictions. Mach Learn 37(3):297–336. doi:10.1023/A:1007614523901

  29. Schapire RE, Singer Y (2000) BoosTexter: a boosting-based system for text categorization. Mach Learn 39(2–3):135–168. doi:10.1023/A:1007649029923

  30. Sebastiani F (2002) Machine learning in automated text categorization. ACM Comput Surv 34(1):1–47. doi:10.1145/505282.505283

  31. Shetty J, Adibi J (2004) The enron email dataset database schema and brief statistical report. In: Technical report, Information Sciences Institute, University of Southern California

  32. Tahir MA, Kittler J, Bouridane A (2012) Multilabel classification using heterogeneous ensemble of multi-label classifiers. Pattern Recogn Lett 33(5):513–523. doi:10.1016/j.patrec.2011.10.019

  33. Tang L, Rajan S, Narayanan VK (2009) Large scale multi-label classification via metalabeler. In: Proceedings of the 18th international conference on world wide web (WWW ’09), p 211. ACM Press, Madrid. doi:10.1145/1526709.1526738

  34. Tomek I (1976) Two modifications of CNN. IEEE Trans Syst Man Cybern 6(11):769–772. doi:10.1109/TSMC.1976.4309452

  35. Trohidis K, Tsoumakas G, Kalliris G, Vlahavas I (2008) Multi-label classification of music into emotions. In: Proceedings of the 9th international conference on music information retrieval (ISMIR ’08), pp 325–330, Philadelphia

  36. Tsoumakas G, Katakis I (2007) Multi-label classification: an overview. Int J Data Wareh Min 3(3):1–13

  37. Tsoumakas G, Katakis I, Vlahavas I (2010) Mining multi-label data. In: Maimon O, Rokach L (eds) Data mining and knowledge discovery handbook, pp 667–685. Springer, Thessaloniki. doi:10.1007/978-0-387-09823-4_34

  38. Tsoumakas G, Katakis I, Vlahavas I (2011) Random k-labelsets for multilabel classification. IEEE Trans Knowl Data Eng 23(7):1079–1089. doi:10.1109/TKDE.2010.164

  39. Van Hulse J, Khoshgoftaar T (2009) Knowledge discovery from imbalanced and noisy data. Data Knowl Eng 68(12):1513–1542. doi:10.1016/j.datak.2009.08.005

  40. Wilson DL (1972) Asymptotic properties of nearest neighbor rules using edited data. IEEE Trans Syst Man Cybern 2(3):408–421. doi:10.1109/TSMC.1972.4309137

  41. Wu X, Kumar V, Ross Quinlan J, Ghosh J, Yang Q, Motoda H, McLachlan GJ, Ng A, Liu B, Yu PS, Zhou ZH, Steinbach M, Hand DJ, Steinberg D (2007) Top 10 algorithms in data mining. Knowl Inf Syst 14(1):1–37. doi:10.1007/s10115-007-0114-2

  42. Xu J (2011) An extended one-versus-rest support vector machine for multi-label classification. Neurocomputing 74(17):3114–3124. doi:10.1016/j.neucom.2011.04.024

  43. Xu J (2012) An efficient multi-label support vector machine with a zero label. Expert Syst Appl 39(5):4796–4804. doi:10.1016/j.eswa.2011.09.138

  44. Younes Z, Abdallah F, Denoeux T, Snoussi H (2011) A dependent multilabel classification method derived from the k-nearest neighbor rule. EURASIP J Adv Signal Process 2011(1):645964. doi:10.1155/2011/645964

  45. Zhang ML, Zhou ZH (2006) Multilabel neural networks with applications to functional genomics and text categorization. IEEE Trans Knowl Data Eng 18(10):1338–1351. doi:10.1109/TKDE.2006.162

  46. Zhang ML, Zhou ZH (2007) ML-KNN: a lazy learning approach to multi-label learning. Pattern Recogn 40(7):2038–2048. doi:10.1016/j.patcog.2006.12.019

  47. Zhou ZH, Zhang ML (2007) Multi-instance multi-label learning with application to scene classification. In: 22nd conference on artificial intelligence (AAAI), vol 40, pp 1609–1616. MIT Press, Vancouver

  48. Zouhal L, Denoeux T (1998) An evidence-theoretic k-NN rule with parameter optimization. IEEE Trans Syst Man Cybern Part C (Applications and Reviews) 28(2):263–271. doi:10.1109/5326.669565


Author information


Corresponding author

Correspondence to Fahed Abdallah.

Appendices

Appendix 1: Evaluation measures

As discussed in Sect. 3.2, performance evaluation for multi-label learning systems differs from that of single-label classification. Let \(\mathcal {H}:\mathbb {X}\rightarrow 2^\mathcal {Y}\) be a multi-label classifier that assigns a predicted label subset of \(\mathcal {Y}=\{\omega _1,\ldots ,\omega _{Q}\}\) to each instance \(\mathbf {x}\in \mathbb {X}\), and let \(f:\mathbb {X}\times \mathcal {Y}\rightarrow [0,1]\) be the corresponding scoring function, which gives a score for each label \(\omega _q\), interpreted as the probability that \(\omega _q\) is relevant. The function \(f(\cdot ,\cdot )\) can be transformed into a ranking function \({\rm{rank}}_f(\cdot ,\cdot )\), which maps the scores \(f(\mathbf {x},\omega )\) for all \(\omega \in \mathcal {Y}\) to ranks in \(\{1,2,\ldots ,Q\}\), so that \(f(\mathbf {x}_i,\omega _q)>f(\mathbf {x}_i,\omega _r)\) implies \({\rm{rank}}_f(\mathbf {x}_i,\omega _q)<{\rm{rank}}_f(\mathbf {x}_i,\omega _r)\).

Given a set \(\mathcal {S}=\{(\mathbf {x}_1,Y_1),\ldots ,(\mathbf {x}_m,Y_m)\}\) of \(m\) test examples, the evaluation metrics for multi-label learning systems fall into two groups: prediction-based and ranking-based metrics. Prediction-based measures are computed from the average difference between the actual and the predicted label sets over all test examples. Ranking-based metrics evaluate the quality of the label ranking induced by the scoring function \(f(\cdot ,\cdot )\).

1.1 Prediction-based measures

Hamming loss: The Hamming loss metric for the set of labels is defined as the fraction of labels whose relevance is incorrectly predicted:

$$\begin{aligned} \mathcal H{\rm{Loss}}(\mathcal {H},\mathcal {S})=\frac{1}{m}\sum _{i=1}^m \frac{|Y_i\triangle \widehat{Y}_i|}{Q}, \end{aligned}$$
(7)

where \(\triangle\) denotes the symmetric difference between two sets.

Accuracy: The accuracy metric gives an average degree of similarity between the predicted and the ground truth label sets:

$$\begin{aligned} \mathcal A{\rm{ccuracy}}(\mathcal {H},\mathcal {S})=\frac{1}{m}\sum _{i=1}^m \frac{|Y_i\cap \widehat{Y}_i|}{|Y_i\cup \widehat{Y}_i|}. \end{aligned}$$
(8)

Precision: The precision metric computes the proportion of true positive predictions:

$$\begin{aligned} \mathcal P{\rm{recision}}(\mathcal {H},\mathcal {S})=\frac{1}{m}\sum _{i=1}^m \frac{|Y_i \cap \widehat{Y}_i|}{|\widehat{Y}_i|}. \end{aligned}$$
(9)

Recall: This metric estimates the proportion of true labels that have been predicted as positives:

$$\begin{aligned} \mathcal R{\rm{ecall}}(\mathcal {H},\mathcal {S})=\frac{1}{m}\sum _{i=1}^m \frac{|Y_i \cap \widehat{Y}_i|}{|Y_i|}. \end{aligned}$$
(10)

F1-measure: The F1-measure is defined as the harmonic mean of precision and recall, averaged over all test examples:

$$\begin{aligned} \mathcal F1(\mathcal {H},\mathcal {S})=\frac{1}{m}\sum _{i=1}^m \frac{2|Y_i\cap \widehat{Y}_i|}{|Y_i| + |\widehat{Y}_i|}. \end{aligned}$$
(11)

Note that the smaller the value of the Hamming loss, the better the performance. For the other metrics, higher values correspond to better classification quality.
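Eqs. (7)–(11) can be computed directly from label sets. The sketch below is an assumed implementation (the function name and the set-based representation are illustrative, not from the paper); it presumes that every \(Y_i\) and \(\widehat{Y}_i\) is nonempty, since otherwise accuracy, precision, and recall divide by zero.

```python
def multilabel_metrics(Y_true, Y_pred, Q):
    """Prediction-based measures of Eqs. (7)-(11).

    Y_true, Y_pred: lists of Python sets of label indices; Q: total
    number of labels.  Assumes every true and predicted set is nonempty.
    """
    m = len(Y_true)
    pairs = list(zip(Y_true, Y_pred))
    return {
        # ^ is symmetric difference, & intersection, | union
        "hamming_loss": sum(len(Y ^ Yh) / Q for Y, Yh in pairs) / m,
        "accuracy": sum(len(Y & Yh) / len(Y | Yh) for Y, Yh in pairs) / m,
        "precision": sum(len(Y & Yh) / len(Yh) for Y, Yh in pairs) / m,
        "recall": sum(len(Y & Yh) / len(Y) for Y, Yh in pairs) / m,
        "f1": sum(2 * len(Y & Yh) / (len(Y) + len(Yh)) for Y, Yh in pairs) / m,
    }
```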

1.2 Ranking-based measures

One-Error: This metric computes how often the top-ranked label is not in the instance's true label set; it ignores the relevancy of all other labels.

$$\begin{aligned} {\rm{OErr}}(f,\mathcal {S})=\frac{1}{m}\sum _{i=1}^m \left\langle \left[\underset{\omega \in \mathcal {Y}}{\arg \max }~f(\mathbf {x}_i,\omega )\right]\notin Y_i\right\rangle , \end{aligned}$$
(12)

where, for any proposition \(H\), \(\langle H \rangle\) equals 1 if \(H\) holds and 0 otherwise. Note that, for single-label classification problems, the One-Error is identical to the ordinary classification error.

Coverage: Coverage computes the average of how far we need to move down the ranked label list to cover all the labels assigned to a test instance.

$$\begin{aligned} \mathcal C{\rm{ov}}(f,\mathcal {S}) = \frac{1}{m}\sum _{i=1}^m \max _{\omega \in Y_i} {\rm{rank}}_f(\mathbf {x}_i,\omega )-1. \end{aligned}$$
(13)

Ranking loss: This metric computes the average fraction of label pairs in which an irrelevant label is ranked at least as high as a relevant label.

$$\begin{aligned} {\mathcal {R}}{\rm{Loss}}(f,\mathcal {S}) = \frac{1}{m}\sum _{i=1}^m \frac{|\{(\omega _q,\omega _r)\in Y_i\times \overline{Y}_i~:~f(\mathbf {x}_i,\omega _q)\le f(\mathbf {x}_i,\omega _r)\}|}{|Y_i||\overline{Y}_i|}, \end{aligned}$$
(14)

where \(\overline{Y}_i\) is the complementary set of \(Y_i\) in \(\mathcal {Y}\).

Average precision: This metric evaluates the average fraction of relevant labels ranked at or above a particular label \(\omega \in Y_i\).

$$\begin{aligned} {\mathcal {A}}{\rm{vPrec}}(f,\mathcal {S}) = \frac{1}{m}\sum _{i=1}^m \frac{1}{|Y_i|}\sum _{\omega _q \in Y_i}\frac{|\{\omega _r \in Y_i~:~{\rm{rank}}_f(\mathbf {x}_i,\omega _r)\le {\rm{rank}}_f(\mathbf {x}_i,\omega _q)\}|}{{\rm{rank}}_f(\mathbf {x}_i,\omega _q)}. \end{aligned}$$
(15)

Note that AvPrec\((f, \mathcal {S}) = 1\) means that the labels are perfectly ranked. For the other metrics, smaller values correspond to a better label ranking quality.
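The ranking-based measures of Eqs. (12)–(15) can be sketched from a score matrix. The function below is an assumed illustration (its name and signature are hypothetical); it presumes each \(Y_i\) is a nonempty proper subset of the \(Q\) labels, so that \(\overline{Y}_i\) is nonempty.

```python
import numpy as np

def ranking_metrics(scores, Y_true):
    """Ranking-based measures of Eqs. (12)-(15).

    scores: (m, Q) array with scores[i, q] = f(x_i, omega_q);
    Y_true: list of sets of relevant label indices.  Assumes each Y_i
    is a nonempty proper subset of the Q labels.
    """
    m, Q = scores.shape
    one_err = cov = rloss = avprec = 0.0
    for i, Y in enumerate(Y_true):
        f = scores[i]
        order = np.argsort(-f)                # best-scoring label first
        rank = np.empty(Q, dtype=int)
        rank[order] = np.arange(1, Q + 1)     # rank 1 = highest score
        one_err += order[0] not in Y          # Eq. (12)
        cov += max(rank[q] for q in Y) - 1    # Eq. (13)
        Ybar = set(range(Q)) - Y
        rloss += sum(f[q] <= f[r] for q in Y for r in Ybar) \
            / (len(Y) * len(Ybar))            # Eq. (14)
        avprec += sum(                        # Eq. (15)
            sum(rank[r] <= rank[q] for r in Y) / rank[q] for q in Y
        ) / len(Y)
    return {"one_error": one_err / m, "coverage": cov / m,
            "ranking_loss": rloss / m, "avg_precision": avprec / m}
```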

Appendix 2: Multi-labeled dataset statistics

Given a multi-labeled dataset \(\mathcal {D}=\{(\mathbf {x}_i,Y_i),i=1,\ldots ,n \}\) with \(\mathbf {x}_i \in \mathbb {X}\) and \(Y_i \subseteq \mathcal {Y}\), this dataset can be measured by the number of instances (\(n\)), the number of attributes in the input space, and the number of labels (\(Q\)). In the following, we review some statistics about the multi-labeled dataset \(\mathcal {D}\) [36].

Label cardinality: The label cardinality (LCard) of \(\mathcal {D}\) is the average number of labels per instance. It is calculated as:

$$\begin{aligned} {\rm{LCard}}(\mathcal {D})=\frac{1}{n}\sum _{i=1}^n |Y_i| \end{aligned}$$
(16)

Label density: The label density (LDen) of \(\mathcal {D}\) is defined as the average number of labels per instance divided by the total number of labels \(Q\). Label density is calculated as:

$$\begin{aligned} {\rm{LDen}}(\mathcal {D})=\frac{1}{n}\sum _{i=1}^n \frac{|Y_i|}{Q} \end{aligned}$$
(17)

Both metrics indicate the number of alternative labels that characterize the examples of a multi-labeled dataset. Label cardinality is independent of the total number of labels in the classification problem, while label density takes into consideration the total number of labels. Two datasets with the same label cardinality but with different label densities may present different properties that influence the performance of the multi-label classification methods.

Distinct label sets: The distinct label sets (DL) statistic counts the number of different label sets that appear among the examples. It is given by:

$$\begin{aligned} {\rm{DL}}(\mathcal {D})=|\{Y \subseteq \mathcal {Y}~|~\exists ~\mathbf {x}\in \mathbb {X}:(\mathbf {x},Y)\in \mathcal {D}\}| \end{aligned}$$
(18)

This measure gives an idea of the regularity of the labeling scheme.
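The three statistics above reduce to a few lines of code. A minimal sketch, with label sets represented as Python sets (the function name is illustrative):

```python
def dataset_stats(Y_sets, Q):
    """LCard, LDen and DL of Eqs. (16)-(18) for a list of label sets."""
    n = len(Y_sets)
    lcard = sum(len(Y) for Y in Y_sets) / n            # Eq. (16)
    lden = lcard / Q                                   # Eq. (17)
    dl = len({frozenset(Y) for Y in Y_sets})           # Eq. (18)
    return lcard, lden, dl
```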


Cite this article

Kanj, S., Abdallah, F., Denœux, T. et al. Editing training data for multi-label classification with the k-nearest neighbor rule. Pattern Anal Applic 19, 145–161 (2016). https://doi.org/10.1007/s10044-015-0452-8

