
Editing training data for multi-label classification with the k-nearest neighbor rule

  • Short Paper
  • Published in: Pattern Analysis and Applications

Abstract

Multi-label classification allows instances to belong to several classes at once. It has received significant attention in machine learning and has found many real-world applications in recent years, such as text categorization, automatic video annotation and functional genomics, resulting in the development of many multi-label classification methods. Based on the labeled examples in the training dataset, a multi-label method extracts the inherent information needed to output a function that predicts the labels of unlabeled data. Owing to several problems, such as errors in the input vectors or in their labels, this information may be wrong and may lead the multi-label algorithm to fail. In this paper, we propose a simple algorithm that overcomes these problems by editing the existing training dataset, and we adapt the edited set to different multi-label classification methods. Evaluation on benchmark datasets demonstrates the usefulness and effectiveness of our approach.
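The editing idea described in the abstract can be illustrated with a minimal Wilson-style sketch: discard a training instance when its label set disagrees too strongly with the per-label majority vote of its k nearest neighbors. This is an illustrative assumption rather than the paper's exact algorithm; the function name, the Hamming-style disagreement score, and the `threshold` parameter are all hypothetical.

```python
import numpy as np

def edit_training_set(X, Y, k=3, threshold=0.5):
    """Wilson-style editing sketch for multi-label data (hypothetical).

    Removes an instance when the per-label Hamming disagreement between
    its label vector and the majority vote of its k nearest neighbors
    exceeds `threshold`.  X: (n, d) features, Y: (n, Q) binary labels.
    """
    n = X.shape[0]
    keep = np.ones(n, dtype=bool)
    # Brute-force squared Euclidean distances; fine for small n.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    np.fill_diagonal(d2, np.inf)              # never a neighbor of itself
    for i in range(n):
        nbrs = np.argsort(d2[i])[:k]          # k nearest neighbors of x_i
        vote = Y[nbrs].mean(axis=0) >= 0.5    # per-label majority vote
        disagreement = np.mean(vote != Y[i].astype(bool))
        if disagreement > threshold:
            keep[i] = False                   # suspected label noise: drop
    return X[keep], Y[keep]
```

With `threshold=0.5`, an instance is discarded when its neighbors contradict it on more than half of the labels, which reduces to classical Wilson editing when Q = 1.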


Notes

  1. Datasets available at http://mulan.sourceforge.net/datasets.html, and http://cse.seu.edu.cn/people/zhangml/.

  2. http://mips.gsf.de/genre/proj/yeast/.

  3. http://enrondata.org/content/research/.

  4. http://www.wormbook.org/chapters/www_genomclassprot/genomclassprot.html.

References

  1. Barbedo JGA, Lopes A (2007) Automatic genre classification of musical signals. EURASIP J Adv Signal Process 2007(1):064960. doi:10.1155/2007/64960

  2. Blockeel H, De Raedt L, Ramon J (1998) Top-down induction of clustering trees. In: 15th international conference on machine learning, pp 55–63. Morgan Kaufmann, London

  3. Brighton H, Mellish C (2002) Advances in instance selection for instance-based learning algorithms. Data Min Knowl Discov 6(2):153–172. doi:10.1023/A:1014043630878

  4. Brodley CE, Friedl MA (1999) Identifying mislabeled training data. J Artif Intell Res 11:131–167. doi:10.1613/jair.606

  5. de Carvalho A, Freitas AA (2009) A tutorial on multi-label classification techniques. In: Foundations of computational intelligence. Studies in computational intelligence, vol 5, pp 177–195. Springer, Berlin. doi:10.1007/978-3-642-01536-6_8

  6. Clare A, King RD (2001) Knowledge discovery in multi-label phenotype data. In: Proceedings of the 5th European conference on principles of data mining and knowledge discovery (PKDD ’01), vol 2168, pp 42–53. Springer-Verlag, London

  7. Cover T, Hart P (1967) Nearest neighbor pattern classification. IEEE Trans Inf Theory 13(1):21–27. doi:10.1109/TIT.1967.1053964

  8. Dasarathy BV (1991) Nearest neighbor (NN) norms: NN pattern classification techniques. IEEE Computer Society Press, London. doi:10.1109/2.84880

  9. Denoeux T (1995) A k-nearest neighbor classification rule based on Dempster–Shafer theory. IEEE Trans Syst Man Cybern 25(5):804–813

  10. Denoeux T, Younes Z, Abdallah F (2010) Representing uncertainty on set-valued variables using belief functions. Artif Intell 174(7–8):479–499. doi:10.1016/j.artint.2010.02.002

  11. Devijver PA (1986) On the editing rate of the Multiedit algorithm. Pattern Recogn Lett 4(1):9–12. doi:10.1016/0167-8655(86)90066-8

  12. Elisseeff A, Weston J (2001) Kernel methods for multi-labelled classification and categorical regression problems. In: Advances in neural information processing systems, vol 14, pp 681–687. MIT Press, New York

  13. García S, Derrac J, Cano JR, Herrera F (2012) Prototype selection for nearest neighbor classification: taxonomy and empirical study. IEEE Trans Pattern Anal Mach Intell 34(3):417–435. doi:10.1109/TPAMI.2011.142

  14. Garcia S, Herrera F (2008) An extension on “statistical comparisons of classifiers over multiple data sets” for all pairwise comparisons. J Mach Learn Res 9:2677–2694

  15. Guan D, Yuan W, Lee YK, Lee S (2009) Nearest neighbor editing aided by unlabeled data. Inf Sci 179(13):2273–2282. doi:10.1016/j.ins.2009.02.011

  16. Hattori K, Takahashi M (2000) A new edited k-nearest neighbor rule in the pattern classification problem. Pattern Recogn 33(3):521–528. doi:10.1016/S0031-3203(99)00068-0

  17. Jin B, Muller B, Zhai C, Lu X (2008) Multi-label literature classification based on the gene ontology graph. BMC Bioinform 9:525. doi:10.1186/1471-2105-9-525

  18. Kanj S, Abdallah F, Denoeux T (2012) Purifying training data to improve performance of multi-label classification algorithms. In: Proceedings of the 15th international conference on information fusion (FUSION 2012), pp 1784–1792. IEEE, Singapore

  19. Koplowitz J, Brown TA (1981) On the relation of performance to editing in nearest neighbor rules. Pattern Recogn 13(3):251–255. doi:10.1016/0031-3203(81)90102-3

  20. Li Y, Hu Z, Cai Y, Zhang W (2005) Support vector based prototype selection method for nearest neighbor rules. In: Proceedings of the first international conference on advances in natural computation, pp 528–535. Springer, Berlin. doi:10.1007/11539087_68

  21. Madjarov G, Kocev D, Gjorgjevikj D, Džeroski S (2012) An extensive experimental comparison of methods for multi-label learning. Pattern Recogn 45(9):3084–3104. doi:10.1016/j.patcog.2012.03.004

  22. Pavlidis P, Grundy WN (2000) Combining microarray expression data and phylogenetic profiles to learn gene functional categories using support vector machines. In: Technical report, Department of Computer Science, Columbia University, New York

  23. Pestian JP, Brew C, Matykiewicz P, Hovermale DJ, Johnson N, Cohen KB, Duch W (2007) A shared task involving multi-label classification of clinical free text. In: Proceedings of the workshop on BioNLP 2007: biological, vol 1. Translational, and clinical language processing (BioNLP ’07), pp 97–104. Association for Computational Linguistics, Prague

  24. Pękalska E, Duin RP, Paclík P (2006) Prototype selection for dissimilarity-based classifiers. Pattern Recogn 39(2):189–208. doi:10.1016/j.patcog.2005.06.012

  25. Qi GJ, Hua XS, Rui Y, Tang J, Mei T, Zhang HJ (2007) Correlative multi-label video annotation. In: Proceedings of the 15th international conference on multimedia—MULTIMEDIA ’07, p 17. ACM Press, Augsburg. doi:10.1145/1291233.1291245

  26. Read J, Pfahringer B, Holmes G, Frank E (2011) Classifier chains for multi-label classification. Mach Learn 85(3):333–359. doi:10.1007/s10994-011-5256-5

  27. Sánchez J, Pla F, Ferri F (1997) Prototype selection for the nearest neighbour rule through proximity graphs. Pattern Recogn Lett 18(6):507–513. doi:10.1016/S0167-8655(97)00035-4

  28. Schapire RE, Singer Y (1999) Improved boosting algorithms using confidence-rated predictions. Mach Learn 37(3):297–336. doi:10.1023/A:1007614523901

  29. Schapire RE, Singer Y (2000) BoosTexter: a boosting-based system for text categorization. Mach Learn 39(2–3):135–168. doi:10.1023/A:1007649029923

  30. Sebastiani F (2002) Machine learning in automated text categorization. ACM Comput Surv 34(1):1–47. doi:10.1145/505282.505283

  31. Shetty J, Adibi J (2004) The enron email dataset database schema and brief statistical report. In: Technical report, Information Sciences Institute, University of Southern California

  32. Tahir MA, Kittler J, Bouridane A (2012) Multilabel classification using heterogeneous ensemble of multi-label classifiers. Pattern Recogn Lett 33(5):513–523. doi:10.1016/j.patrec.2011.10.019

  33. Tang L, Rajan S, Narayanan VK (2009) Large scale multi-label classification via metalabeler. In: Proceedings of the 18th international conference on world wide web (WWW ’09), p 211. ACM Press, Madrid. doi:10.1145/1526709.1526738

  34. Tomek I (1976) Two modifications of CNN. IEEE Trans Syst Man Cybern 6(11):769–772. doi:10.1109/TSMC.1976.4309452

  35. Trohidis K, Tsoumakas G, Kalliris G, Vlahavas I (2008) Multi-label classification of music into emotions. In: Proceedings of the 9th international conference on music information retrieval (ISMIR ’08), pp 325–330, Philadelphia

  36. Tsoumakas G, Katakis I (2007) Multi-label classification: an overview. Int J Data Wareh Min 3(3):1–13

  37. Tsoumakas G, Katakis I, Vlahavas I (2010) Mining multi-label data. In: Maimon O, Rokach L (eds) Data mining and knowledge discovery handbook, pp 667–685. Springer, Thessaloniki. doi:10.1007/978-0-387-09823-4_34

  38. Tsoumakas G, Katakis I, Vlahavas I (2011) Random k-labelsets for multilabel classification. IEEE Trans Knowl Data Eng 23(7):1079–1089. doi:10.1109/TKDE.2010.164

  39. Van Hulse J, Khoshgoftaar T (2009) Knowledge discovery from imbalanced and noisy data. Data Knowl Eng 68(12):1513–1542. doi:10.1016/j.datak.2009.08.005

  40. Wilson DL (1972) Asymptotic properties of nearest neighbor rules using edited data. IEEE Trans Syst Man Cybern 2(3):408–421. doi:10.1109/TSMC.1972.4309137

  41. Wu X, Kumar V, Ross Quinlan J, Ghosh J, Yang Q, Motoda H, McLachlan GJ, Ng A, Liu B, Yu PS, Zhou ZH, Steinbach M, Hand DJ, Steinberg D (2007) Top 10 algorithms in data mining. Knowl Inf Syst 14(1):1–37. doi:10.1007/s10115-007-0114-2

  42. Xu J (2011) An extended one-versus-rest support vector machine for multi-label classification. Neurocomputing 74(17):3114–3124. doi:10.1016/j.neucom.2011.04.024

  43. Xu J (2012) An efficient multi-label support vector machine with a zero label. Expert Syst Appl 39(5):4796–4804. doi:10.1016/j.eswa.2011.09.138

  44. Younes Z, Abdallah F, Denoeux T, Snoussi H (2011) A dependent multilabel classification method derived from the k-nearest neighbor rule. EURASIP J Adv Signal Process 2011(1):645964. doi:10.1155/2011/645964

  45. Zhang ML, Zhou ZH (2006) Multilabel neural networks with applications to functional genomics and text categorization. IEEE Trans Knowl Data Eng 18(10):1338–1351. doi:10.1109/TKDE.2006.162

  46. Zhang ML, Zhou ZH (2007) ML-KNN: a lazy learning approach to multi-label learning. Pattern Recogn 40(7):2038–2048. doi:10.1016/j.patcog.2006.12.019

  47. Zhou ZH, Zhang ML (2007) Multi-instance multi-label learning with application to scene classification. In: 22nd conference on artificial intelligence (AAAI), vol 40, pp 1609–1616. MIT Press, Vancouver

  48. Zouhal L, Denoeux T (1998) An evidence-theoretic k-NN rule with parameter optimization. IEEE Trans Syst Man Cybern Part C (Applications and Reviews) 28(2):263–271. doi:10.1109/5326.669565


Author information


Corresponding author

Correspondence to Fahed Abdallah.

Appendices

Appendix 1: Evaluation measures

As discussed in Sect. 3.2, performance evaluation for multi-label learning systems differs from that of single-label classification. Let \(\mathcal {H}:\mathbb {X}\rightarrow 2^\mathcal {Y}\) be a multi-label classifier that assigns a predicted label subset of \(\mathcal {Y}=\{\omega _1,\ldots ,\omega _{Q}\}\) to each instance \(\mathbf {x}\in \mathbb {X}\), and let \(f:\mathbb {X}\times \mathcal {Y}\rightarrow [0,1]\) be the corresponding scoring function, which gives a score for each label \(\omega _q\), interpreted as the probability that \(\omega _q\) is relevant. The function \(f(\cdot ,\cdot )\) can be transformed into a ranking function \({\rm{rank}}_f(\cdot ,\cdot )\), which maps the scores \(f(\mathbf {x},\omega )\) for all \(\omega \in \mathcal {Y}\) to ranks in \(\{1,2,\ldots ,Q\}\), so that \(f(\mathbf {x}_i,\omega _q)>f(\mathbf {x}_i,\omega _r)\) implies \({\rm{rank}}_f(\mathbf {x}_i,\omega _q)<{\rm{rank}}_f(\mathbf {x}_i,\omega _r)\).

Given a set \(\mathcal {S}=\{(\mathbf {x}_1,Y_1),\ldots ,(\mathbf {x}_m,Y_m)\}\) of \(m\) test examples, the evaluation metrics for multi-label learning systems fall into two groups: prediction-based and ranking-based metrics. Prediction-based measures are computed from the average difference between the actual and the predicted label sets over all test examples. Ranking-based metrics evaluate the quality of the label ranking induced by the scoring function \(f(\cdot ,\cdot )\).

1.1 Prediction-based measures

Hamming loss: The Hamming loss metric for the set of labels is defined as the fraction of labels whose relevance is incorrectly predicted:

$$\begin{aligned} \mathcal H{\rm{Loss}}(\mathcal {H},\mathcal {S})=\frac{1}{m}\sum _{i=1}^m \frac{|Y_i\triangle \widehat{Y}_i|}{Q}, \end{aligned}$$
(7)

where \(\triangle\) denotes the symmetric difference between two sets.

Accuracy: The accuracy metric gives an average degree of similarity between the predicted and the ground truth label sets:

$$\begin{aligned} \mathcal A{\rm{ccuracy}}(\mathcal {H},\mathcal {S})=\frac{1}{m}\sum _{i=1}^m \frac{|Y_i\cap \widehat{Y}_i|}{|Y_i\cup \widehat{Y}_i|}. \end{aligned}$$
(8)

Precision: The precision metric computes the proportion of true positive predictions:

$$\begin{aligned} \mathcal P{\rm{recision}}(\mathcal {H},\mathcal {S})=\frac{1}{m}\sum _{i=1}^m \frac{|Y_i \cap \widehat{Y}_i|}{|\widehat{Y}_i|}. \end{aligned}$$
(9)

Recall: This metric estimates the proportion of true labels that have been predicted as positives:

$$\begin{aligned} \mathcal R{\rm{ecall}}(\mathcal {H},\mathcal {S})=\frac{1}{m}\sum _{i=1}^m \frac{|Y_i \cap \widehat{Y}_i|}{|Y_i|}. \end{aligned}$$
(10)

F1-measure: The F1-measure is defined as the harmonic mean of precision and recall, averaged over all test examples:

$$\begin{aligned} \mathcal F1(\mathcal {H},\mathcal {S})=\frac{1}{m}\sum _{i=1}^m \frac{2|Y_i\cap \widehat{Y}_i|}{|Y_i| + |\widehat{Y}_i|}. \end{aligned}$$
(11)

Note that the smaller the value of the Hamming loss, the better the performance. For the other metrics, higher values correspond to better classification quality.
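Eqs. (7)–(11) can be computed directly from label sets. The sketch below is an assumed implementation (the function name and the set-based representation are illustrative, not from the paper); it presumes that every \(Y_i\) and \(\widehat{Y}_i\) is nonempty, since otherwise accuracy, precision, and recall divide by zero.

```python
def multilabel_metrics(Y_true, Y_pred, Q):
    """Prediction-based measures of Eqs. (7)-(11).

    Y_true, Y_pred: lists of Python sets of label indices; Q: total
    number of labels.  Assumes every true and predicted set is nonempty.
    """
    m = len(Y_true)
    pairs = list(zip(Y_true, Y_pred))
    return {
        # ^ is symmetric difference, & intersection, | union
        "hamming_loss": sum(len(Y ^ Yh) / Q for Y, Yh in pairs) / m,
        "accuracy": sum(len(Y & Yh) / len(Y | Yh) for Y, Yh in pairs) / m,
        "precision": sum(len(Y & Yh) / len(Yh) for Y, Yh in pairs) / m,
        "recall": sum(len(Y & Yh) / len(Y) for Y, Yh in pairs) / m,
        "f1": sum(2 * len(Y & Yh) / (len(Y) + len(Yh)) for Y, Yh in pairs) / m,
    }
```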

1.2 Ranking-based measures

One-Error: This metric computes how often the top-ranked label is not in the instance's true label set; it ignores the relevancy of all other labels.

$$\begin{aligned} {\rm{OErr}}(f,\mathcal {S})=\frac{1}{m}\sum _{i=1}^m \left\langle \left[\underset{\omega \in \mathcal {Y}}{\arg \max }~f(\mathbf {x}_i,\omega )\right]\notin Y_i\right\rangle , \end{aligned}$$
(12)

where, for any proposition \(H\), \(\langle H \rangle\) equals 1 if \(H\) holds and 0 otherwise. Note that, for single-label classification problems, the One-Error is identical to the ordinary classification error.

Coverage: Coverage computes the average of how far we need to move down the ranked label list to cover all the labels assigned to a test instance.

$$\begin{aligned} \mathcal C{\rm{ov}}(f,\mathcal {S}) = \frac{1}{m}\sum _{i=1}^m \max _{\omega \in Y_i} {\rm{rank}}_f(\mathbf {x}_i,\omega )-1. \end{aligned}$$
(13)

Ranking loss: This metric computes the average fraction of label pairs in which an irrelevant label is ranked at least as high as a relevant label.

$$\begin{aligned} {\mathcal {R}}{\rm{Loss}}(f,\mathcal {S}) = \frac{1}{m}\sum _{i=1}^m \frac{|\{(\omega _q,\omega _r)\in Y_i\times \overline{Y}_i~:~f(\mathbf {x}_i,\omega _q)\le f(\mathbf {x}_i,\omega _r)\}|}{|Y_i||\overline{Y}_i|}, \end{aligned}$$
(14)

where \(\overline{Y}_i\) is the complementary set of \(Y_i\) in \(\mathcal {Y}\).

Average precision: This metric evaluates the average fraction of relevant labels ranked at or above a particular label \(\omega \in Y_i\).

$$\begin{aligned} {\mathcal {A}}{\rm{vPrec}}(f,\mathcal {S}) = \frac{1}{m}\sum _{i=1}^m \frac{1}{|Y_i|}\sum _{\omega _q \in Y_i}\frac{|\{\omega _r \in Y_i~:~{\rm{rank}}_f(\mathbf {x}_i,\omega _r)\le {\rm{rank}}_f(\mathbf {x}_i,\omega _q)\}|}{{\rm{rank}}_f(\mathbf {x}_i,\omega _q)}. \end{aligned}$$
(15)

Note that AvPrec\((f, \mathcal {S}) = 1\) means that the labels are perfectly ranked. For the other metrics, smaller values correspond to a better label ranking quality.
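The ranking-based measures of Eqs. (12)–(15) can be sketched from a score matrix. The function below is an assumed illustration (its name and signature are hypothetical); it presumes each \(Y_i\) is a nonempty proper subset of the \(Q\) labels, so that \(\overline{Y}_i\) is nonempty.

```python
import numpy as np

def ranking_metrics(scores, Y_true):
    """Ranking-based measures of Eqs. (12)-(15).

    scores: (m, Q) array with scores[i, q] = f(x_i, omega_q);
    Y_true: list of sets of relevant label indices.  Assumes each Y_i
    is a nonempty proper subset of the Q labels.
    """
    m, Q = scores.shape
    one_err = cov = rloss = avprec = 0.0
    for i, Y in enumerate(Y_true):
        f = scores[i]
        order = np.argsort(-f)                # best-scoring label first
        rank = np.empty(Q, dtype=int)
        rank[order] = np.arange(1, Q + 1)     # rank 1 = highest score
        one_err += order[0] not in Y          # Eq. (12)
        cov += max(rank[q] for q in Y) - 1    # Eq. (13)
        Ybar = set(range(Q)) - Y
        rloss += sum(f[q] <= f[r] for q in Y for r in Ybar) \
            / (len(Y) * len(Ybar))            # Eq. (14)
        avprec += sum(                        # Eq. (15)
            sum(rank[r] <= rank[q] for r in Y) / rank[q] for q in Y
        ) / len(Y)
    return {"one_error": one_err / m, "coverage": cov / m,
            "ranking_loss": rloss / m, "avg_precision": avprec / m}
```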

Appendix 2: Multi-labeled dataset statistics

Given a multi-labeled dataset \(\mathcal {D}=\{(\mathbf {x}_i,Y_i),i=1,\ldots ,n \}\) with \(\mathbf {x}_i \in \mathbb {X}\) and \(Y_i \subseteq \mathcal {Y}\), this dataset can be measured by the number of instances (\(n\)), the number of attributes in the input space, and the number of labels (\(Q\)). In the following, we review some statistics about the multi-labeled dataset \(\mathcal {D}\) [36].

Label cardinality: The label cardinality (LCard) of \(\mathcal {D}\) is the average number of labels per instance. It is calculated as:

$$\begin{aligned} {\rm{LCard}}(\mathcal {D})=\frac{1}{n}\sum _{i=1}^n |Y_i| \end{aligned}$$
(16)

Label density: The label density (LDen) of \(\mathcal {D}\) is defined as the average number of labels per instance divided by the total number of labels \(Q\). Label density is calculated as:

$$\begin{aligned} {\rm{LDen}}(\mathcal {D})=\frac{1}{n}\sum _{i=1}^n \frac{|Y_i|}{Q} \end{aligned}$$
(17)

Both metrics indicate the number of alternative labels that characterize the examples of a multi-labeled dataset. Label cardinality is independent of the total number of labels in the classification problem, while label density takes into consideration the total number of labels. Two datasets with the same label cardinality but with different label densities may present different properties that influence the performance of the multi-label classification methods.

Distinct label sets: The distinct label sets (DL) statistic counts the number of different label sets that appear among the examples. It is given by:

$$\begin{aligned} {\rm{DL}}(\mathcal {D})=|\{Y \subseteq \mathcal {Y}~|~\exists ~\mathbf {x}\in \mathbb {X}:(\mathbf {x},Y)\in \mathcal {D}\}| \end{aligned}$$
(18)

This measure gives an idea of the regularity of the labeling scheme.
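The three statistics above reduce to a few lines of code. A minimal sketch, with label sets represented as Python sets (the function name is illustrative):

```python
def dataset_stats(Y_sets, Q):
    """LCard, LDen and DL of Eqs. (16)-(18) for a list of label sets."""
    n = len(Y_sets)
    lcard = sum(len(Y) for Y in Y_sets) / n            # Eq. (16)
    lden = lcard / Q                                   # Eq. (17)
    dl = len({frozenset(Y) for Y in Y_sets})           # Eq. (18)
    return lcard, lden, dl
```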


Cite this article

Kanj, S., Abdallah, F., Denœux, T. et al. Editing training data for multi-label classification with the k-nearest neighbor rule. Pattern Anal Applic 19, 145–161 (2016). https://doi.org/10.1007/s10044-015-0452-8

