A Comparison of Different Data Transformation Approaches in the Feature Ranking Context

Petković, Matej; Panov, Panče; Džeroski, Sašo

doi:10.1007/978-3-319-46307-0_20

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9956)

Included in the following conference series:

International Conference on Discovery Science

Abstract

Due to the omnipresence of high-dimensional datasets, feature selection and ranking are very important steps in data preprocessing. In this work, we propose three transformations for real-valued features. The transformations are based on estimating the probability densities of the features. Originally, we propose modified distance measures for the ReliefF algorithm, which is one of the most prominent feature ranking algorithms. To enable their comparison with other feature ranking algorithms, we present data transformations that are mathematically equivalent to the modified distance measures. Finally, we evaluate the proposed transformations in combination with several feature ranking methods on a set of benchmark datasets.
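
The abstract does not give the formulas, so the following is only a minimal sketch of how a distance measure and a data transformation can be equivalent. It assumes, purely for illustration, that the modified distance between two feature values is the estimated probability mass lying between them, and it uses the empirical distribution function as the density-based estimate. The names `empirical_cdf_transform` and `mass_between` are hypothetical and are not taken from the paper.

```python
import numpy as np


def empirical_cdf_transform(values):
    """Map each value to the fraction of sample values that are <= it."""
    values = np.asarray(values, dtype=float)
    sorted_vals = np.sort(values)
    # side="right" counts how many sample values are <= each query value.
    counts = np.searchsorted(sorted_vals, values, side="right")
    return counts / len(values)


def mass_between(x, y, sample):
    """Assumed modified distance: estimated probability mass between x and y."""
    sorted_vals = np.sort(np.asarray(sample, dtype=float))
    n = len(sorted_vals)
    fx = np.searchsorted(sorted_vals, x, side="right") / n
    fy = np.searchsorted(sorted_vals, y, side="right") / n
    return abs(fx - fy)


# Toy feature with one extreme value (illustrative data, not from the paper).
feature = np.array([1.0, 2.0, 2.5, 3.0, 100.0])
transformed = empirical_cdf_transform(feature)

# Modified distance on the raw values ...
d_modified = mass_between(feature[3], feature[4], feature)
# ... equals the plain absolute difference on the transformed values.
d_plain = abs(transformed[3] - transformed[4])
```

Under this assumption, any ranking method that relies on the plain absolute difference can be run on the transformed feature and will see the same distances as ReliefF with the modified measure, which is the kind of equivalence the abstract describes.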

Notes

1. Due to its implementation in Weka, the SVM-RFE algorithm could not be applied to datasets with non-binary nominal features, hence the results of SVM-RFE are based on 19 (and not 28) small datasets. From now on, we refer to SVM-RFE as SVM.

Acknowledgements

We would like to acknowledge the support of the EC through the projects: MAESTRA (FP7-ICT-612944) and HBP (FP7-ICT-604102), and the Slovenian Research Agency through a young researcher grant and the program Knowledge Technologies (P2-0103).

Author information

Authors and Affiliations

Jožef Stefan Institute, Jamova Cesta 39, Ljubljana, Slovenia
Matej Petković, Panče Panov & Sašo Džeroski
Jožef Stefan International Postgraduate School, Jamova Cesta 39, Ljubljana, Slovenia
Matej Petković & Sašo Džeroski
CIPKeBiP, Jamova Cesta 39, Ljubljana, Slovenia
Sašo Džeroski

Corresponding author

Correspondence to Matej Petković.

Editor information

Editors and Affiliations

Universiteit Antwerpen, Campus Middelhe, M.G.103a, Antwerp, Belgium
Toon Calders
Università degli Studi di Bari Aldo Moro, Bari, Italy
Michelangelo Ceci
Bari, Italy
Donato Malerba

About this paper

Cite this paper

Petković, M., Panov, P., Džeroski, S. (2016). A Comparison of Different Data Transformation Approaches in the Feature Ranking Context. In: Calders, T., Ceci, M., Malerba, D. (eds) Discovery Science. DS 2016. Lecture Notes in Computer Science, vol 9956. Springer, Cham. https://doi.org/10.1007/978-3-319-46307-0_20

DOI: https://doi.org/10.1007/978-3-319-46307-0_20
Published: 21 September 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-46306-3
Online ISBN: 978-3-319-46307-0
eBook Packages: Computer Science, Computer Science (R0)
