
Fast Maximum Entropy Machine for Big Imbalanced Datasets

  • Research paper
  • Published in: Journal of Communications and Information Networks

Abstract

Driven by the needs of a plethora of machine learning applications, several attempts have been made to improve the performance of classifiers applied to imbalanced datasets. In this paper, we present a fast maximum entropy machine (MEM) combined with a synthetic minority over-sampling technique (SMOTE) for handling binary classification problems with high imbalance ratios, large numbers of data samples, and medium-to-large numbers of features. A random Fourier feature representation of kernel functions and the primal estimated sub-gradient solver for support vector machines (PEGASOS) are applied to speed up the classic MEM. Experiments have been conducted on various real datasets (including two China Mobile datasets and several other standard test datasets) under various configurations. The results demonstrate that the proposed algorithm has extremely low computational complexity yet excellent overall classification performance (in terms of several widely used evaluation metrics) compared with the classic MEM and other state-of-the-art methods, which makes it particularly valuable in big data applications.
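The pipeline the abstract describes, rebalancing the classes with SMOTE, mapping inputs through a random Fourier feature approximation of a kernel, and then running PEGASOS-style stochastic sub-gradient steps on the resulting linear problem, can be sketched as follows. This is a minimal illustration only: a plain hinge loss stands in for the MEM objective, and the toy data, function names, and parameter values are hypothetical, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)


def smote(X_min, n_synth, k=5):
    """Create synthetic minority samples by interpolating between a
    minority point and one of its k nearest minority neighbors."""
    synth = np.empty((n_synth, X_min.shape[1]))
    for s in range(n_synth):
        i = rng.integers(len(X_min))
        dist = np.linalg.norm(X_min - X_min[i], axis=1)
        neighbors = np.argsort(dist)[1:k + 1]  # skip the point itself
        j = rng.choice(neighbors)
        synth[s] = X_min[i] + rng.uniform() * (X_min[j] - X_min[i])
    return synth


def random_fourier_features(X, n_feat, gamma):
    """Rahimi-Recht random feature map approximating the RBF kernel
    k(x, y) = exp(-gamma * ||x - y||^2)."""
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(X.shape[1], n_feat))
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_feat)
    return np.sqrt(2.0 / n_feat) * np.cos(X @ W + b)


def pegasos_train(Z, y, lam=0.01, n_iter=5000):
    """PEGASOS: stochastic sub-gradient descent on the primal
    hinge-loss objective; labels y must be in {-1, +1}."""
    w = np.zeros(Z.shape[1])
    for t in range(1, n_iter + 1):
        i = rng.integers(len(y))
        eta = 1.0 / (lam * t)
        if y[i] * (Z[i] @ w) < 1.0:  # margin violated: shrink and step
            w = (1.0 - eta * lam) * w + eta * y[i] * Z[i]
        else:                        # margin satisfied: shrink only
            w = (1.0 - eta * lam) * w
    return w


# Hypothetical toy data: 200 majority vs. 20 minority samples (10:1 ratio).
X_maj = rng.normal(loc=[2.0, 2.0], scale=0.5, size=(200, 2))
X_min = rng.normal(loc=[-2.0, -2.0], scale=0.5, size=(20, 2))
X_synth = smote(X_min, n_synth=180)  # rebalance to 200 vs. 200
X = np.vstack([X_maj, X_min, X_synth])
y = np.concatenate([np.ones(200), -np.ones(200)])

Z = random_fourier_features(X, n_feat=100, gamma=0.5)
w = pegasos_train(Z, y)
acc = np.mean(np.sign(Z @ w) == y)   # training accuracy on the toy set
```

Note that in a real train/test setting the same random projection (`W`, `b`) must be reused for new data, so the draw would be stored rather than regenerated per call; the per-iteration cost of the solver is linear in the number of random features, which is the source of the speedup over working with the full kernel matrix.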



Author information


Corresponding author

Correspondence to Feng Yin.

Additional information

We would like to thank China Mobile Information Technology Company Limited (Shenzhen) for providing the real datasets. We would also like to thank Teng Li, Fengli Yu, and Wei Yu for their valuable discussions. The author Feng Yin was funded by the Shenzhen Science and Technology Innovation Council (No. JCYJ20170307155957688) and by the National Natural Science Foundation of China (Key Project No. 61731018). The authors Feng Yin and Shuguang (Robert) Cui were funded by the Shenzhen Fundamental Research Funds under Grant (Key Lab) No. ZDSYS201707251409055 and Grant (Peacock) No. KQTD2015033114415450, and by the Guangdong Province “The Pearl River Talent Recruitment Program Innovative and Entrepreneurial Teams in 2017”—Data Driven Evolution of Future Intelligent Network Team. The associate editor coordinating the review of this paper and approving it for publication was X. Cheng.

Feng Yin [corresponding author] received his B.Sc. degree from Shanghai Jiao Tong University, Shanghai, China, and his M.Sc. and Dr.-Ing. degrees from Technische Universitaet Darmstadt, Germany, in 2008, 2011, and 2014, respectively. From 2014 to 2016, he was with Ericsson Research, Linköping, Sweden, working mainly on the European Union FP7 Marie Curie Training Programme on Tracking in Complex Sensor Systems (TRAX). Since 2016, he has been with the Chinese University of Hong Kong, Shenzhen, China, and the Shenzhen Research Institute of Big Data. His research interests include statistical signal processing, machine learning, and sensory data fusion with applications to wireless positioning and tracking. In 2013, he received the Chinese Government Award for Outstanding Self-Financed Students Abroad, and in 2014 he received the Marie Curie scholarship from the European Union.

Shuqing Lin received her B.S. degree in electronic engineering from Xiamen University, Xiamen, China, in 2016, and has since been with the School of Information Science and Engineering, Xiamen University, majoring in electronics and communications engineering. Her current research interests include machine learning and neural networks.

Chuxin Piao majored in statistics and has been studying at the Chinese University of Hong Kong, Shenzhen, China, since 2015. Her current research interests include machine learning and practical applications of data analysis.

Shuguang (Robert) Cui (S’99-M’05-SM’12-F’14) received his Ph.D. degree in electrical engineering from Stanford University, California, USA, in 2005. He subsequently worked as an Assistant, Associate, Full, and Chair Professor in electrical and computer engineering at the University of Arizona, Texas A&M University, and UC Davis, respectively. He is currently a Chair Professor at CUHK Shenzhen and the Vice Director of the Shenzhen Research Institute of Big Data. His current research interests focus on data-driven large-scale system control and resource management, large dataset analysis, IoT system design, energy-harvesting-based communication system design, and cognitive network optimization. He was selected as a Thomson Reuters Highly Cited Researcher and listed among the World’s Most Influential Scientific Minds by ScienceWatch in 2014, and he received the IEEE Signal Processing Society 2012 Best Paper Award. He has served as the general co-chair and TPC co-chair for many IEEE conferences, as the area editor for IEEE Signal Processing Magazine, and as an associate editor for IEEE Transactions on Big Data, IEEE Transactions on Signal Processing, IEEE JSAC Series on Green Communications and Networking, and IEEE Transactions on Wireless Communications. He was an elected member of the IEEE Signal Processing Society SPCOM Technical Committee (2009-2014) and the elected Chair of the IEEE ComSoc Wireless Technical Committee (2017-2018). He is a member of the Steering Committee for IEEE Transactions on Big Data and the Chair of the Steering Committee for IEEE Transactions on Cognitive Communications and Networking, and he was a member of the IEEE ComSoc Emerging Technology Committee. He was elected an IEEE Fellow in 2013 and an IEEE ComSoc Distinguished Lecturer in 2014. He received the Amazon AWS Machine Learning Award in 2018.


About this article


Cite this article

Yin, F., Lin, S., Piao, C. et al. Fast Maximum Entropy Machine for Big Imbalanced Datasets. J. Commun. Inf. Netw. 3, 20–30 (2018). https://doi.org/10.1007/s41650-018-0026-1
