EPLS: A novel feature extraction method for migration data clustering

https://doi.org/10.1016/j.jpdc.2016.11.008

Highlights

  • A numerical feature extraction approach, EPLS, is proposed.

  • EPLS attempts to preserve the most valuable features, which are adaptive to different distance measures and different clustering approaches.

  • Because it reduces dimensionality, the EPLS-based clustering algorithm can scale to large volumes of data.

  • EPLS is efficient and well suited to migration data clustering.

Abstract

Human activity data such as migration data can now be accumulated easily by personal devices thanks to GPS. Analysis of migration data is very useful for societal decision-making. As nonlinear time series, migration data exhibit high levels of noise and many outliers. Traditional feature extraction methods cannot address this issue well because of their inherent characteristics. To address this problem, a novel numerical feature extraction approach, EPLS, is proposed. It integrates Ensemble Empirical Mode Decomposition (EEMD), Principal Component Analysis (PCA) and the Least Squares (LS) method. The EPLS model includes (1) Mode Decomposition, in which the EEMD algorithm is applied to the aggregated dataset; (2) Dimension Reduction, which is carried out to obtain a more significant set of vectors; and (3) Least Squares Projection, in which all testing data are projected onto the obtained vectors. Experimental results on migration data clustering show that EPLS can overcome the high noise and outliers. Meanwhile, the EPLS feature extraction method achieves high performance with several different clustering methods and distance measures.

Introduction

Human activity data have become easy to collect with personal devices in recent years. An increasing amount of human activity data is being accumulated to support societal decision-making [33]. Through the analysis of human activity data [16], researchers can find interesting patterns. For example, patterns mined from Twitter data can be useful for forecasting stock prices [33]. Air quality index data can be used to find PM2.5 patterns [5].

Migration data from the Baidu system are used to track human migration activities; for example, during the Chinese New Year of 2015, data on 3.6 billion passenger trips were recorded. Baidu Maps gathers a large migration activity dataset from smartphones that run Baidu Maps or other apps using its location-based platform. This dataset is related to human activity and reveals human movement patterns. Analysis of migration is important in many fields, such as economics, traffic, and culture [18].

Analyzing and applying these data is a challenge because the database is time related and has high levels of noise and nonlinearity. Such data are usually treated as chaotic time series, which naturally have high dimensionality and large data size [23], [25]. Clustering is commonly used for exploratory data analysis and acts as a major processing step for other tasks [21], [24], [28]. Time series clustering can be classified into three main branches: (1) whole time series clustering; (2) subsequence clustering; (3) time point clustering [12]. Whole time series clustering in turn falls into three categories: shape-based [23], [46], [27], feature-based [2], [19] and model-based [44], [35] approaches. This study considers feature-based clustering.

Traditional feature extraction methods usually have limitations in time series clustering [2], [19] because of the inherent properties of chaotic time series, such as nonlinearity, high levels of noise and outliers, and non-stationarity. A new representation method is therefore needed to address these problems. It has been suggested that not all clustering methods and similarity measures are appropriate for every time series database [41]. Euclidean Distance (ED) often leads to good clustering results [11]. However, ED is not a general-purpose measure; for some databases, elastic measures, including Dynamic Time Warping (DTW) and Edit Distance, can achieve higher performance [6]. Finding the distance measure or clustering method that gives the best result on a specific dataset is difficult. However, a relatively general feature extraction approach for a fixed database can deal with this problem to some extent.
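To make the contrast between a lock-step measure and an elastic measure concrete, the following is a minimal sketch and not the implementation used in the paper. The function names `euclidean` and `dtw`, the squared point-wise cost, and the two toy series are assumptions chosen for illustration.

```python
import math


def euclidean(a, b):
    """Lock-step Euclidean distance; both series must have equal length."""
    if len(a) != len(b):
        raise ValueError("Euclidean distance needs equal-length series")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw(a, b):
    """Classic dynamic time warping, O(len(a) * len(b)), squared point cost."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = cheapest alignment of the first i points of a
    # with the first j points of b.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (a[i - 1] - b[j - 1]) ** 2
            cost[i][j] = d + min(cost[i - 1][j],      # repeat b[j-1]
                                 cost[i][j - 1],      # repeat a[i-1]
                                 cost[i - 1][j - 1])  # advance both
    return math.sqrt(cost[n][m])


# Toy data (assumed): the same peak, shifted by one time step.
peak = [0, 0, 1, 2, 1, 0, 0, 0]
shifted = [0, 0, 0, 1, 2, 1, 0, 0]

print(euclidean(peak, shifted))  # 2.0
print(dtw(peak, shifted))        # 0.0
```

On this toy pair, ED penalizes the one-step shift while DTW absorbs it by warping the time axis, which is the kind of dataset-dependent behavior that makes choosing a single measure difficult.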

In this paper, EPLS, a hybrid model-based approach, is proposed to extract numerical features from migration data gathered from the Baidu map engine. EPLS attempts to preserve the most valuable features, which are adaptive to different distance measures and different clustering approaches. The EPLS approach is based on Ensemble Empirical Mode Decomposition (EEMD) [9], Principal Component Analysis (PCA) [32] and the Least Squares (LS) method [3], and transforms the original time series into a feature space. The proposed method is robust to datasets with high noise and outliers. Meanwhile, the features extracted by EPLS have relatively low dimension, which allows them to be used with different distance measures and clustering methods.
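As a rough illustration of how PCA and least squares can turn a time series into a low-dimensional feature vector, the sketch below covers only those two stages. It is not the authors' implementation: the EEMD step is not implemented, a hypothetical `modes` array of synthetic sinusoids stands in for the decomposed components, and the function names, the value of `k`, and the SVD-based PCA are all assumptions.

```python
import numpy as np


def principal_directions(modes, k):
    """Top-k principal directions (k x T) of the rows of `modes` (m x T)."""
    centered = modes - modes.mean(axis=0, keepdims=True)
    # Rows of vt are orthonormal, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:k]


def ls_features(series, basis):
    """Least-squares coefficients of each series (n x T) on the basis (k x T)."""
    coeffs, *_ = np.linalg.lstsq(basis.T, series.T, rcond=None)
    return coeffs.T  # shape (n, k): one k-dimensional feature per series


# Stand-in for the decomposition output (assumed, NOT produced by EEMD).
t = np.linspace(0.0, 1.0, 200, endpoint=False)
modes = np.vstack([
    np.sin(2 * np.pi * 2 * t),
    np.sin(2 * np.pi * 5 * t),
    0.05 * np.sin(2 * np.pi * 40 * t),
])

basis = principal_directions(modes, k=2)

# Four synthetic noisy series of length 200 (assumed data).
rng = np.random.default_rng(0)
series = rng.normal(size=(4, 2)) @ modes[:2] + 0.01 * rng.normal(size=(4, t.size))

features = ls_features(series, basis)
print(features.shape)  # (4, 2)
```

Each length-200 series is reduced to two coefficients, which can then be passed to any clustering method or distance measure. Because the basis rows are orthonormal here, the least-squares solution coincides with a plain dot-product projection; the explicit `lstsq` call is kept so the sketch still works if a non-orthogonal basis is substituted.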

This paper is organized as follows. Section 2 discusses related work. Section 3 presents the details of the EPLS algorithm, as well as EEMD, PCA and LS. Section 4 gives the experimental settings and a description of the database. In Section 5, the EPLS model-based approach is applied to migration data from the Baidu system. Section 6 summarizes the results and concludes the paper.

Section snippets

Background and related work

Data sequences that contain explicit timing information (e.g. PM2.5, stock, speech, population migration) can be regarded as time series. Large numbers of time series appear in almost every discipline [23], [26]. By applying clustering methods, interesting patterns and correlations can be found in the underlying data [8]. Time series analysis usually depends on the choice of techniques and distance measures; the target here is to find a general approach for Migration data from

Mode decomposition, component analysis and projection (EPLS)

The target of EPLS is to find a base vector for a time series database. Then, all the time series are mapped to the base vector, and a new set of time series is obtained. The features extracted from the time series serve as inputs for data mining algorithms. EPLS is a feature extraction procedure, outlined as follows. First, all time series in a database are pooled into an aggregation, and the EEMD algorithm is applied to this aggregation. Then, a dimension reduction

Experimental settings

The evaluation metrics and the details of the experimental database are described in this section. EPLS is applied to Migration data from Baidu search engine to reveal patterns. Accuracy, F-measure and Rand index are used as the performance evaluation criteria.
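For readers unfamiliar with these criteria, the sketch below shows one common pair-counting formulation of the Rand index and the F-measure. The snippet does not state the exact definitions used in the paper, so the formulas, the function names, and the toy labels are assumptions; clustering accuracy is omitted because it additionally requires matching cluster labels to classes.

```python
from itertools import combinations


def pair_counts(truth, pred):
    """Classify every pair of items by whether it shares a cluster in each labelling."""
    tp = fp = fn = tn = 0
    for i, j in combinations(range(len(truth)), 2):
        same_truth = truth[i] == truth[j]
        same_pred = pred[i] == pred[j]
        if same_truth and same_pred:
            tp += 1  # together in both
        elif same_pred:
            fp += 1  # together only in the prediction
        elif same_truth:
            fn += 1  # together only in the ground truth
        else:
            tn += 1  # apart in both
    return tp, fp, fn, tn


def rand_index(truth, pred):
    """Fraction of pairs on which the two labellings agree."""
    tp, fp, fn, tn = pair_counts(truth, pred)
    return (tp + tn) / (tp + fp + fn + tn)


def pairwise_f_measure(truth, pred):
    """Harmonic mean of pairwise precision and recall."""
    tp, fp, fn, _ = pair_counts(truth, pred)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels (assumed): six series, two true groups, imperfect clustering.
truth = [0, 0, 0, 1, 1, 1]
pred = [1, 1, 0, 0, 0, 0]

print(pair_counts(truth, pred))                   # (4, 3, 2, 6)
print(round(rand_index(truth, pred), 4))          # 0.6667
print(round(pairwise_f_measure(truth, pred), 4))  # 0.6154
```

Both scores depend only on which items are grouped together, so they are unaffected by how the cluster labels happen to be numbered.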

Experiment results

The clustering results are presented in this section. The EPLS model-based approach is applied to Migration data from Baidu. We compare the results from four aspects: (1) the quantification of noise and outliers, (2) the adaptation to different clustering methods, (3) different distance measures for the database and (4) time–frequency patterns. The results show that EPLS is more effective than traditional frequency-domain and time-domain clustering methods. In addition, EPLS performs well when the database is

Conclusion and future work

The study of feature extraction for Migration data is significant for urban planning and population research. In this paper, a novel numerical feature extraction method, EPLS, is proposed to address this problem. A series of experiments has been carried out to verify that the proposed EPLS is valuable. It also has social significance; for example, analysis of Baidu Migration data can provide useful suggestions for government decision-making during the Chinese New Year season.

Firstly, the noise level and outlier

References (47)

  • C. Faloutsos et al.

    Fast subsequence matching in time-series databases

    ACM SIGMOD Rec.

    (1994)
  • M. Halkidi et al.

    On clustering validation techniques

    J. Intell. Inf. Syst.

    (2001)
  • N.E. Huang et al.

    The empirical mode decomposition and the Hilbert spectrum for nonlinear and non-stationary time series analysis

    Proc. R. Soc. Lond. Ser. A Math. Phys. Eng. Sci.

    (1998)
  • L. Kaufman et al.

    Finding Groups in Data: An Introduction to Cluster Analysis, Vol. 344

    (2009)
  • E. Keogh et al.

    On the need for time series data mining benchmarks: a survey and empirical demonstration

    Data Min. Knowl. Discov.

    (2003)
  • E. Keogh et al.

    Clustering of time-series subsequences is meaningless: implications for previous and future research

    Knowl. Inf. Syst.

    (2005)
  • E. Keogh et al.

    A simple dimensionality reduction technique for fast similarity search in large time series databases

    Knowl. Inf. Syst.

    (2000)
  • T. Köhler et al.

    A Comparison of Denoising Methods for One Dimensional Time Series

    (2005)
  • M. Kumar, N.R. Patel, J. Woo, Clustering seasonality patterns in the presence of errors, in: Proceedings of KDD 02,...
  • Y. Li et al.

    Privacy protection for preventing data over-collection in smart city

    IEEE Trans. Comput.

    (2016)
  • J. Lin, D. Etter, D. DeBarr, Exact and approximate reverse nearest neighbor search for multimedia data, in:...
  • B. Minor, J.R. Doppa, D.J. Cook, Toward Learning and Mining from Uncertain Time-Series Data for Activity...
  • U. Mori, A. Mendiburu, J. Lozano, Similarity Measure Selection for Clustering Time Series...
    Yunliang Chen received the B.Sc. and M.Eng. degrees from China University of Geosciences, and the Ph.D. degree from Huazhong University of Science and Technology, China. He is currently an associate professor with the School of Computer Science, China University of Geosciences, Wuhan, China.

    Fangyuan Li received the B.Sc. degree from China University of Geosciences. Currently, she is a graduate student with the School of Computer Science, China University of Geosciences, Wuhan, China.

    Jia Chen received the B.Sc. degree from China University of Geosciences. He is currently a postgraduate with the School of Computer Science, China University of Geosciences, Wuhan, China. His research interests include data mining and high performance computing.

    Bo Du received the B.Sc. degree from China University of Geosciences. He is currently a postgraduate with the School of Computer Science, China University of Geosciences, Wuhan, China. His research interests include data mining and high performance computing.

    Kim-Kwang Raymond Choo is with the Department of Information Systems and Cyber Security, University of Texas at San Antonio, USA, and the School of Information Technology and Mathematical Sciences, University of South Australia, Australia.

    Houcine Hassan is with the Department of Systems Data Processing and Computers Organization, Polytechnic University of Valencia, Spain.
