EPLS: A novel feature extraction method for migration data clustering

doi:10.1016/j.jpdc.2016.11.008

Journal of Parallel and Distributed Computing

Volume 103, May 2017, Pages 96-103

https://doi.org/10.1016/j.jpdc.2016.11.008

Highlights

• A numerical feature extraction approach, EPLS, is proposed.
• EPLS attempts to preserve the most valuable features, which adapt to different distance measures and different clustering approaches.
• Because it reduces dimensionality, an EPLS-based clustering algorithm can scale to large volumes of data.
• EPLS is well suited to migration data clustering.

Abstract

Human activity data such as migration data can now be easily accumulated by personal devices thanks to GPS. Analysis of migration data is very useful for societal decision-making. As nonlinear time series, migration data exhibit high levels of noise and outliers. Traditional feature extraction methods cannot address this issue well because of their inherent characteristics. To address this problem, a novel numerical feature extraction approach, EPLS, is proposed. It integrates Ensemble Empirical Mode Decomposition (EEMD), Principal Component Analysis (PCA) and the Least Squares (LS) method. The EPLS model includes (1) Mode Decomposition, in which the EEMD algorithm is applied to the aggregated dataset; (2) Dimension Reduction, which is carried out to obtain a more significant set of vectors; and (3) Least Squares Projection, in which all testing data are projected onto the obtained vectors. Experimental results on migration data clustering show that EPLS can overcome high noise and outliers. Moreover, the EPLS feature extraction method achieves high performance across several different clustering methods and distance measures.

Introduction

Human activity data have become easy to collect with personal devices in recent years. An increasing amount of human activity data is being accumulated to support societal decisions [33]. Through analysis of human activity data [16], researchers can find interesting patterns. For example, patterns mined from Twitter data are useful for forecasting stock prices [33], and the air quality index can be used to find PM2.5 patterns [5].

Migration data from the Baidu system are used to track human migration activities; for example, during the Chinese New Year of 2015, data on 3.6 billion passenger trips were recorded. Baidu Maps gathers a large migration activity dataset from smartphones that run Baidu Maps or other apps using its location-based platform. This dataset is related to human activity and reveals human movement patterns. Migration analysis is important in many fields, such as the economy, traffic, and culture [18].

Analyzing and applying these data is a challenge because the database is time related and has high levels of noise and nonlinearity. Such data are usually treated as chaotic time series, which naturally have high dimensionality and large data size [23], [25]. Clustering is usually used for exploratory data analysis and acts as a major processing step for other tasks [21], [24], [28]. Time series clustering can be classified into three main branches: (1) whole time series clustering; (2) subsequence clustering; (3) time point clustering [12]. Whole time series clustering in turn falls into three categories, namely shape-based [23], [46], [27], feature-based [2], [19] and model-based [44], [35] approaches. This study considers feature-based clustering.

Traditional feature extraction methods are usually limited in dealing with time series clustering [2], [19] because of the nature of chaotic time series: they are nonlinear, non-stationary, and have high levels of noise and outliers. A new representation method is therefore needed to address these problems. It has been suggested that not every clustering method and similarity measure is appropriate for every time series database [41]. Euclidean Distance (ED) usually leads to good clustering results [11], but it is not a general measure; for some databases, elastic measures such as Dynamic Time Warping (DTW) and Edit Distance achieve higher performance [6]. Finding the distance measure or clustering method that gives the best result on a specific dataset is difficult. However, a relatively general feature extraction approach for a fixed database can deal with this problem to some extent.
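As a minimal illustration of why an elastic measure can outperform ED on some databases, the sketch below compares plain Euclidean distance with a textbook dynamic-programming form of DTW on two toy sequences that share a shape but are shifted by one time step. The function names and the toy data are illustrative assumptions and are not taken from the paper.

```python
import math

def euclidean(a, b):
    """Lock-step Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dtw(a, b):
    """Classic O(n*m) dynamic time warping with absolute-difference cost."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = best cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # a advances alone
                                 cost[i][j - 1],      # b advances alone
                                 cost[i - 1][j - 1])  # both advance
    return cost[n][m]

# Toy data (hypothetical): the same bump, shifted by one time step.
a = [0, 0, 1, 2, 1, 0, 0]
b = [0, 1, 2, 1, 0, 0, 0]

print(euclidean(a, b))  # 2.0: the shift is penalised point by point
print(dtw(a, b))        # 0.0: warping absorbs the shift
```

ED treats the shift as a genuine difference, whereas DTW aligns the two bumps. Which behaviour is preferable depends on whether timing offsets are meaningful for the data at hand.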

In this paper, EPLS, a hybrid model-based approach, is proposed to extract numerical features from migration data gathered from the Baidu Maps engine. EPLS attempts to preserve the most valuable features, which adapt to different distance measures and different clustering approaches. EPLS is based on Ensemble Empirical Mode Decomposition (EEMD) [9], Principal Component Analysis (PCA) [32] and the Least Squares (LS) method [3], and transfers the original time series into a feature space. The proposed method is robust to datasets with high noise and outliers. Meanwhile, the features extracted by EPLS have relatively low dimension, which allows them to be used with different distance measures and clustering methods.
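The three stages named in the abstract (mode decomposition of the pooled data, dimension reduction, least-squares projection) can be sketched structurally as follows. This is not the authors' implementation: a simple moving-average multiscale split stands in for EEMD, pooling is assumed to be a pointwise mean across series, and all function names, window sizes and the synthetic data are hypothetical.

```python
import numpy as np

def multiscale_components(signal, windows=(3, 7, 15)):
    """Stand-in for EEMD (assumption): peel off detail layers with moving
    averages of increasing (odd) width; the last row is the residual trend."""
    components, current = [], np.asarray(signal, dtype=float)
    for w in windows:
        padded = np.pad(current, w // 2, mode="edge")
        smooth = np.convolve(padded, np.ones(w) / w, mode="valid")
        components.append(current - smooth)  # detail at this scale
        current = smooth
    components.append(current)               # residual trend
    return np.vstack(components)

def pca_basis(components, k):
    """Top-k principal directions (orthonormal rows) of the component set."""
    centered = components - components.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:k]

def ls_features(series, basis):
    """Least-squares coefficients of each series on the basis vectors."""
    coeffs, *_ = np.linalg.lstsq(basis.T, series.T, rcond=None)
    return coeffs.T

def epls_like_features(series, k=2):
    aggregate = series.mean(axis=0)  # assumption: pool by averaging
    components = multiscale_components(aggregate)
    basis = pca_basis(components, k)
    return ls_features(series, basis)

# Synthetic data (hypothetical): two noisy groups of five series each.
rng = np.random.default_rng(0)
t = np.linspace(0, 4 * np.pi, 120)
series = np.vstack(
    [np.sin(t) + 0.3 * rng.standard_normal(t.size) for _ in range(5)]
    + [np.cos(2 * t) + 0.3 * rng.standard_normal(t.size) for _ in range(5)]
)
features = epls_like_features(series, k=2)
print(features.shape)  # (10, 2)
```

Each 120-point series is reduced to two coefficients, which can then be passed to any clustering method or distance measure. A real EEMD implementation would replace `multiscale_components`.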

This paper is organized as follows. Section 2 discusses related work. Section 3 presents the details of the EPLS algorithm, as well as EEMD, PCA and LS. Section 4 gives the experimental settings and a description of the database. In Section 5, the EPLS model-based approach is applied to Migration data from the Baidu system. Section 6 summarizes the results and concludes the paper.

Section snippets

Background and related work

Data sequences that contain explicit timing information (e.g. ${PM}_{2.5}$, stock prices, speech, population migration) can be regarded as time series. Large numbers of time series appear in almost every discipline [23], [26]. By applying clustering methods, interesting patterns and correlations can be found in the underlying data [8]. Time series analysis usually depends on the choice of techniques and distance measures; the target here is to find a general approach for Migration data from

Mode decomposition, component analysis and projection (EPLS)

The goal of EPLS is to find a base vector for a time series database. All the time series are then mapped onto the base vector, and a new set of time series is obtained. The features extracted from the time series serve as inputs for data mining algorithms. EPLS is a feature extraction procedure, outlined below. First, all time series in a database are pooled into an aggregation, and the EEMD algorithm is applied to this aggregation. Then, a dimension reduction

Experimental settings

This section presents the evaluation metrics and the details of the experimental database. EPLS is applied to Migration data from the Baidu search engine to reveal patterns. Performance is evaluated by $accuracy$, $F_{measure}$ and $RandIndex$.
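The snippet does not give the paper's exact definitions of these criteria. As a hedged illustration, the sketch below uses the common pair-counting definitions of the Rand index and F-measure for comparing a clustering against reference labels; the function names and the toy labels are assumptions.

```python
from itertools import combinations

def pair_counts(truth, pred):
    """Count item pairs by whether they share a label in truth and in pred."""
    tp = fp = fn = tn = 0
    for i, j in combinations(range(len(truth)), 2):
        same_truth = truth[i] == truth[j]
        same_pred = pred[i] == pred[j]
        if same_truth and same_pred:
            tp += 1   # together in both
        elif same_pred:
            fp += 1   # together only in the clustering
        elif same_truth:
            fn += 1   # together only in the reference
        else:
            tn += 1   # apart in both
    return tp, fp, fn, tn

def rand_index(truth, pred):
    tp, fp, fn, tn = pair_counts(truth, pred)
    return (tp + tn) / (tp + fp + fn + tn)

def pairwise_f_measure(truth, pred):
    tp, fp, fn, _ = pair_counts(truth, pred)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy labels (hypothetical): six items, two reference classes.
truth = [0, 0, 0, 1, 1, 1]
pred = [1, 1, 0, 0, 0, 0]
print(rand_index(truth, pred))          # 10 of 15 pairs agree
print(pairwise_f_measure(truth, pred))  # 8/13
```

Both measures compare pairs of items rather than label names, so they do not change when cluster labels are permuted.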

Experiment results

The clustering results are presented in this section. The EPLS model-based approach is applied to Migration data from Baidu. We compare the results from four aspects: (1) the quantification of noise and outliers, (2) the adaptation to different clustering methods, (3) different distance measures for the database and (4) time–frequency patterns. The results show that EPLS is more effective than traditional frequency- and time-domain-based clustering methods. In addition, EPLS performs well when the database is

Conclusion and future work

The study of feature extraction for Migration data is significant for urban planning and population research. In this paper, a novel numerical feature extraction method, EPLS, is proposed to address this problem. A series of experiments has been carried out to verify that the proposed EPLS is valuable. It also has social significance; for example, the analysis of Baidu Migration data can provide useful suggestions for government decisions during the Chinese New Year season.

Firstly, the noise level and outlier

Yunliang Chen received the B.Sc. and M.Eng. degrees from China University of Geosciences, and the Ph.D. degree from Huazhong University of Science and Technology, China. He is currently an associate professor with the School of Computer Science, China University of Geosciences, Wuhan, China.

References (47)

W.G. Cobourn, An enhanced PM 2.5 air quality forecast model based on nonlinear regression and back-trajectory concentrations, Atmos. Environ. (2010)
H. Deng et al., A time series forest for classification and feature extraction, Inform. Sci. (2013)
F. Petitjean et al., A global averaging method for dynamic time warping, with applications to clustering, Pattern Recognit. (2011)
Lizhe Wang et al., Particle Swarm Optimization based dictionary learning for remote sensing big data, Knowl.-Based Syst. (2015)
Lizhe Wang et al., Software tools and techniques for big data computing in healthcare clouds, Future Gener. Comput. Syst. (2015)
T. Warren Liao, Clustering of time series data: a survey, Pattern Recognit. (2005)
S. Aghabozorgi et al., Clustering of large time series datasets, Intell. Data Anal. (2014)
D. Barrack, J. Goulding, K. Hopcraft, et al. AMP: a new time-frequency feature extraction method for intermittent...
A. Charnes et al., The equivalence of generalized least squares and maximum likelihood estimates in the exponential family, J. Amer. Statist. Assoc. (1976)
L. Chen et al., Robust and fast similarity search for moving object trajectories
C. Faloutsos et al., Fast subsequence matching in time-series databases, ACM SIGMOD Rec. (1994)
M. Halkidi et al., On clustering validation techniques, J. Intell. Inf. Syst. (2001)
N.E. Huang et al., The empirical mode decomposition and the Hilbert spectrum for nonlinear and non-stationary time series analysis, Proc. R. Soc. Lond. Ser. A Math. Phys. Eng. Sci. (1998)
L. Kaufman et al., Finding Groups in Data: An Introduction to Cluster Analysis, Vol. 344 (2009)
E. Keogh et al., On the need for time series data mining benchmarks: a survey and empirical demonstration, Data Min. Knowl. Discov. (2003)
E. Keogh et al., Clustering of time-series subsequences is meaningless: implications for previous and future research, Knowl. Inf. Syst. (2005)
E. Keogh et al., A simple dimensionality reduction technique for fast similarity search in large time series databases, Knowl. Inf. Syst. (2000)
T. Köhler et al., A Comparison of Denoising Methods for One Dimensional Time Series (2005)
M. Kumar, N.R. Patel, J. Woo, Clustering seasonality patterns in the presence of errors, in: Proceedings of KDD 02,...
Y. Li et al., Privacy protection for preventing data over-collection in smart city, IEEE Trans. Comput. (2016)
J. Lin, D. Etter, D. DeBarr, Exact and approximate reverse nearest neighbor search for multimedia data, in:...
B. Minor, J.R. Doppa, D.J. Cook, Toward Learning and Mining from Uncertain Time-Series Data for Activity...
U. Mori, A. Mendiburu, J. Lozano, Similarity Measure Selection for Clustering Time Series...


Fangyuan Li received the B.Sc. degree from China University of Geosciences. Currently, she is a graduate student with the School of Computer Science, China University of Geosciences, Wuhan, China.

Jia Chen received the B.Sc. degree from China University of Geosciences. He is currently a postgraduate with the School of Computer Science, China University of Geosciences, Wuhan, China. His research interests include data mining and high performance computing.

Bo Du received the B.Sc. degree from China University of Geosciences. He is currently a postgraduate with the School of Computer Science, China University of Geosciences, Wuhan, China. His research interests include data mining and high performance computing.

Kim-Kwang Raymond Choo is with the Department of Information Systems and Cyber Security, University of Texas at San Antonio, USA, and the School of Information Technology and Mathematical Sciences, University of South Australia, Australia.

Houcine Hassan is with the Department of Systems Data Processing and Computers Organization, Polytechnic University of Valencia, Spain.
