Elsevier

Neurocomputing

Volume 490, 14 June 2022, Pages 229-245

A multi-variate time series clustering approach based on intermediate fusion: A case study in air pollution data imputation

https://doi.org/10.1016/j.neucom.2021.09.079

Abstract

Multivariate Time Series (MVTS) clustering is an essential task, especially for large and complex datasets, but it has received limited attention in the literature. We are motivated by a real-world problem: the need to cluster air pollution data to produce plausible imputations for missing measurements of some pollutants. Our main focus is on UK air quality assessment; the study uses data collected from automatic monitoring stations over a four-year period (2015–2018).

In this work, we propose an MVTS clustering method followed by imputation methods for the whole Time Series (TS). We compare two approaches to cluster the stations: univariate TS clustering using Shape-Based Distance (SBD) for individual pollutants, and MVTS clustering using the fused similarity that combines the SBD for all the pollutants. We run a k-means algorithm to produce clusters with each approach on the same dataset.

Our analysis shows that MVTS clustering produces the best clusters, as measured by various quality indexes, and that imputations based on these clusters reduce the average error between imputed and real values, as measured by the Root Mean Squared Error (RMSE) and its standard deviation (Std).

Introduction

A Time Series (TS) is a sequence of observations that a variable takes over time, such as (t1, v1), …, (ti, vi), …, (tm, vm), where ti is the time step and vi is the observation at that step. The order of the observations matters because each value is tied to its time step. When several variables are observed and recorded simultaneously, the result is a Multivariate Time Series (MVTS).

A large variety of real-world applications use time series analysis, such as weather forecasting [1], earthquake prediction [2] or human activity recognition [3]. MVTS are becoming more prominent, especially as part of the large and complex datasets now being produced [4]. In this study, we are motivated by the need to develop modelling techniques for multivariate time series data, of which air pollution is an example. Our focus in this paper is on problems related to air pollution in the UK, especially the uncertainty in air quality assessment that results from missing data. Because air pollution data take the form of MVTS, our proposed approach is based on MVTS clustering and imputation.

Air pollution is one of the main risks to human health in several parts of the world. Sources of air pollution are varied and include anthropogenic sources such as combustion (e.g. in power plants, motor vehicles and residential heating), agriculture and industry, as well as natural sources such as vegetation, soils, and lightning [5].

In the UK, the four main pollutants used to assess the quality of the air are ozone (O3), nitrogen dioxide (NO2), and particulate matter less than 2.5 μm in diameter (PM2.5) and less than 10 μm in diameter (PM10), so we focus on those four. These pollutants are measured at various monitoring stations, and the measured concentrations of each pollutant form a time series (TS) that requires further transformation and analysis to produce air quality assessments.

One of the available resources to assess air quality in the UK is the Air Pollutants Monitoring Network. The network contains air pollution monitoring stations that record air pollutant concentrations. There are 285 air quality monitoring sites across the UK, which are part of several types of networks with different objectives and coverage. Our focus is on the automatic monitoring network called the Automatic Urban and Rural Network (AURN). The instruments used in this network are automated and produce hourly pollutant concentrations. These data are collected and stored, then made directly available via the Internet [6]. Stations in this network are categorised by their environmental type into one of the following: rural, urban, suburban background, roadside, or industrial. The network comprises 169 stations, and their geographical distribution is shown in Fig. 1.

In the UK, air quality is quantified using the Daily Air Quality Index (DAQI), which is calculated using the concentrations of NO2, O3, PM2.5, and PM10. This index is numbered from 1 to 10 and divided into four bands: ‘low’ (1–3), ‘moderate’ (4–6), ‘high’ (7–9) and ‘very high’ (10). An index value is initially assigned to each pollutant depending on its measured concentration. The DAQI is then taken to be the maximum value assigned to any of the pollutants. Periods of poor air quality can be identified using this index. Air quality is negatively correlated with the DAQI, meaning that a higher DAQI represents worse air quality (for more details see [6]).
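The aggregation rule described above (one index per pollutant, then the maximum) can be sketched as follows. This is a minimal illustration, not the official DAQI implementation: the concentration breakpoints that map a measurement to a per-pollutant index are not given in this text, so the per-pollutant index values are taken as inputs, and the function names `daqi` and `daqi_band` are hypothetical.

```python
from typing import Dict, Optional


def daqi_band(index: int) -> str:
    """Map a DAQI value (1-10) to its band, using the band limits quoted in the text."""
    if not 1 <= index <= 10:
        raise ValueError("DAQI values range from 1 to 10")
    if index <= 3:
        return "low"
    if index <= 6:
        return "moderate"
    if index <= 9:
        return "high"
    return "very high"


def daqi(pollutant_indices: Dict[str, Optional[int]]) -> int:
    """Return the overall DAQI as the maximum per-pollutant index.

    Pollutants that are not measured are passed as None and skipped,
    so the result reflects measured pollutants only.
    """
    measured = [v for v in pollutant_indices.values() if v is not None]
    if not measured:
        raise ValueError("no measured pollutants")
    return max(measured)


# Hypothetical per-pollutant index values for one station and day.
example = {"NO2": 2, "O3": 4, "PM2.5": None, "PM10": 3}
overall = daqi(example)
print(overall, daqi_band(overall))  # prints "4 moderate"
```

Because unmeasured pollutants are skipped, the returned value can only be a lower bound on the index that a fully instrumented station would report.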

The challenges associated with analysing air pollution data (i.e. the pollutant TS) are as follows. Not all the stations report all the pollutants, and even if a station does, it may not measure a particular pollutant all the time due to instrument down-time. Together, this results in high levels of missing data. Current air quality assessments are therefore subject to high levels of uncertainty, because the DAQI is calculated from the concentrations of measured pollutants only and so may not reflect the actual air pollution. This may lead to incorrect policy decisions, with further negative environmental and health consequences [7].

What makes air pollution data analysis more complex is that pollutants have different behaviours and seasonal variations. In addition, pollutants can be emitted from various sources and be involved in different chemical reactions, so their concentrations exhibit different temporal and spatial distributions.

Particulate matter (PM) has many sources, both primary (emitted directly into the atmosphere) and secondary (produced in the atmosphere via chemical and physical processes). Whilst PM concentrations are often greater at the roadside [8], the particles can have lifetimes of several days in the atmosphere, meaning that they can be distributed widely. The larger particles are subject to greater loss via sedimentation, so PM2.5 is more evenly distributed than PM10 [9].

The primary source of NO2 is fuel burning, for example in cars, trucks and buses, power plants, and off-road equipment. This gives NO2 a local pattern, concentrating where it is emitted in urban areas and near the roadside [10].

Ozone is complex as it is not directly emitted into the air, but is formed as a secondary pollutant by the reaction of nitrogen oxides (NOx) and volatile organic compounds (VOCs) in the presence of sunlight [11]. Ozone formation therefore depends on the VOC–NOx ratio [12]. Ozone concentrations in urban areas have been found to be lower than those in rural areas [13], due to the presence of more NOx at urban sites, which can remove ozone via the reaction of NO with O3 to give NO2 and oxygen (O2). O3 and NO2 are strongly anti-correlated, indicating that O3 is strongly depressed by high NOx [14]. Furthermore, ozone can have a lifetime of days to weeks [15], meaning that ozone at a specific site may have been produced by NOx and VOCs emitted from other distant locations. This behaviour produces seasonal variation in ozone concentrations: ozone is lower in the winter due to scavenging by NO and higher in the summer due to photochemical ozone production [11]. In contrast, PM and NO2 are at their lowest levels during the summer [16].

Therefore, our aim is to investigate robust methods for estimating missing values, including the case where a site has no measurements of a particular pollutant at all. Doing so should reduce the uncertainty in air quality assessment that results from measurements being partially or completely missing, enhance the air quality data, and provide a more realistic DAQI. A DAQI calculated from observed data only may give a false representation of the air quality: for example, if there were high concentrations of an air pollutant that was not being measured, the air quality may be worse than indicated by the DAQI.

To achieve our goal, we need to understand the relation between different pollutant concentrations and their geography. In particular, understanding such relations may enable us to impute missing data (including entire TS) where particular pollutants are not being measured. We postulate that in such cases pollutant measurements from other stations may act as a proxy measurement for the missing TS. Our approach starts by grouping stations with similar pollutant behaviour using a clustering algorithm. Once we have clusters, we can impute measurements for a station using information from the cluster it belongs to, or from other stations within that cluster.
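The idea of using other stations in the same cluster as a proxy can be illustrated with a short sketch. The paper's own imputation methods are not detailed in this section, so the code below is only a generic cluster-average illustration under stated assumptions: all series for the pollutant are aligned and of equal length, the cluster labels are already known, and the function name `impute_from_cluster` is hypothetical.

```python
import numpy as np


def impute_from_cluster(series, labels, target):
    """Estimate a whole missing TS as the per-time-step mean of cluster peers.

    series: dict mapping station name -> 1-D sequence of concentrations,
            or None when the station does not measure the pollutant.
    labels: dict mapping station name -> cluster id.
    target: name of the station whose TS is missing.
    """
    donors = [
        np.asarray(ts, dtype=float)
        for station, ts in series.items()
        if station != target
        and ts is not None
        and labels[station] == labels[target]
    ]
    if not donors:
        raise ValueError("no station in the cluster measures this pollutant")
    # nanmean ignores gaps in individual donor series at each time step.
    return np.nanmean(np.vstack(donors), axis=0)


# Hypothetical data: station D does not measure the pollutant.
series = {"A": [1.0, 2.0, 3.0], "B": [3.0, 4.0, 5.0], "C": [100.0, 100.0, 100.0], "D": None}
labels = {"A": 0, "B": 0, "C": 1, "D": 0}
estimate = impute_from_cluster(series, labels, "D")
```

Station C is excluded here because it belongs to a different cluster, which is the point of clustering first: only stations with similar behaviour contribute to the estimate.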

Clustering is an unsupervised learning method that groups unlabelled objects into homogeneous groups [17]. Similarly, for TS, we group together time series with similar patterns. The unique structure of TS means that many traditional clustering methods cannot be applied directly [18]. One challenge for TS clustering is how to measure similarity, which is the core of any clustering algorithm. Some of the univariate TS similarity measures cannot handle missing data or TS of different lengths [19]. The problem becomes more challenging when more than one time series is involved (i.e. in a multivariate TS environment). In this work, we experiment with novel MVTS clustering approaches and evaluate them in the context of air pollution measurements, and particularly for the task of imputing missing pollutant TS. We deal with an observation-based MVTS dataset with a high level of missing data and uncertainty.

We propose an MVTS clustering approach that starts by clustering stations based on all measured pollutants, using a fusion approach that aggregates the similarity/dissimilarity of the univariate TS (pollutants) between every two MVTS (stations). This aggregated similarity represents the distance between MVTS in the k-means clustering algorithm. Then, based on the clustering results, we propose two methods to impute the whole time series for the missing pollutant at a given station.
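The fusion step can be sketched as follows. The SBD function uses the standard definition, one minus the maximum normalised cross-correlation over all shifts. The aggregation rule is an assumption: this section says only that the univariate distances are aggregated, so the sketch averages the SBD over the pollutants that both stations measure. The names `sbd` and `fused_distance` are hypothetical, and the series are assumed to be already z-normalised and free of gaps.

```python
import numpy as np


def sbd(x, y):
    """Shape-Based Distance: 1 - max normalised cross-correlation over all lags."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    denom = np.linalg.norm(x) * np.linalg.norm(y)
    if denom == 0.0:
        return 1.0  # convention chosen here for all-zero series
    ncc = np.correlate(x, y, mode="full") / denom
    return 1.0 - float(ncc.max())


def fused_distance(station_a, station_b):
    """Average SBD over the pollutants measured at both stations (assumed rule).

    Each station is a dict mapping pollutant name -> 1-D sequence, or None
    when the pollutant is not measured there. Returns NaN if no pollutant
    is shared.
    """
    shared = [
        p for p, ts in station_a.items()
        if ts is not None and station_b.get(p) is not None
    ]
    if not shared:
        return float("nan")
    return float(np.mean([sbd(station_a[p], station_b[p]) for p in shared]))


# Hypothetical stations: only O3 is measured at both.
a = {"O3": [0, 1, 2, 1, 0, 0], "NO2": [1, 0, 1, 0, 1, 0], "PM10": None}
b = {"O3": [0, 0, 1, 2, 1, 0], "NO2": None, "PM10": [2, 2, 1, 1, 2, 2]}
d = fused_distance(a, b)
```

In this example the two O3 series have the same shape shifted by one step, so their SBD is close to zero. Averaging over shared pollutants lets stations with different sets of instruments still be compared, which matters when many stations measure only a subset of the four pollutants.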

Three experiments are carried out to demonstrate the validity of our approach. In these experiments, we compare the clustering and imputation results obtained using MVTS clustering with those obtained using univariate TS clustering.

The structure of the paper is as follows: Section 2 discusses some of the existing TS clustering methods, their applications, and their limitations when applied to our air pollution data. Section 3 discusses in detail the methods we used to measure the similarity between MVTS for the clustering algorithm, the methods used to impute the missing pollutants, and the methods used to evaluate our proposed solutions. In Section 5, we analyse and compare the results of our experiments, and we discuss these results in Section 6. We conclude in Section 7 with some recommended future work.

Related work

Due to the increasing availability of time series data and the demand to analyse them, clustering time series has attracted growing research interest in recent years [4], [20], [18], [21], [19], [22], [23]. However, most of the existing clustering methods are for univariate TS data, while clustering multivariate time series remains a challenging task [21].

The main problem with multivariate time series is dimensionality, and the majority of the existing researchers have proposed methods for

Time series analysis

As previously mentioned, the UK air pollution data used in this work have a high level of missing data, with pollutant TS missing either partially or completely. As a pre-processing step, we impute partial missing values within the TS to create a complete dataset. Imputing the missing observations at an early stage enables us to measure the similarity between TS using univariate time series similarity measures that cannot handle missing data (i.e. Dynamic Time Warping (DTW) and Shape-Based Distance (SBD)).

In

Experimental set up

The clustering algorithm and imputation methods were implemented in R (version 3.5.2). To provide a more robust testing scenario, we separate the ‘model building’ stage for the imputation from the testing stage. We use an initial data period of three years (2015–2017) as a training set to build the imputations, including the clustering results, and then impute on the next year (2018) of the TS to evaluate the goodness of fit.
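The evaluation step can be sketched as below. The authors' implementation is in R; this Python sketch only illustrates the metric, assuming the imputed and observed series for the test year are aligned and that time steps with a missing observation are excluded from the error. The function names are hypothetical, and the standard deviation is the population form (ddof=0), which may differ from the one used in the paper.

```python
import numpy as np


def rmse(imputed, observed):
    """Root Mean Squared Error over time steps where both values are present."""
    imputed = np.asarray(imputed, dtype=float)
    observed = np.asarray(observed, dtype=float)
    mask = ~np.isnan(imputed) & ~np.isnan(observed)
    if not mask.any():
        return float("nan")
    return float(np.sqrt(np.mean((imputed[mask] - observed[mask]) ** 2)))


def summarise(rmse_values):
    """Return the mean and standard deviation (ddof=0) of per-station RMSE values."""
    values = np.asarray(rmse_values, dtype=float)
    return float(values.mean()), float(values.std())


# Hypothetical test-year values for two stations; NaN marks a missing observation.
per_station = [
    rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]),
    rmse([1.0, 4.0, 3.0], [2.0, np.nan, 5.0]),
]
mean_rmse, std_rmse = summarise(per_station)
```

Reporting the spread alongside the mean shows whether an imputation method performs consistently across stations or only well on average.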

To fully evaluate the advantages of the MVTS clustering over the

Experiment 1: univariate TS clustering

In this experiment, we applied the first approach, imputing pollutant concentrations with our proposed imputation methods after clustering each pollutant individually. We analyse the clustering results for each pollutant in the following sections.

Discussion

Our analysis showed that a basic k-means algorithm with fused distances results in geographical patterns that are consistent with our understanding of sources and lifetimes of these pollutants, as explained in Section 5.2.

We found that using the basic k-means with the MVTS clustering and fused similarity in the second and third experiments gave a clear geographical correlation between the stations. Our results of analysing the centroids of the clusters identify similar pollutant concentrations

Conclusion

In this work, we proposed a model to impute a missing pollutant (whole TS) through an MVTS clustering approach. We conducted multiple experiments to evaluate the effectiveness of our approach. We compared the proposed approach (i.e. the MVTS clustering using the fused similarity that combines the SBD for all the pollutants) with the univariate TS clustering using SBD for individual pollutants. These two approaches are compared in terms of the clustering and the imputation quality.

We found that

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgement

We thank the anonymous referees for their useful suggestions.

References (43)

  • P.J. Rousseeuw, Silhouettes: a graphical aid to the interpretation and validation of cluster analysis, J. Comput. Appl. Math. (1987)
  • G. Di Bello, V. Lapenna, M. Macchiato, C. Satriano, C. Serio, V. Tramutoli, et al., Parametric time series analysis of...
  • S. Seto, W. Zhang, Y. Zhou, Multivariate time series classification using dynamic time warping template selection for...
  • DEFRA air information resource, URL:...
  • P. Holnicki et al., Emission data uncertainty in urban air quality modeling–case study, Environ. Model. Assessment (2015)
  • Public Health sources and effects of pm2.5, URL:...
  • National Statistics concentrations of particulate matter pm10 and pm25, URL:...
  • Centreforcities cities outlook 2020, URL:...
  • F.M. Diaz et al., Ozone trends in the united kingdom over the last 30 years, Atmosphere (2020)
  • G.M. Mazzuca, X. Ren, C.P. Loughner, M. Estes, J.H. Crawford, K.E. Pickering, A.J. Weinheimer, R.R. Dickerson, Ozone...
  • M.A.H. Khan et al., An estimation of the levels of stabilized criegee intermediates in the uk urban and rural atmosphere using the steady-state approximation and the potential effects of these intermediates on tropospheric oxidation cycles, Int. J. Chem. Kinet. (2017)

Wedad Alahamade obtained her B.Sc. in Computer Science at Taibah University, Medina, Saudi Arabia with a first class Honours in 2007. Since 2010, she has worked for the Faculty of Computing Sciences, Taibah University as a lecturer. She obtained her M.Sc. degree in Computer Science at Rochester Institute of Technology (RIT), Rochester, NY, USA in 2016. She joined the machine learning group at the University of East Anglia (UEA) and started her Ph.D. research in October 2018 in the field of data mining. Her main research interests include data mining and time-series analysis.

Iain Lake is Professor of Environmental Sciences in the School of Environmental Sciences at UEA. He examines the link between the natural environment and public health. For example, he has been looking at: how gastrointestinal infections are affected by the weather, temperature, rainfall; how the climate influences Dengue fever in Mexico and central America; and the impact of climate change on allergies – especially related to pollen.

Claire Reeves I have been at the University of East Anglia (UEA) throughout my academic career. After completing a BSc in Environmental Sciences and a PhD in Atmospheric Science, I was a researcher for many years. In 2010 I became Reader and in 2014 a Professor of Atmospheric Science. In 2021 I took early retirement but continue at UEA as a Professorial Fellow. My main research interests are tropospheric chemistry, air quality and halogenated compounds relevant to ozone depletion and climate change. I have contributed to the UNEP/WMO Scientific Assessments of Ozone Depletion and am on the Defra Air Quality Expert Group.

Beatriz de la Iglesia was appointed as a lecturer at UEA in 2001 and since then she has led on the development of the Master's degrees at the School of Computing Sciences. She obtained her first degree, a first class BSc Honours in Applied Computing, at UEA in 1994. She obtained her PhD in Computing Science in September 2001. The subject of her research was data mining and in particular the extraction of partial classification rules or nuggets using meta-heuristic algorithms.

Beatriz has worked as a data mining researcher for the past 25 years, with particular application to health care data analysis. She has worked, among other themes, on the analysis of primary care datasets for cardiovascular disease risk evaluation; on text mining of gastroenterology procedural reports to identify key success indicators; on linking data in the secondary care setting in order to create patient-centric databases suitable for clinical research; and on analysis of Twitter data for syndromic surveillance. She has experience of developing new data mining algorithms using optimisation techniques and has over 60 peer reviewed publications.