Evolving Gaussian process models for prediction of ozone concentration in the air

doi:10.1016/j.simpat.2012.04.005

Simulation Modelling Practice and Theory

Volume 33, April 2013, Pages 68-80

https://doi.org/10.1016/j.simpat.2012.04.005

Abstract

Ozone is one of the main air pollutants with a harmful influence on human health. Therefore, predicting the ozone concentration and informing the population when the air-quality standards are not being met is an important task. In this paper, various first- and high-order Gaussian process models for prediction of the ozone concentration in the air of Bourgas, Bulgaria are identified off-line based on the hourly measurements of the concentrations of ozone, sulphur dioxide, nitrogen dioxide, phenol and benzene in the air and the meteorological parameters, collected at the automatic measurement stations in Bourgas. Further, as an alternative approach, an on-line updating (evolving) Gaussian process model is proposed and evaluated. Such an approach is needed when the training data are not available for the whole period of interest, so that not all characteristics of the period can be learned, or when the environment to be modelled is constantly changing.

Introduction

Ozone (O₃), a form of oxygen, is a highly unstable and poisonous gas that can form and react under the action of light and that is present in two layers of the atmosphere. Ozone is a very specific atmospheric constituent that is present throughout the Earth’s atmosphere, from ground level to the top of the atmosphere. Stratospheric ozone prevents harmful solar ultraviolet radiation from reaching the Earth’s surface. However, in the tropospheric layer, which is at ground level, ozone is an air pollutant that damages human health and the equilibrium of ecosystems. Exposure to ozone can cause serious health problems in plants and people, so ozone pollution is a major problem in some regions of the world. It tends to increase during periods of high temperatures and sunny skies. The changes of the ozone content in the troposphere and the complexity of the processes defining these changes are the reasons why atmospheric ozone dynamics are the object of intensive research.

The most direct way to obtain accurate air quality information is from measurements made at surface monitoring stations across countries. Fixed measurements of hourly ozone concentrations in compliance with the European Directive on ambient air quality and cleaner air for Europe [1] give continuous information about the evolution of surface ozone pollution at a large number of sites across Europe. In several Member States, they are increasingly supplemented by numerical model outputs delivered at a regional or local scale, in keeping with the European Directive. The European standards that guarantee human-health protection are as follows: health protection level, 120 μg/m³ 8 h mean concentration; informing the public level, 180 μg/m³ 1 h mean concentration; and warning the public level, 240 μg/m³ 1 h mean concentration. Therefore, predicting the ozone concentration and informing the population when the air-quality standards are not being met are important tasks.
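The three levels quoted above can be checked against a series of hourly means with a few lines of code. The following is a minimal sketch only: the function names are hypothetical, the comparison is assumed to be a strict "greater than", and the Directive's detailed averaging and data-capture rules are not reproduced.

```python
# European ozone levels quoted in the text (all in micrograms per cubic metre).
HEALTH_PROTECTION_8H = 120.0   # 8 h mean concentration
INFORM_PUBLIC_1H = 180.0       # 1 h mean concentration
WARN_PUBLIC_1H = 240.0         # 1 h mean concentration


def max_8h_running_mean(hourly):
    """Largest mean over any 8 consecutive hourly values (hypothetical helper)."""
    if len(hourly) < 8:
        raise ValueError("need at least 8 hourly values")
    window_sum = sum(hourly[:8])
    best = window_sum
    for i in range(8, len(hourly)):
        # Slide the window by one hour: add the new value, drop the oldest.
        window_sum += hourly[i] - hourly[i - 8]
        best = max(best, window_sum)
    return best / 8.0


def exceeded_levels(hourly):
    """Return the names of the levels exceeded by a series of hourly means."""
    levels = []
    if max_8h_running_mean(hourly) > HEALTH_PROTECTION_8H:
        levels.append("health protection")
    peak = max(hourly)
    if peak > INFORM_PUBLIC_1H:
        levels.append("informing the public")
    if peak > WARN_PUBLIC_1H:
        levels.append("warning the public")
    return levels
```

In a prediction setting, the same check could be applied to the predicted concentrations rather than to the measurements.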

Ozone concentration has a pronounced daily cycle [22], which can be modelled and forecasted using a variety of methods. Methods that describe the non-linear dynamics from available data are particularly useful. Thus, there exist a number of methods for ozone concentration prediction based on various modelling techniques, e.g. neural network NARX models [2], [15], [31], polynomial NARX models [26], fuzzy systems [20], [21], support vector machines [7], ARIMA stochastic models [9] and Gaussian processes (GP) [15], [14], [13]. There are also methods based on a combination of some of the mentioned techniques, e.g. the approach in [10] combines the use of neural networks, support vector machines and genetic algorithms.

In this paper, we focus on the use of GP modelling techniques for development and comparison of various models for prediction of ozone concentration in the air. The GP model is a probabilistic, non-parametric model based on the principles of Bayesian probability. It differs from most of the other black-box identification approaches in that it does not try to approximate the modelled system by fitting the parameters of the selected basis functions, but rather by searching for relationships among the measured data. The output of the GP model is a normal distribution, expressed in terms of the mean and the variance. The mean value represents the most likely output and the variance can be interpreted as a measure of its confidence. The obtained variance, which depends on the amount and the quality of the available identification data, is important information that distinguishes GP models from other computational intelligence methods. Because of these properties, GP models are especially suitable for modelling uncertain processes or when the modelling data are unreliable, noisy or missing. GP models are therefore well suited to modelling environmental systems, including ozone pollution. Thus, GP models have been developed for prediction of ozone concentration in the air of Bourgas, which is among the regions in Bulgaria with the highest levels of ozone pollution in the air. In [14], first-order GP models based on measurements of the air-pollutant concentrations are identified and verified for one-step-ahead predictions of the ozone concentration in the air of Bourgas. Furthermore, in [13], high-order GP models using measurements of both the air pollutants and the meteorological parameters are identified and verified. In both cases, GP models are trained off-line using only a subset of the available data due to the high computational burden of training GP models. However, this limitation can be mitigated, and consequently the quality of GP models improved, by on-line updating using the most recent measurements.
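To illustrate how a GP model returns a mean and a variance for each prediction, here is a minimal sketch using NumPy. It assumes a zero-mean prior, a squared-exponential covariance function and fixed hyperparameter values; none of these choices are taken from the paper, and the function names are hypothetical.

```python
import numpy as np


def sq_exp_kernel(a, b, signal_var=1.0, length_scale=1.0):
    """Squared-exponential covariance between the rows of a and the rows of b."""
    sq_dist = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2.0 * a @ b.T
    # Clip tiny negative values caused by rounding before exponentiating.
    return signal_var * np.exp(-0.5 * np.maximum(sq_dist, 0.0) / length_scale**2)


def gp_predict(x_train, y_train, x_test, noise_var=1e-2):
    """Predictive mean and variance of a zero-mean GP with Gaussian output noise."""
    k = sq_exp_kernel(x_train, x_train) + noise_var * np.eye(len(x_train))
    k_star = sq_exp_kernel(x_train, x_test)
    chol = np.linalg.cholesky(k)  # k = chol @ chol.T
    alpha = np.linalg.solve(chol.T, np.linalg.solve(chol, y_train))
    mean = k_star.T @ alpha
    v = np.linalg.solve(chol, k_star)
    # Prior variance minus the part explained by the data, plus the output noise.
    var = sq_exp_kernel(x_test, x_test).diagonal() - np.sum(v**2, axis=0) + noise_var
    return mean, var
```

Far from the training data the predicted variance returns to the prior variance plus the noise variance, which is how the model signals low confidence in regions it has not seen.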

A noticeable drawback of system identification with GP models is the computation time necessary for the modelling. Regression based on GP models involves several matrix computations, such as matrix inversion and the calculation of the log-determinant of the covariance matrix, whose computational complexity increases with the third power of the number of input data. This computational cost restricts the amount of training data to at most a few thousand cases. To overcome this limitation and to make the method usable for large-scale datasets, numerous authors have suggested various sparse approximations [27], [28]. All sparse approximate methods try to retain the bulk of the information contained in the full training dataset, but reduce the size of the covariance matrix to facilitate a less computationally demanding implementation of the GP model. A special kind of sparse approximate method is the on-line modelling method Sparse On-line Gaussian Processes (OGP) [8], which tries to incorporate all the information in the data by projecting it onto the reduced covariance matrix.
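The matrix inversion and the log-determinant mentioned above both appear in the standard log marginal likelihood of a zero-mean GP, which is commonly maximised to train the hyperparameters. The symbols below are assumptions introduced for illustration: K is the N × N covariance matrix of the training inputs (including the noise term), y is the vector of N training outputs and L is the Cholesky factor of K.

```latex
\log p(\mathbf{y} \mid X)
  = -\tfrac{1}{2}\,\mathbf{y}^{\top} K^{-1} \mathbf{y}
    - \tfrac{1}{2}\log\lvert K\rvert
    - \tfrac{N}{2}\log 2\pi ,
\qquad
K = L L^{\top}, \quad
\log\lvert K\rvert = 2\sum_{i=1}^{N} \log L_{ii}
```

Computing the factor L takes on the order of N³ operations, and it has to be recomputed each time the hyperparameters change during optimisation, which is where the cubic cost comes from.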

The OGP method has already been applied to modelling the ozone concentration in the air [25]. As the weather and its characteristics are constantly changing, the model should be updated and adjusted as well. This means that the method should not only update the model with the information contained in the streaming data, but should concurrently optimize the hyperparameter values as well. In our experience the OGP method has problems with numerical instability. We therefore propose what is, in our opinion, a more robust method for on-line updating (evolving) of a GP model and compare its performance with that of off-line trained GP models. The proposed method is based on the concept described in [24], but it is implemented differently from the method used in the experimental part of the paper.
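The general idea of updating a GP model with streaming data can be sketched with a sliding window of the most recent observations. This is a deliberately simple stand-in, not the method proposed in the paper and not OGP: the class name is hypothetical, the covariance function is assumed to be squared exponential, and the hyperparameters are kept fixed instead of being optimized concurrently.

```python
from collections import deque

import numpy as np


class SlidingWindowGP:
    """Zero-mean GP refitted on the most recent `max_points` observations."""

    def __init__(self, max_points=200, length_scale=1.0, signal_var=1.0, noise_var=1e-2):
        self.window = deque(maxlen=max_points)  # oldest points drop out automatically
        self.length_scale = length_scale
        self.signal_var = signal_var
        self.noise_var = noise_var

    def _kernel(self, a, b):
        # Squared-exponential covariance between the rows of a and the rows of b.
        sq_dist = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=2)
        return self.signal_var * np.exp(-0.5 * sq_dist / self.length_scale**2)

    def update(self, x, y):
        """Add one new (regressor vector, measured output) pair."""
        self.window.append((np.asarray(x, dtype=float), float(y)))

    def predict(self, x):
        """Predictive mean and variance for one regressor vector."""
        if not self.window:
            raise ValueError("no data in the window yet")
        xs = np.array([p[0] for p in self.window])
        ys = np.array([p[1] for p in self.window])
        x = np.asarray(x, dtype=float)[None, :]
        k = self._kernel(xs, xs) + self.noise_var * np.eye(len(xs))
        k_star = self._kernel(xs, x)
        mean = k_star.T @ np.linalg.solve(k, ys)
        var = self.signal_var + self.noise_var - np.sum(k_star * np.linalg.solve(k, k_star))
        return float(mean[0]), float(var)
```

Because the window size is bounded, the cost of each prediction stays bounded as well, at the price of forgetting older data regardless of how informative they were.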

The paper is structured as follows. In Section 2, the use and properties of Gaussian processes for modelling are reviewed. In Section 3, first- and high-order GP models for prediction of ozone concentration in the air of Bourgas, Bulgaria are identified off-line. A method for prediction of ozone concentration based on an on-line updated (evolving) GP model is proposed and evaluated in Section 4. The concluding remarks end the paper.

Modelling of dynamic systems with Gaussian processes

A GP model is a flexible, probabilistic, non-parametric model with uncertainty predictions. Its uses and properties for modelling are reviewed in [29]. The use of Gaussian processes for modelling dynamic systems is a relatively recent development [6], [12], [18]. A retrospective review can be found in [17].

A Gaussian process is a collection of random variables which have a joint multivariate Gaussian distribution (Fig. 1). Assuming a relationship of the form y = f(x) between input x and output y,

Off-line GP models for prediction of ozone concentration in the air of Bourgas

Measurement data for the year 2008, collected at the automatic measurement station in the centre of Bourgas, Bulgaria, are used. The data includes hourly measurements of the concentrations of ozone, sulphur dioxide, nitrogen dioxide, phenol and benzene. The meteorological parameters have not been measured at this station. However, in order to study how these parameters would influence the prediction of ozone concentration in the air of Bourgas, their measurements at two other stations in the

Evolving GP models

In the previous section, data for the whole period of interest, in our case the year 2008, are available for training and validation.¹ Therefore all characteristics of the data from the whole period can be learned. In this section we will test an alternative approach to training GP models that is needed when the training data are not available for the whole period of interest and

Results

To assess the effectiveness of the proposed method, we divided the data into four periods: astronomical winter, astronomical spring, astronomical summer and astronomical fall,² as data are available only for the year 2008 and it can be shown (Fig. 3) that the selected seasons have different characteristics.

From Fig. 3

Conclusions

In this paper, various first- and high-order GP models for prediction of the ozone concentration in the air of Bourgas, Bulgaria are identified off-line and compared. The results show that, for the same model type, the high-order models are more accurate than the first-order models. The best model is the second-order model, whose inputs are the values of the ozone concentration and the meteorological parameters at the two previous hours.
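The difference between first- and high-order models lies in how many previous hours enter the input (regressor) vector. The sketch below builds such vectors for one-step-ahead prediction; the function name, the dictionary layout and the key "ozone" are hypothetical and are not taken from the paper.

```python
def build_regressors(series, order):
    """Build (inputs, targets) for one-step-ahead prediction.

    series: dict mapping a signal name to a list of hourly values;
            the key "ozone" is the predicted signal (hypothetical naming).
    order:  number of previous hours used as regressors.
    """
    names = sorted(series)
    n = len(series["ozone"])
    if any(len(series[name]) != n for name in names):
        raise ValueError("all signals must have the same length")
    inputs, targets = [], []
    for k in range(order, n):
        row = []
        for name in names:
            # Values at hours k-1, k-2, ..., k-order for this signal.
            row.extend(series[name][k - lag] for lag in range(1, order + 1))
        inputs.append(row)
        targets.append(series["ozone"][k])
    return inputs, targets
```

With order=1 this yields the inputs of a first-order model; with order=2 each row holds the values of every signal at the two previous hours, as in the second-order model described above.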

Furthermore, as an alternative approach an on-line updating

Acknowledgements

This work was financed by the National Science Fund of the Ministry of Education, Youth and Science of Republic of Bulgaria, Contract DO02-94/14.12.2008 and the Slovenian Research Agency, Contract BI-BG/09–10-005 (“Application of Gaussian processes to the modelling and control of complex stochastic systems”).

References (31)

S.M. Al-Alawi et al., Combining principal component regression and artificial neural-networks for more accurate predictions of ground-level ozone, Environmental Modelling & Software (2008)
K. Ažman et al., Application of Gaussian processes for black-box modelling of biosystems, ISA Transactions (2007)
C. Dueñas et al., Stochastic model to forecast ground-level ozone concentration at urban and rural areas, Chemosphere (2005)
Y. Feng et al., Ozone concentration forecast method based on genetic algorithm optimized back propagation neural networks and support vector machine data classification, Atmospheric Environment (2011)
B. Fritzke, Growing cell structures – a self-organizing network for unsupervised and supervised learning, Neural Networks (1994)
A. Grancharova et al., Explicit stochastic predictive control of combustion plants based on Gaussian process models, Automatica (2008)
J. Kocijan et al., Gas–liquid separator modelling and simulation with Gaussian-process models, Simulation Modelling Practice and Theory (2008)
Y. Lin et al., Fuzzy system models combined with nonlinear regression for daily ground-level ozone predictions, Atmospheric Environment (2007)
E. Pisoni et al., Forecasting peak air pollution levels using NARX models, Engineering Applications of Artificial Intelligence (2009)
D. 2008/50/EC, Directive 2008/50/EC of the European Parliament and of the Council of 21 May 2008 on ambient air quality...
Angelov, P., Buswell, R., 2001. Evolving rule-based models: a tool for intelligent adaptation, in: Proceedings of the...
P. Angelov et al., Evolving Intelligent Systems: Methodology and Applications (2010)
K.J. Åström et al., Computer Controlled Systems: Theory and Design (1984)
A.B. Chelani, Prediction of daily maximum ground ozone concentration using support vector machine, Environmental Monitoring and Assessment (2010)
L. Csató et al., Sparse online Gaussian processes, Neural Computation (2002)
