1 Introduction

The low cost of sensors for monitoring airborne pollutants, together with the abundance of environmental data, has led to a dramatic increase in the amount of pollution data available for analysis (Bellinger et al. 2017). Air quality prediction is vital because of its role in safeguarding human health, and ground-level O3 prediction is therefore also crucial (Samsuri Abdullah et al. 2020; Geetha and Prasika 2018). Clean air is essential for human health and well-being, and air pollution is a real hazard to both people and the planet (Mohd Napi et al. 2020). Indeed, ground-level O3 is a global air pollution issue and a dangerous pollutant. In non-rural zones, ozone levels vary widely with meteorological conditions and gas emissions (Abdul Aziz et al. 2019). Tropospheric ozone remains an almost inescapable air pollutant that affects human well-being, sustainability and the planet as a whole. Indeed, high concentrations of O3 in the tropospheric layer pose a prospective danger to human health, crops, flora and fauna (Arsić et al. 2020).

Ozone, as illustrated in Flowchart 1, is crucial because it plays an important role in the thermal balance of the atmosphere. In urban areas, premature deaths and chronic diseases among vulnerable community members are caused by hazardous air quality (Abdullah et al. 2019; World Health Organization 2018). Owing to this, the artificial neural network (ANN) has become a field of interest for prediction purposes, especially for meteorological data and volatile organic compounds (Cabaneros et al. 2019). Recently, ANNs have performed well in various short- and long-term prediction applications (Lightstone et al. 2017; Rahimi 2017).

Flowchart 1. Illustration of the production and interaction of O3

Owing to this, to help minimize the related consequences, short-term prediction is highly recommended, since it is repeated frequently over short intervals and allows public authorities to raise the population's awareness. Recently, the support vector machine (SVM), alongside the artificial neural network, both derived from soft computing, has become widely used for air quality prediction based on meteorological and gas emission data obtained from air monitoring networks (Al-Abri and S. 2016).

Research papers that focus on the use of "machine learning" or "artificial intelligence" in ozone prediction are checked and evaluated throughout, following the modelling protocol recommended by Wu et al. (2014). In addition, the performance of machine learning models for ozone prediction is examined, as are papers concerning tropospheric and stratospheric ozone concentrations in industrial and urban areas (Wang et al. 2018c). Although the multilayer perceptron (MLP) network is widely used in various prediction and forecasting applications (Cabaneros et al. 2017; Muslim et al. 2020), other model formulations are also examined. It is expected that, through this paper, readers will clearly understand the variety of terminologies and concepts in ozone prediction approaches. For detailed discussion of the different subjects mentioned in the papers and textbooks, see Bishop (2014), Kotzias et al. (2009) and Swamy (2018). All abbreviations used are listed in Table 1.

Table 1 List of abbreviations

The development and types of ANN models are outlined briefly. A detailed discussion of every single stage is provided elsewhere (see Galelli et al. 2014; Humphrey et al. 2017; Jiang et al. 2004). Furthermore, the taxonomies of current selections at the different stages of prediction development were based on previous studies. The outcome of the papers reviewed is that the concentration of O3 reflects the meteorological conditions: pollution due to high ozone concentration is affected by a variety of factors and is a typical result of meteorological conditions such as temperature and relative humidity.

2 Literature Review

In this paper, the articles covered were selected from the following international journals: Springer Nature, Atmospheric Environment, Environment International, Journal of Cleaner Production, American Geophysical Union, Atmosphere (MDPI), Environmental Research, Environmental Science and Technology, Science of the Total Environment, Atmospheric Chemistry and Physics, Environmental Pollution, Frontiers in Earth Science, The International Ozone Association, International Journal of Environmental Research and Public Health and International Journal of Automation and Computing. Also included were international journals such as International Journal of Modelling on Simulation, Journal of Intelligent and Fuzzy Systems, Neural Networks, Air Quality, Atmosphere, and Health, Journal of Hazardous Materials, Water Resources, Journal of the Air and Waste Management Association, ScienceDirect, Chemosphere, Ecological Modelling, Atmospheric Measurement Techniques, Geoscientific Model Development, Journal of Scientific and Industrial Research, Environmental Modelling and Software, World Health Organization, Atmospheric Pollution Research, Neural Networks in a Softcomputing Framework, Fresenius Environmental Bulletin, Integrative Biology, Journal of Advanced Science and Technology, Advances in Neural Information Processing Systems, Advances in Space Research, Aerosol and Air Quality Research, Aerosol Science and Technology, Agricultural Water Management, Algorithms, American Chemical Society, American Journal of Epidemiology, American Statistical Association, Applied Acoustics, Applied Sciences, Applied Soft Computing, Applied Water Science, Arabian Journal of Geosciences, Artificial Intelligence Review, Atmosphere, Atmospheric and Oceanic Optics, Atmospheric Research, Chemical Engineering Research and Design, Chemometrics and Intelligent Laboratory Systems, Chinese Journal of Environmental Engineering, Computational Intelligence and Neuroscience, Computer Aided Chemical Engineering, Decision Support Systems, Ecological Processes and Environmental Modelling and Assessment.

The research papers were accessed through EZproxy library catalogues by Universiti Tenaga Nasional, ScienceDirect, IEEE Xplore and Scopus. Search terms included "Ozone Prediction", "Ozone Concentration Forecasting", "Artificial Neural Networks" and "Machine Learning and ozone concentration". For each database used in the review process, the keyword search procedure was repeated until no new citations appeared. Additionally, the reference lists of the papers reviewed were tracked to obtain further references. The articles selected for this review were published from 2015 to June 2020. Articles related to stratospheric and tropospheric ozone concentration are included, whereas articles with undesirable performance or results similar to other approaches were not covered. Table 2 states the characteristics of the covered papers, such as the authors, study area and data examined.

The main contribution of this review is to present recent machine learning techniques, including SVM, ANN, decision tree and hybrid models, for predicting ozone concentration. A variety of papers from different journals and scientific platforms has been reviewed.

Details regarding the publication year of the papers reviewed are shown in Fig. 1a, while the distribution of articles by related air pollutant parameter is presented in Fig. 1b. The findings show that oxides of nitrogen (e.g. NO2 and NO), PM10, PM2.5 and O3 are the most frequently tested parameters in the papers reviewed. O3 concentration prediction was tested in 63 of the 156 papers, whereas PM concentration prediction, at roughly half that rate, was studied in 28 papers. NO and CO were studied in 20 and 10 papers, respectively. Other variables were discussed with lower research intensity.

Fig. 1. a Distribution of research papers by publication year. b Distribution of air pollutants involved in this paper

The distribution of research papers by study area is shown in Fig. 2. A growing number of papers since 2015 address the prediction of ozone concentration with machine learning (ML) algorithms. This increase reflects the difficulty of accessing ML algorithms in the past (Cabaneros et al. 2019); recently, the availability of ML algorithms as faster and more helpful computing tools has earned research attention.

Fig. 2. Distribution of research papers by study area

The aim of this paper is to investigate all aspects affecting the prediction of O3 and the accuracy of models. This is done through a clear discussion that integrates previous studies in the same or similar fields. The research studies involved in this review are organized as shown in Table 2.

Table 2 Details of papers reviewed

3 Discussion

3.1 Theoretic Approaches

The prediction process proceeds with the help of past and present data of the required variable and then forecasts the future trend using logical reasoning and scientific methods (Yang et al. 2020). Historical data contain full knowledge of the different traits and historical behaviours of the data system. Time series prediction analysis plays an important role in catastrophe prevention and rational decision-making in various areas, so various approaches are required to efficiently predict data with different traits (Wang et al. 2018a).

Air pollutant concentration prediction is essential for issuing air quality warnings, which have practical significance for the community (Bai et al. 2018). Owing to this, several types of approaches have been established for ozone time series prediction analysis.

3.1.1 Information Theoretic Approach

Based on Kingman and Kullback (1970), information theories encompass subject fields that create an interdisciplinary framework for exploring different dynamic systems, their nonlinear relationships and their behaviours (Mayer et al. 2014). Understanding these characteristics is necessary for atmospheric and environmental systems characterized by a high degree of complexity and nonlinearity (Chattopadhyay et al. 2020). Thus, a study carried out by Chattopadhyay et al. (2020) on the dependence of tropospheric ozone on several of its precursors during the summer monsoon explored the intrinsic multicollinearity via Bartlett's sphericity test. Additionally, PCA was performed to obtain the most influential precursors of ozone by identifying the principal components and maximizing factor loadings. This study investigated the intrinsic uncertainty and identified the normal distribution. However, further study is needed for the post-monsoon and winter seasons, the periods when pollution reaches alarming levels.

The detrended fluctuation analysis of power law correlations in column ozone over Antarctica was carried out by Varotsos (2005a) using a dataset from 1979 to 2003. The study illustrated the role of planetary waves in the spatiotemporal scaling of the Antarctic O3 hole and found that the intrinsic dynamics of column ozone at the edge of Antarctica have changed since 1996, so the earlier scaling behaviour no longer holds for the Antarctic ozone hole (Varotsos 2005a). Similarly, in line with detrended fluctuation analysis for the global ozone layer, Varotsos (2005b) utilized modern computational techniques. This analysis was performed on globally and zonally averaged column ozone data from satellite-borne (1979–2003) and ground-based (1964–2004) instruments to identify long-term correlations in the column ozone time series. The results showed that column ozone fluctuations exhibit persistent long-range power law correlations at all time lags.

3.1.2 Fuzzy Set Theoretic Approaches

In 1993, Song and Chissom proposed the definition of fuzzy time series (FTS) based on fuzzy sets (Song and Chissom 1993). Recently, fuzzy time series has been utilized for the prediction of air pollutants. In a study carried out by Wang et al. (2017), fuzzy set theory was used to build the air quality index: the paper deploys the trapezoidal function to quantify the negative effects of individual pollutants as membership degrees. However, the limitation of the research is the low number of air pollutants examined, while ignored environmental variables might influence the AQI level. Furthermore, pollutant concentrations are considered with respect to time, whereas a variety of factors correlated with prediction efficiency has not been considered.
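To make the fuzzy set idea concrete, the sketch below implements a trapezoidal membership function of the kind such AQI formulations deploy. It is a minimal Python illustration; the breakpoints are hypothetical and are not those used by Wang et al. (2017).

```python
def trapezoidal_membership(x, a, b, c, d):
    """Trapezoidal membership: rises on [a, b], equals 1 on [b, c],
    falls on [c, d] and is 0 outside [a, d]."""
    if b <= x <= c:
        return 1.0
    if a < x < b:
        return (x - a) / (b - a)
    if c < x < d:
        return (d - x) / (d - c)
    return 0.0

# Hypothetical breakpoints (ppb) for a "moderate" ozone band
for o3 in (40, 55, 70, 85, 100):
    print(o3, trapezoidal_membership(o3, a=40, b=60, c=80, d=100))
```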

Another fuzzy approach was developed by Wojtylak (2012); this approach utilized fuzzy time series models to predict pollutants such as O3, CO, NO, SO2, PM2.5 and PM. A positive aspect of the study is that it encompasses all the data, including chaotic, uncertain and imprecise data that could not otherwise be utilized. Although the experimental results were promising, the uncertainty and stability analysis, which is vital for the proposed model, is still missing; not performing it leads to insufficient results.

In line with this demand for stability and uncertainty analysis, Yang et al. (2020) implemented both analyses to evaluate the robustness of their model. This novel combined prediction system for air pollution relies on fuzzy theory and aggregation weight optimization. The proposed system can take more information into account while maintaining the diversity of its component models. For the prediction of PM2.5 and PM10, it outweighed other models, such as the backpropagation artificial neural network (BPNN), extreme learning machine (ELM) and double exponential, in the generalization, stability and accuracy that underpin a robust air quality early-warning system in practice.

3.1.3 Probabilistic Set Theoretic Approaches

Probabilistic approaches to air quality prediction have been highly preferred as a way to enhance ozone prediction capability (Vautard et al. 2009). To support air quality forecasters in the USA, a novel approach to probabilistic forecasting of surface ozone was proposed (Balashov et al. 2017). This study investigates a surface ozone prediction approach that relies on standard meteorological variables and statistical methods. It aims to generate probabilistic MDA8 ozone predictions; REGiS weighs and combines the developed regression models according to the weather patterns forecast by an NWP model. The outcomes showed that the model performs best when trained and adjusted individually for a single air quality monitoring station and its corresponding meteorological site. However, this approach is not designed to deal with sudden local emission changes or events such as biomass fires.

The National Center for Atmospheric Research has carried out research to enhance air quality prediction with an analog ensemble (Delle Monache et al. 2020). The analog ensemble estimates the probability distribution of the true state given a current deterministic numerical forecast and an archive of previous analogous forecasts paired with their observations. The resulting probabilistic predictions are statistically sharp, reliable and consistent, quantifying the uncertainty of the underlying forecast. However, a wide range of datasets is required, since analog-based approaches perform poorly when dealing with wildfire smoke events.
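The analog ensemble idea can be summarized in a few lines: given today's deterministic forecast, retrieve the k most similar past forecasts and use their paired observations as ensemble members. The Python sketch below is a simplified illustration on synthetic data using plain Euclidean distance; the operational metric and predictor weighting of Delle Monache et al. (2020) differ in detail.

```python
import numpy as np

def analog_ensemble(past_forecasts, past_observations, current_forecast, k=10):
    """Return the k observations whose paired past forecasts are closest
    (Euclidean distance) to the current deterministic forecast."""
    dist = np.linalg.norm(past_forecasts - current_forecast, axis=1)
    members = past_observations[np.argsort(dist)[:k]]
    return members, members.mean(), members.std()

rng = np.random.default_rng(0)
F = rng.normal(size=(500, 3))            # archived forecasts (3 predictors)
obs = F @ np.array([1.0, -0.5, 0.3]) + 0.1 * rng.normal(size=500)
members, mean, spread = analog_ensemble(F, obs, current_forecast=F[0], k=15)
print(round(mean, 3), round(spread, 3))  # probabilistic summary of the ensemble
```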

A probabilistic prediction study for extreme NO2 pollution episodes was conducted by Aznarte (2017), using meteorological measurements and NO2 concentrations to construct models for predicting extreme NO2 concentrations. The experimental results demonstrated the reliability, accuracy and sharpness of the predictions, which outweigh point-forecasting alternatives; an investigation of the relative importance of the independent variables was also included. The study presented an uncomplicated, comprehensible way of computing the probability of exceeding thresholds, maximizing the advantages of probabilistic forecasts. Nevertheless, it lacks longer forecasting horizons and needs to include spatiotemporal considerations as well as other numerical prediction covariates.

3.2 Overview of Prediction Modelling

Based on the procedure described in Flowchart 2, models are generated automatically from data in ML. Data are the origin of ML approaches (Wang et al. 2019). The relationship between the inputs and outputs of an ML model can be described by the data, which may also be used in unsupervised techniques (Khan and Kumar 2019). Similarly, datasets encode the specification that needs to be incorporated into the ML model. Moreover, model selection should be implemented in order to choose an effective ML model. Data processing, data partitioning, model selection, feature engineering, training and evaluation all take place in ML, while also empowering governance, repeatability and collaboration (Wang et al. 2019). In this study, we focus on different ML models that are extensions of the well-known SVM (Kumar et al. 2019; Tanaskuli et al. 2020).

Fig. 3. Optimal hyperplane in SVM

3.3 Support Vector Machine

SVM is a supervised learning approach in which no assumption is made about the underlying distribution of the data (Mountrakis et al. 2011). It is a reputable machine learning technique that is widely applied to regression and classification problems (Faris et al. 2018), and its performance depends on the determination of its parameters (Beltrami and da Silva 2020). In regression forecasting, the support vector machine created by Vapnik (1979) is massively utilized as a comprehensive mathematical model. It provides reliable classification of high-dimensional big data by reducing it to a small number of data points (support vectors), allowing differential subgroups to be obtained in a short time (Ozer et al. 2020). Although generated for classification purposes, support vector regression (SVR) approaches can be successfully implemented for regression issues. SVR has been utilized to forecast cloud cover, visibility and solar radiation, and recently it has been used extensively to predict air pollutant concentrations such as O3 (Quej et al. 2017). According to Vapnik (2000), SVR provides an upper bound on the generalization error, following the structural risk minimization principle. SVR has shown superior accuracy compared to a number of statistical forecasting methods (Su et al. 2020). The SVR algorithm approximates the regression function using the equations (Eqs. (1), (2) and (3)) stated below

$$ y= w\phi (x)+b $$
(1)

where w is the weight vector, ϕ(x) is the high-dimensional feature space mapping, x is the input, y is the output and b is the bias coefficient.

These parameters can be estimated by minimizing the regularized risk function below.

$$ R=\frac{{\left\Vert w\right\Vert}^2}{2}+C{\sum}_{i=1}^N{L}_{\epsilon}\left({x}_i,{y}_i,f\ \right) $$
(2)

where

$$ {L}_{\varepsilon}\left({x}_i,{y}_i,f\right)=\left\{\begin{array}{c}\left|{y}_i-f\left({x}_i\right)\right|-\varepsilon, \mid {y}_i-f\left({x}_i\right)\mid \ge \varepsilon \\ {}0,\mathrm{otherwise}\end{array}\right. $$
(3)
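As a concrete illustration of Eqs. (1)–(3), the following minimal scikit-learn sketch fits an ε-insensitive SVR on synthetic data; here C plays the role of the regularization constant in Eq. (2) and epsilon the tolerance ε in Eq. (3). It is a toy example, not a reproduction of any reviewed study.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(300, 1))            # single synthetic predictor
y = np.sin(X).ravel() + 0.1 * rng.normal(size=300)

# C weights the epsilon-insensitive loss term of Eq. (2); epsilon is the
# tolerance band of Eq. (3) inside which residuals incur no loss.
model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0, epsilon=0.05))
model.fit(X, y)
print("support vectors:", model[-1].support_vectors_.shape[0])
```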

Currently, in the big data era, ML users face a variety of challenges related to the volume, variety and velocity of data when applying a support vector machine in an environmental scheme. Varied datasets constructed on a daily basis are growing rapidly across scientific and engineering fields, including text categorization, medical imaging, computational biology, genomics and banking. Although abundant data implies more opportunities for extracting and uncovering useful knowledge, which may appear beneficial at first glance, the prohibitive time and storage requirements of SVM training become limiting (Liu et al. 2017b; Qiu et al. 2016).

In the training procedure, the training dataset is divided into two classes in order to determine a hyperplane, as shown in Fig. 3. Its location is identified by a (normally small) subset of vectors extracted from the training set (T), named support vectors (SVs). Identifying which vectors are selected as SVs boosts the interpretability of SVM decisions (Mountrakis et al. 2011). Although the hyperplane separates the data linearly, SVMs can deal with nonlinear problems by using kernel functions to map the data into spaces where they become linearly separable.

Flowchart 2. Life cycle of ML

An important obstacle of SVMs lies in their O(t²) memory and O(t³) time complexity, where t is the cardinality of T. This drawback has attracted researchers' attention; the developed approaches therefore target either improving the training procedure or obtaining reduced SVM training sets from which support vectors are likely to be identified (Nalepa and Kawulok 2019). This paper summarizes the accomplishments in this area.

3.3.1 Model Selection for SVMs

Model selection for support vector machines is intrinsically the problem of identifying the SVM hyperparameters, including the kernel and its parameters; it is thus a computationally expensive task (Ding et al. 2015). SVMs not only achieve high prediction rates in a large number of real applications; their efficiency and classification accuracy also depend on feature subset selection and parameter settings (Faris et al. 2018). Automated model selection is therefore essential, because unsuitably tuned parameters can impair SVM performance. A crucial obstacle in SVM modelling is determining optimal values for its hyperparameters, given their importance for the SVM's efficiency and effectiveness; this task is known as the SVM model selection problem (Kalita and Singh 2020). As shown in Flowchart 3, the SVM modelling stages run from inserting the input variables to the final decision on model selection.

Flowchart 3. SVM modelling process

For determining the parameters of an SVM, grid search (GS) is the simplest, best-known and most suitable algorithm; on the other hand, its time consumption limits GS for large-scale cases (Beltrami and da Silva 2020). Swamy (2018) suggested a sophisticated covariance-matrix-based adaptation strategy to deal with parameterized kernel spaces for kernel selection. This strategy outperforms a standard grid search for determining the hyperparameters (grid search is clearly not applicable for a wide range of parameters). Overall, the support vector machine is influenced by the determination of the regularization parameters and the kernel function (Liu et al. 2017b). Among the available choices, the Gaussian function, known as the radial basis function (RBF) kernel, is the most commonly applied (Beltrami and da Silva 2020). Based on the study of Aladeemy et al. (2017), the RBF kernel can achieve most decision boundary shapes. The RBF kernel is expressed by Eq. (4) or (5)

$$ K\left({x}_i,{x}_j\right)={e}^{-\frac{{\left|\left|{x}_i-{x}_j\right|\right|}^2}{2{\sigma}^2}} $$
(4)
$$ K\left({x}_i,{x}_j\right)={e}^{-\gamma {\left|\left|{x}_i-{x}_j\right|\right|}^2} $$
(5)
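A minimal grid search over the regularization constant C and the RBF parameter γ of Eq. (5) can be written with scikit-learn as below; the grid values and synthetic data are illustrative, not recommendations.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

param_grid = {
    "C": [0.1, 1, 10, 100],          # regularization constant of Eq. (2)
    "gamma": [1e-3, 1e-2, 1e-1, 1],  # RBF width parameter of Eq. (5)
}
# Exhaustively evaluate every (C, gamma) pair with 5-fold cross-validation
search = GridSearchCV(SVR(kernel="rbf"), param_grid, cv=5,
                      scoring="neg_root_mean_squared_error")
search.fit(X, y)
print(search.best_params_, -search.best_score_)
```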

To minimize feature redundancy and enhance prediction accuracy, Zhang et al. (2020) applied the minimum redundancy maximum relevance (MRMR) technique. This approach performed well in feature reduction, an efficient way to enhance the performance of the model selection framework.

The genetic algorithm (GA) performs well in solving problems of maximizing a fitness value (Roy et al. 2020), which explains the incorporation of genetic algorithms into the model selection criterion by Lessmann et al. (2006). To improve the GA's computational efficiency, Iram et al. (2018) proposed a genetic technique based on advanced selection and mutation schemes. The efficiency of this GA optimizer is due to the suggested operators, known as enhanced selection and log-scaled mutation GA (ESALOGA), which improve diversity, precision and consistency. In the hybrid GA, the optimizer is integrated with a gradient descent approach (Roy et al. 2020). GA-based model selection was utilized for smooth twin parametric-margin SVMs by Wang et al. (2013). The GA-based numerical moment matching approach proposed by Jahani et al. (2020) serves as the main uncertainty estimation approach for a wide range of datasets, using assessors' data and key features.

In the latest algorithm proposed by Faris et al. (2018), the multiverse optimizer (MVO) is implemented to select optimal features and optimize the SVM parameters simultaneously. For better accuracy and dimensionality reduction, a new variation of the cohort intelligence (CI) algorithm, suggested by Aladeemy et al. (2017), is applied. The twin support vector machine (TSVM) is developed from the standard support vector machine and is often reported to outperform it. However, while the wavelet TSVM enlarges the kernel function (KF) selection framework and improves generalization, it does not perform well on the parameter selection problem (Wang et al. 2020c).

Another option is grid search, which is popular and considered the easiest approach for choosing SVM parameters. The novelty of Beltrami and da Silva (2020) is applying the quadtree technique to the otherwise time-consuming GS in order to speed up the process and deal with large-scale cases. However, methods that analyze class separability outperform the grid search algorithm. Thus, Yin and Yin (2016) proposed a developed index named the expected square distance ratio (ESDR), which performs well as a class separability criterion. Also, the adaptive fusion of multiple kernel functions proposed by Wang et al. (2020b) achieves robust SVM generalization ability and increases forecast accuracy. Zhang and Song (2015) observed that different types of kernels might be similarly efficient for specific data and suggested a multilabel kernel recommendation approach based on the characteristics of the data. A motivating model-transferring method for model adaptation, named the heterogeneous max-margin classifier adaptation approach (HMCA), was proposed by Mozafari and Jamzad (2016).

With the aim of speeding up the model selection process of SVMs, parallel algorithms have been suggested (Devos et al. 2014). Using parallel algorithms based on grid search, and by removing the points that have an extremely small chance of becoming support vectors, the computational time has been reduced. To accelerate the process further, Fayed and Atiya (2019) proposed a faster technique using an SVM-based model with a particle swarm optimization algorithm.

Researchers pay more attention to constructing new kernels than to developing the current kernel functions (Zhang and Song 2015). For example, Gruca et al. (2014) proposed SVM techniques with kernels constructed by neuro-fuzzy systems. It should be borne in mind that identifying the targeted SVM model ought to be combined with methods for training SVMs on large datasets (specifically, for decreasing the cardinality of SVM training sets), since the best-performing kernel may depend on the outcome of the training set selection technique.

3.4 Decision Tree

The decision tree (DT) is a machine learning technique that is massively utilized in recognition, survival analysis, regression and classification (Li et al. 2019). Owing to its unique benefits, the DT has become one of the most used and most reliable techniques for prediction purposes. By training multiple trees and aggregating their predictions, the forecasting performance of an individual DT is enhanced (Rokach 2016). The decision tree methodology applies a data mining approach to construct classification systems, enhancing forecast techniques for a target variable based on multiple covariates (Song and Lu 2015). Generally, feature selection reduces the number of features while maintaining equal, or sometimes better, learning performance (Rao et al. 2019). Decision trees classify instances using the feature values of those instances; each node in a decision tree represents a feature of the instance to be classified (Pandya 2015).

Being non-parametric, the DT can deal efficiently with large, complex datasets without imposing a complex parametric structure, and it performs very well in prediction using historical data (Song and Lu 2015). The decision tree training process strongly needs to be parallelized to meet big data requirements (Meng et al. 2016); due to the high communication cost, Meng et al. (2016) suggested the parallel voting decision tree.
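As a brief illustration of the technique, the sketch below fits a depth-limited regression tree on synthetic data with scikit-learn; the parameter values are illustrative only.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, n_features=6, noise=5.0, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=1)

# max_depth limits tree growth; a shallow tree trades accuracy for readability
tree = DecisionTreeRegressor(max_depth=4, min_samples_leaf=10, random_state=1)
tree.fit(X_tr, y_tr)
print("R2 on held-out data:", round(tree.score(X_te, y_te), 3))
```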

3.4.1 Different Algorithms of Decision Tree for Regression Model

Decision Tree Learning

Decision tree learning uses a decision tree as a predictive model that maps observations about an item to conclusions about the item's target value (Somvanshi et al. 2017). The decision tree is preferred for its reputation of being easier to apply and explain than other quantitative data-driven approaches. As a regression model, a robust and simple decision tree learning algorithm was suggested by Liu et al. (2017a) to forecast the prices of copper and other metals. To identify abnormal damage and shortcomings, Abdallah et al. (2018) applied a decision tree learning algorithm to wind turbine telemetry.

A regression decision tree can be applied to regression cases involving a continuous target attribute. This type of tree may involve at least four different node representations: internal nodes can incorporate oblique or univariate tests, while the leaves can be joined with multivariate regression models or uncomplicated constant forecasts (Czajkowski and Kretowski 2016). The fundamental conception is to combine linear regression and decision trees to predict a numerical target attribute.

Nowadays, great research attention has focused on the construction of decision tree ensembles, motivated by the widely used Bayesian additive regression trees (BART) framework as a generative probabilistic model (Linero 2018). These techniques are executed by means of an efficient recursive partitioning model; the most efficient split at every tree node is typically chosen using a least squares error criterion (Rokach 2016). Taylor et al. (2018) report that regression tree models perform best without variable transformation when imputing missing values constrained by non-random missingness. Based on a leave-one-location-out cross-validation (LOLO CV) procedure, gradient boosting achieved the highest accuracy compared to ten other machine learning models, with the lowest root mean square error (RMSE) (Watson et al. 2019). Due to its high R2 accuracy, regression tree modelling may be preferred in practical terms for egg weight prediction over multiple linear regression and ridge regression techniques (Orhan et al. 2016).

Random Forest

Machine learning algorithms can handle interactions and nonlinearities without requiring a specific regression model to be specified (Pavlov 2019). The most preferred and widely utilized ML technique in the non-streaming (batch) setting nowadays is the random forest. This popularity is attributable to its low demands for hyperparameter tuning and input preparation combined with high performance (Zimmerman et al. 2018). It works by creating many DTs at training time and outputting the average forecast (regression) of the single trees (Xi et al. 2015).

For modelling regression relationships between air pollutant concentrations, random forests are widely applied as an advanced data mining approach (Kamińska 2018). In the challenging case of evolving data streams, however, the random forest (RF) approach does not perform better than the boosting or bagging techniques (Gomes et al. 2017). To minimize the overfitting risk, the random forest uses bootstrap aggregation of multiple regression trees; to achieve high-accuracy forecasts, it combines the forecasts extracted from many trees (Amaral et al. 2013). Using random forests, Shah et al. (2014) suggested multiple imputation, which can be helpful for epidemiology datasets. An evolved technique known as random forest spatiotemporal kriging (RF-STK) was developed to predict daily NO2 concentration, with good predictive results (Zhan et al. 2018b). Kakkar and Jain (2016) proposed a framework using attribute selection to improve defect prediction, reducing the total number of attributes utilized by an average of 6-fold for every dataset.
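The bootstrap-and-average principle described above reduces, in code, to a few lines; the scikit-learn sketch below is a generic illustration on synthetic data rather than a reproduction of any reviewed study.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=600, n_features=8, noise=8.0, random_state=2)

# Each of the 200 trees is grown on a bootstrap sample of the data;
# predictions are the average over all trees
rf = RandomForestRegressor(n_estimators=200, max_features="sqrt", random_state=2)
print("CV R2:", cross_val_score(rf, X, y, cv=5).mean().round(3))
```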

Gradient Boosting

The gradient boosting decision tree (GBDT) is a machine learning technique for regression and classification cases that generates a prediction model as an ensemble of prediction models (Aartsen et al. 2015). Its principle is to combine a set of weak classifiers to form a strong one. The primary obstacle is that evaluating the information gain of all possible split points for each feature requires scanning every single data instance, which is consequently time-consuming. To overcome this limitation, Ke et al. (2017) suggest two approaches: exclusive feature bundling (EFB) and gradient-based one-side sampling (GOSS). In the context of ozone concentration prediction, Jumin et al. (2020) applied a boosted decision tree that outperformed neural network techniques and linear regression for all stations examined. GBDT is a well-known approach with a fair number of efficient implementations, such as Extreme Gradient Boosting (XGBoost) and pGBRT (Rao et al. 2019).

The XGBoost package is an effective and sophisticated implementation of the gradient boosting framework (Melville 2014). XGBoost is a highly effective and widely utilized machine learning approach, massively applied by analysts to obtain desirable results on various machine learning challenges. The algorithm runs more than ten times faster than existing popular solutions on a single machine, and it scales to billions of examples in memory-limited or distributed settings (Agarwal et al. 1994). The system achieves higher forecast accuracy compared with other studies (Liu et al. 2020c). The package includes a tree learning algorithm and an effective linear model solver, and it provides different objective functions, including regression, classification and ranking. Since the system is designed to be modifiable, users can easily define new objectives based on their interests.
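Assuming the xgboost package is installed, a minimal regression sketch of the boosted tree workflow looks as follows; the hyperparameter values are illustrative.

```python
# pip install xgboost scikit-learn
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor

X, y = make_regression(n_samples=800, n_features=10, noise=10.0, random_state=3)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=3)

# Trees are added sequentially; each new tree fits the gradient of the loss,
# and learning_rate shrinks every tree's contribution
model = XGBRegressor(n_estimators=300, learning_rate=0.05, max_depth=4,
                     subsample=0.8, random_state=3)
model.fit(X_tr, y_tr)
print("R2 on held-out data:", round(model.score(X_te, y_te), 3))
```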

3.4.2 Attribute Selection for Regression

Dimensionality reduction and feature selection are crucial topics in building effective regression and classification models to enhance decision-making with data-based learning techniques. A suitable feature subset supports regression model performance, particularly in high-dimensional feature spaces (Zhang et al. 2018b). However, the comprehensibility of trees can be strongly affected by bias in split attribute selection, as conventional heuristic approaches are biased toward multivalued attributes. In the same context, to avoid the traditional heuristic attribute bias, a feature selection approach for nodes based on the decision tree model conception needs to be taken into account (Sun and Hu 2017). Among different types of machine learning, such as linear regressions (LRs), SVR, ridge regression, lasso regression, RF, ANNs and k-nearest neighbor (KNN), gradient boosted trees (GBTs) have the best performance in terms of correlation coefficient (R2), cross-validation (CV) and mean square error (MSE) values (Watson et al. 2019). A particle swarm optimization (PSO)-based feature selection combination was suggested by Niyonsaba and Jang (2015); the results showed that RF with 50 PSO particles achieved the best accuracy, 99.8%, compared to other models.

In a typical regression tree (RT), every leaf holds a fixed value, normally the mean value of the target attribute; a model tree can be seen as an extension of the ordinary RT. In general, feature selection reduces the number of features while keeping the same or even better learning performance (Rao et al. 2019).

3.5 Artificial Neural Network

The ANN is a black-box computational approach consisting of interconnected nodes arranged in network structures (Akrami et al. 2013). ANNs possess the capability and efficiency to deal with nonlinear cases and have eventually become the most widely implemented approach in recent years (Farzad and El-Shafie 2017). The structure consists of a single input layer, as many hidden layers as needed and an output layer. The ANN is widely recognized for its ability to forecast nonlinear variables (Kumar et al. 2020); the idea mimics human brain neurons transferring data to the following connected nodes. To form nonlinear relationships between the precursors of ozone concentration, Gavrila (2017) suggested a multilayer neural network model with error backpropagation, which showed the ability to forecast ozone on a short-term basis. To build a robust algorithm using ANN approaches, researchers face several shortcomings: finding the best input combinations, a proper transfer function and continuous time series data without missing values (AlOmar et al. 2020).

Using ANN and BNN models, Solaiman et al. (2008) investigated three emergent data-driven approaches to address the complexity of the nonlinear relationships between meteorological variables and ozone. The three dynamic neural network methods have different structures: Bayesian neural network models, recurrent neural networks and time-lagged feed-forward networks. The outcomes illustrate that all three models are suitable prediction tools that outperform the regularly utilized MLP and can be applied for short-period O3 concentration prediction; nevertheless, their inability to identify the underlying physical processes is the primary limitation when applied to pollutant prediction.

To predict pollutant concentrations of PM10, PM2.5, O3 and NO2, Agarwal et al. (2020) constructed an ANN model that achieved coefficients of determination (R2) of 0.88, 0.86, 0.87 and 0.79, respectively. However, over 4 subsequent days, prediction accuracy dropped to a low value, with an R2 for O3 of 0.48. Hooyberghs et al. (2005) describe the development of a neural network tool to forecast daily average PM10 concentrations with good accuracy, up to 0.8 for R2. The wavelet adaptive neuro-fuzzy inference system (ANFIS) model proposed by Bhardwaj and Pruthi (2020), an integrated intelligent scheme incorporating the learning potency of ANNs, is less resource-intensive and more effective than existing models in predicting nitrogen dioxide (Esfandani and Nematzadeh 2016).

In another wide-area study, implemented by Goulier et al. (2020), the concentrations of ten air pollutants in a street canyon in Münster were predicted from three predictor categories (sound, time, traffic) using an artificial neural network. The results showed high accuracy for NO2 and O3, reflecting the reliability of the models' forecasting ability when integrated with all three input variable groups. With the aid of random grid search for hyperparameters, adopted as an optimizer in an ANN model predicting O3, Liu et al. (2019) achieved high prediction accuracy, with an R2 of 0.9896. A GA-integrated artificial neural network also performed very well compared to other models in predicting air overpressure (Jahed Armaghani et al. 2018).

3.5.1 Feed-Forward Neural Networks

The most parsimonious model suggested by Offenberg et al. (2017), a feed-forward neural network selected by AICC, uses 11 inputs, a single hidden layer of 4 tanh activation function nodes and a single linear output function. An artificial neural network algorithm based on a feed-forward backpropagation network was developed by Hafeez et al. (2020) to model and predict ozone production. Interestingly, this study indicated that the proposed ANN model achieved an R2 of 0.9965.
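The parsimonious 11-4-1 architecture described above can be sketched with scikit-learn's MLPRegressor, whose output is linear for regression; the synthetic data and training settings below are illustrative assumptions, not a reproduction of Offenberg et al. (2017).

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(4)
X = rng.normal(size=(400, 11))                    # 11 input variables
y = np.tanh(X[:, 0]) + 0.5 * X[:, 1] + 0.1 * rng.normal(size=400)

# One hidden layer of 4 tanh nodes and a linear output node
net = make_pipeline(
    StandardScaler(),
    MLPRegressor(hidden_layer_sizes=(4,), activation="tanh",
                 max_iter=5000, random_state=4),
)
net.fit(X, y)
print("training R2:", round(net.score(X, y), 3))
```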

3.5.2 Recurrent or Feedback Neural Networks

The recurrent neural network (RNN) works on the principle of keeping the outputs of a layer and feeding them back into the input to help forecast the layer's outcome (Chung et al. 2015). The first layer is shaped like a feed-forward neural network, computing the sum of the products of the features and the weights. As the RNN progresses from one time step to the next, each neuron retains part of the knowledge obtained from the previous time step while the new computation is performed (Jin et al. 2017). The RNN can be viewed as an MLP network upgraded with feedback loops in its structure (Mo et al. 2020), which implies that every neuron acts as a memory cell while performing computations.
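The recurrence described above amounts to a single update rule: the new hidden state combines the current input with the state carried over from the previous time step. The NumPy sketch below illustrates one Elman-style RNN step with hypothetical dimensions (e.g. three meteorological inputs over a 24-step hourly sequence); the weights are random and untrained.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One Elman RNN step: the new hidden state mixes the current input
    with the memory carried over from the previous time step."""
    return np.tanh(x_t @ W_xh + h_prev @ W_hh + b_h)

rng = np.random.default_rng(5)
n_in, n_hid = 3, 8                       # e.g. 3 meteorological inputs
W_xh = rng.normal(scale=0.3, size=(n_in, n_hid))
W_hh = rng.normal(scale=0.3, size=(n_hid, n_hid))
b_h = np.zeros(n_hid)

h = np.zeros(n_hid)                      # initial memory
for x_t in rng.normal(size=(24, n_in)):  # a 24-step (hourly) sequence
    h = rnn_step(x_t, h, W_xh, W_hh, b_h)
print(h.shape)                           # final hidden state
```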

An ANN with backpropagation (BP), using a sigmoid activation function and a middle layer, and its hybrid with a genetic algorithm (BP-GA) have been utilized to enhance the proposed method's performance (Esfandani and Nematzadeh 2016). To forecast ozone using only meteorological parameters as input, Biancofiore et al. (2015) implemented a neural structure technique that proved much more efficient than the multiple linear regression algorithm. For long-period ozone prediction, the complementary ensemble empirical mode decomposition with cycle reservoir with jumps and multiple linear regression (CEEMD + CRJ + MLR) shows outstanding performance, with an R2 of 0.9763 (Mo et al. 2020).

3.5.3 Multilayer Perceptron

MLP networks are a combination of interconnected nodes, or neurons, organized into input, output and hidden layers. The number of nodes in the input layer is based on the number of input variables, while the output layer contains a single node resembling the target variable (Cabaneros et al. 2017). In the context of ozone concentration prediction, Capilla (2016) studied two efficient methods: multiple linear regression and neural network models. The comparison between multiple linear regression and the multilayer perceptron was based on RMSE, mean absolute percentage error (MAPE) and R2 for short-term ozone prediction; the results showed that the MLP achieved higher accuracy than MLR. The MLP also performs better than SWAT in predicting sediment yield, based on the determination coefficient R2 (Singh et al. 2012). For ozone prediction, the MLP shows better evaluation results, with a 9% improvement in correlation coefficient compared to RBF (Kovač-Andrić et al. 2016). After comparing types of feature selection, for instance PCA, stepwise regression and classification and regression trees (CART) within the MLP, MLP-PCA showed slightly better performance in predicting NO2 (Cabaneros et al. 2017). Messikh et al. (2017) employed the MLP in modelling phenol removal, where, interestingly, the model reached an R2 of 0.99.

In a research study carried out by Chattopadhyay and Bandyopadhyay (2007) on ozone prediction in Switzerland using artificial neural networks with backpropagation, models were built as one-hidden-layer and two-hidden-layer perceptrons with sigmoid activation functions and a learning rate of 0.9 at peak ozone concentration. The results showed that both techniques are very promising, while the two-hidden-layer perceptron performed better in predicting mean monthly total ozone concentrations. However, performance varies between models, implying that increasing nonlinearity in the model contributes much to ozone prediction and that the dataset has varying degrees of complexity.

In line with this attempt, a comparative study of two ML models, nonlinear regression and the MLP, was performed for the prediction of tropospheric O3 (Chattopadhyay et al. 2019). These models utilized PM10, SO2, NOx and temperature as independent variables to predict the dependent variable O3. The study showed that the MLP with a tan-hyperbolic activation function achieved a markedly higher correlation between the predicted and actual datasets. The results highlighted that the MLP with gradient descent has similar efficiency in predicting ozone concentration in the tropospheric layer. However, the coefficients of determination (R2) of 0.435, 0.343 and 0.272 are low for the MLP1, MLP2 and regression models, respectively.

Using an artificial neural network based on principal component analysis, Chattopadhyay and Chattopadhyay (2012) built a model to predict monthly total O3 concentration. The model was constructed by extracting predictor variables via PCA and feeding them to the ANN. Rainfall and cloud cover were found to be good predictors of monthly total ozone. Furthermore, the ANN shows better potential for predicting monthly total ozone during the pre-monsoon and winter seasons than during and after the monsoon. Nevertheless, the number of data points needs to be increased, and more specific research, such as prediction from daily and hourly datasets, is required. Similarly, Chattopadhyay et al. (2012) applied PCA to the data to remove multicollinearity before training an ANN model. This attempt showed that the proposed ANN model generates very good predictions; however, the predictions for two zones do not reach a significant degree of accuracy, with WI values of 0.301 and 0.254 (Chattopadhyay et al. 2012).
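The PCA-then-ANN workflow used in these studies can be sketched as a simple pipeline: standardize, project onto principal components to remove multicollinearity, then train an ANN on the component scores. The scikit-learn sketch below uses synthetic data and illustrative settings, not the data or configuration of the cited studies.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(6)
X = rng.normal(size=(500, 9))            # correlated predictors would go here
y = X[:, :3].sum(axis=1) + 0.1 * rng.normal(size=500)

# Standardize, keep components explaining 95% of the variance, then feed
# the decorrelated scores to an ANN, mirroring the PCA-then-ANN workflow
model = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.95),
    MLPRegressor(hidden_layer_sizes=(8,), max_iter=5000, random_state=6),
)
model.fit(X, y)
print("components kept:", model.named_steps["pca"].n_components_)
```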

3.6 Hybrid Model

Hybrid models combine various individual forecast algorithms, an approach that tackles many shortcomings of predictive algorithms at the cost of a more complicated final solution. The primary target is to select and fairly integrate a series of predictive models in a way that enhances final forecast accuracy (Rozinajová et al. 2018). The accuracy gain is obtained by combining the strengths of individual forecast models while reducing their defects. The essential targets in constructing hybrid models are to enhance their robustness and to increase accuracy and generalization ability (Wu and Shahidehpour 2014). Recently, hybrid models have been widely utilized in air pollution prediction. Ensembles of decision trees are best known for achieving high-quality forecasts in non-parametric regression cases (Linero 2018). Combining RF algorithms with accurately controlled advanced multipollutant sensor packages, such as real-time affordable multipollutant (RAMP) monitors, represents an auspicious technique for tackling the poor performance of low-cost air quality sensors (Zimmerman et al. 2018).

A single ANN model and a single econometric model presented lower prediction accuracy than the hybrid model proposed by Kim and Won (2018). The overall prediction ability of the hybrid ARIMA-SVM model is likewise improved over that of the SVM model alone (Nie et al. 2012). A highly efficient hybrid model was suggested by Kai et al. (2017), which combined various individual models to tackle their limitations; the outputs show that this hybrid model outweighs the baseline individual models. Ismail et al. (2011) showed that SOM-LSSVM performs better than a single LSSVM. Owing to this, the integration of various models can outperform a single model (Zhang et al. 2018a). Yuan et al. (2016) combined autoregressive integrated moving average (ARIMA) and GM models using specific weights to predict energy consumption and found that the prediction performance of this model is higher than that of the single ARIMA and GM models.
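A common way to build such an ARIMA-ML hybrid is to let ARIMA capture the linear structure while a nonlinear learner models the residuals; the sketch below pairs statsmodels' ARIMA with an SVR on lagged residuals. It is a generic illustration under simplifying assumptions (synthetic series, arbitrary order and lag count), not the specific formulation of Nie et al. (2012) or Kim and Won (2018).

```python
# pip install statsmodels scikit-learn
import numpy as np
from sklearn.svm import SVR
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(7)
t = np.arange(500)
y = 10 + 0.9 * np.sin(2 * np.pi * t / 24) + 0.3 * rng.normal(size=500)

# Stage 1: ARIMA captures the linear/seasonal structure
linear = ARIMA(y, order=(2, 0, 1)).fit()
resid = linear.resid

# Stage 2: SVR models the nonlinear remainder from lagged residuals
lags = 3
X = np.column_stack([resid[i:len(resid) - lags + i] for i in range(lags)])
target = resid[lags:]
svr = SVR(kernel="rbf", C=10.0).fit(X, target)

# Hybrid in-sample fit = linear fit + nonlinear residual correction
hybrid = linear.fittedvalues[lags:] + svr.predict(X)
print("hybrid RMSE:", np.sqrt(np.mean((y[lags:] - hybrid) ** 2)).round(3))
```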

Rubal and Kumar (2018) proposed combining advanced differential evolution techniques with RF algorithms rather than relying on existing stand-alone approaches. Their study showed that the proposed approach outweighs single models built with an independent classifier, a multilabel classifier or a Bayesian network algorithm. Similarly, in forecasting carbon monoxide, the study of Masih (2018) showed that SVM and ANN perform poorly compared to ensemble classifiers such as RF and bagging, implying that hybrid models are more robust.

Rahmati et al. (2020) made the first attempt to distinguish the source areas of dust with the aid of hybrid ML models, intelligent systems consisting of ANFIS combined with meta-heuristic optimization models: the cultural algorithm (CA), differential evolution (DE) and the bat algorithm (BA). The outcome is that the hybrid ANFIS-DE model deserves further investigation as a cost-effective, auspicious approach for effectively locating dust source areas.

Recently, a novel clustering-based ensemble model (CEeSNN) for air pollution prediction based on evolving spiking neural networks (eSNN), suggested by Maciąg et al. (2019), performed better than other models such as the singleton NeuCube model, an MLP network and the ARIMA model. However, based on the RMSE, MAE and MAPE evaluation indexes, Box-Jenkins ARIMA approaches outweigh the neural networks (Elman recurrent neural network, backpropagation neural network and radial basis function neural network). Based on combined experimental methods (Xi et al. 2015), the outcomes show that hybrid models are better than single models.

4 Conclusion

Air pollution is critical and important because of its environmental and social impacts. The majority of the papers reviewed were carried out in China, the USA, India, Malaysia and Australia, which means ML algorithms are becoming an increasingly common tool in environmental health. In this paper, a review of concentration prediction for different air pollutants, especially ozone, has been conducted. The essential target is to investigate recent approaches to predicting ozone concentration using ML algorithms. The review covers four areas of interest based on ML: ANN, SVM, decision tree and hybrid models. It concluded that:

  • A variety of theoretic approaches has been mentioned and discussed in terms of their methodology and effectiveness: the information theoretic, fuzzy set theoretic and probabilistic set theoretic approaches. The evaluated performance of these approaches varies with the procedure and theory applied, as well as with the complexity of the datasets utilized and their duration profiles.

  • The SVM is an ML approach for pattern recognition whose performance depends on the determination of its parameters (Beltrami and da Silva 2020). Generally, based on previous studies, it is clear that the SVM is highly influenced by the choice of kernel function and the selection of the regularization parameters (αs) (Liu et al. 2017b). The SVM, based on Tanaskuli et al. (2020), was preferred for predicting ozone concentration because of the high performance and accuracy achieved.

  • The decision tree algorithm is an ML model for which it is hard to justify which variant should be applied; identifying the best decision tree in advance is often impossible. Generally, heterogeneous node representation is needed throughout the tree for various problems. Owing to this, Czajkowski and Kretowski (2016) proposed an extension of the GMT and GRT systems called the mixed global model tree (mGMT), a specialized evolutionary algorithm (EA), to better understand the process underlying representation selection. This state-of-the-art technique can test models in the leaves or internal nodes and search for an optimal tree structure.

  • The ANN algorithm progresses in a cumulative process, transferring from neuron to neuron across the three categories of layers: input layer, hidden layers and output layer. To build a robust algorithm using ANN approaches, researchers are exposed to several shortcomings: finding the best input combinations, a proper transfer function and continuous time series data without missing values. Based on the reviewed papers, the MLP has been widely used in various fields and in air pollutant concentration prediction because of the high accuracy achieved.

  • Hybrid models combine various individual forecast algorithms, an approach that tackles many shortcomings of predictive algorithms at the cost of a more complicated final solution. The essential targets in constructing hybrid models are to enhance their robustness and to increase accuracy and generalization ability. Based on the papers reviewed, hybrid models clearly outweigh single models in accuracy, computation time and other factors, which is why they are auspicious algorithms in data mining.

ML techniques such as DT, SVM, ANN and hybrid models play a crucial role in all artificial intelligence applications. This paper summarizes the accomplishments in this area, which continue to grow with advances in ML algorithms for ozone concentration prediction. Given this development, several fruitful algorithms can be anticipated in the future.