1 Introduction

Weather forecast has become an essential asset in our society, from the simple temperature and rain trends seen every day on newspaper and TV to domain-specific forecasts such as weather guidance for airports, wind and solar energy generation, emergency alerts for rainfall, flash floods, severe weather, hurricanes, etc. The existing methods for weather forecast can roughly be categorized into two classes (Sun et al. 2014), (i) methods based on numerical weather prediction (NWP) models, and (ii) empirical methods based on extrapolations from storms growing rates, velocities from radar echoes, infrared satellite images, etc.

Forecasts based on NWP use complex numerical integration schemes to solve the governing equations of physical phenomena in the atmosphere. These equations demand initial and boundary conditions coming from large data volumes from ground and satellite observations, and must to be processed in large computational systems in order to solve and create the weather forecast numerical models outputs in time to prevent natural disasters. Nowadays, sophisticate and computationally intensive data assimilation schemes are employed in models as NCEP-CFSv2 (Kistler et al. 2001) and ERA5 (Malardel et al. 2016).

The second approach, based on extrapolation methods, is gathering a lot of attention in recent years. Due to the development of artificial intelligence, especially deep learning, forecasts based on the analysis of historical time-series, images, and satellite data are achieving stunning results.

Two main sub-domains of meteorology ad atmosphere sciences are leading this development. The first one is related to the prediction of extreme weather conditions such as hurricanes (Prabhat et al. 2015; Moradi Kordmahalleh et al. 2016; Racah et al. 2017; Pradhan et al. 2018). The second sub-domain is that of precipitation nowcasting, whose resolution and time accuracy needs are much higher than other traditional forecasting tasks like weekly average temperature prediction. These domains rely on computer vision techniques, which have proven useful for making accurate extrapolation of radar and satellite maps (Sakaino 2013; Shi et al. 2015; Agrawal et al. 2019).

While weather forecasts usually focus on natural events as flash floods, rain, wind gust, and snowstorms, all embedded in the troposphere, the upper atmosphere also demands a special and expensive modeling treatment since this region may play an important role all over weather and climate circulations (Scaife et al. 2012). The stratosphere is a barotropic portion of the Atmosphere and its scalar fields advection (heat, gases, and particulate materials) happens constricted between in isentropic layers, in an almost horizontal displacement, a simpler environment if compared with the troposphere. As the main stratospheric information comes from remotely sensed data from satellites, its modeling may benefit from deep learning extrapolation methods, allowing, for example, to estimate the wind field using the cloudiness displacement and time-difference techniques, i.e., the atmospheric scalar field itself used to estimate the winds that promote its transport.

Among these high atmosphere interest topics, our interest is directed towards the forecast of the Ozone (\(O_3\)) layer. Ozone is the most important constituent of stratospheric gas traces. Due to its ability to absorb ultraviolet radiation (UV) (Salby 1996; Dobson 1968), \(O_3\) is also the most important component in the stratosphere from the point of view of skin protection against harmful UVB solar radiation. Many sources contribute to the monitoring of stratospheric \(O_3\). While ground-level equipment can provide information about the Total Ozone Column (TCO\(_3\)) every couple of minutes and ozone sounding balloons can provide ozone profiles for different heights, most of the data used to estimate the O3 global coverage is originated from satellite sources, which update only once or twice a day. While some NWP models are used to forecast \(O_3\) concentration (e.g., NASA’s OzoneWatch and ESA’s TEMIS websites), no extrapolation methods seem to have been used until now.

In this paper, we aim at the observation and forecast of Ozone total column and notably “Ozone Secondary Events” (OSE), i.e., episodes of exchange between the polar vortex and the mid-latitudes and the tropics (Bencherif et al. 2007; Marchand et al. 2005) in which the polar vortex is deformed and ozone-poor polar air masses move towards mid-latitudes. We aim to provide accurate predictions two to four days before the arrival of OSE above populated areas. This paper largely expands the preliminary results we presented in a poster at EGU General Assembly 2019 (Steffenel et al. 2019). Besides the proposal of a forecasting framework based on deep learning, this work contribution includes an extensive analysis of the results in both domain-specific parameters (interest of the results from the meteorological point of view) and deep learning metrics (image similarity, for example). To our knowledge, this is one of the first attempts to forecast the circulation of the Ozone layer using deep learning. In addition, this study aims to present deep learning as a viable, fast, and computationally inexpensive technique for forecasting, as the only computing-intensive phase is the training phase.

The remainder of this paper is structured as follows: Sect.  2 introduces the Ozone Secondary Event problem and reviews existent works that use Deep Learning techniques to perform forecasting. Section 3 describes our forecasting framework, including data preparation and training steps. Section 4 illustrates the forecasting framework through a detailed analysis of different OSE recorded in the last 10 years. In this section, we perform both domain-specific (meteorology) and domain-agnostic analysis of the results. Section 5 addresses directions to improve the framework in future works. Finally, Sect.6 presents our conclusions.

2 Problem description and methodology

2.1 Forecasting ozone secondary events

Since its discovery, the “Antarctic ozone hole” has attracted the interest of the scientific community. An ozone hole area is defined as a region with values below 220 DU (Hofmann et al. 1997). The concentration of ozone in a particular region of the Earth is mainly determined by the meridional transport of this element in the stratosphere (Gettelman et al. 2011). The explanation for the higher concentration of ozone found in polar rather than equatorial regions (where there is greater production) is precisely a special type of poleward transport known as the Brewer-Dobson circulation, in which air masses are transported quasi-horizontally from the stratospheric tropical reservoir to polar regions (Brewer 1949; Dobson 1968; Bencherif et al. 2007).

Given the dynamics of the atmosphere, the ozone hole is accompanied by episodes of exchange between the polar vortex and the mid-latitudes and the tropics (Bencherif et al. 2007; Marchand et al. 2005). During these events, the polar vortex is deformed and ozone-poor polar air masses move towards mid-latitudes. We call these episodes of isentropic exchanges “Secondary Effects of the Antarctic Ozone Hole” or “Ozone Secondary Effects” (OSE, for short). This temporary drop in ozone content first was observed by Kirchhoff et al. (1996) above the south of Brazil. Such episodes may last several days and reach the tropics, causing significant ozone decreases over these areas and potentially increasing the UV radiation levels at the surface (Bencherif et al. 2007; Casiccia et al. 2008).

Secondary effects of the Antarctic Ozone Hole in mid-latitude zones are regularly observed above populated zones in South America, south of Africa, and New Zealand. Indeed, ground observations reported the occurrence of dozens of OSEs in the last decades above the south of Brazil (Bittencourt et al. 2019), resulting in temporary reductions in the total ozone column of more than 10% over densely populated areas. According to the latest World Meteorological Organization (WMO) reports (Research and Project 2014, 2018), there is a growth trend between the 1980s and 1990s, stabilizing at high rates since the year 2000 despite indications of declining trends in Antarctic ozone in recent years (Solomon et al. 2016).

An example of OSE the event from October 10–14, 2012 presented below in Fig. 1a, which shows the displacement of a polar air mass (in blue) and its extension towards the south of Brazil – around 30\(^{\circ }\)S, driving a 13.7% reduction in the total ozone column (Vaz Peres et al. 2017). Figure 1b, which shows the potential vorticity (PV) map during this event, also illustrates the movement of a polar air mass (in blue) and its extension towards mid-latitudes above South America.

Fig. 1
figure 1

OSE event as a observed from space on 12 October 2014 and b corresponding PV simulated by the MIMOSA-Chim model (Hauchecorne et al. 2002)

It is also important to note that most works on OSE limit to the study of past events (Canziani et al. 2002; Bencherif et al. 2011; Peres 2013; Bittencourt et al. 2019), whose characterization depends on several indirect factors besides the \(O_3\) Total Column. For example, Potential Vorticity (PV) has been used to identify air masses originated in the pole, confirmed by a retroactive trajectory analysis using the HYSPLIT model (Stein et al. 2016). Forecasting OSE was never the main issue up to now.

Predicting sudden reductions in the Ozone coverage like OSE is important as the additional UV radiation may trigger several problems at the public health level. Indeed, reductions of up to 1% in total ozone content in southern Brazil cause an average 1.2% increase in surface ultraviolet radiation (Guarnieri et al. 2004) and, according to UNEP (United Nations Environment ProgramFootnote 1), the reduction of 10% of the stratospheric Ozone would cause additional 300,000 cases of carcinoma (malignant skin tumors) and 4500 cases of melanoma (skin cancer) each year, worldwide. Even if OSE hardly last more than 5–7 days, the increase of UV radiation is often in the order of 10% and may happen overnight, surprising individuals that are more exposed or require additional protection. OSE also may have undesirable effects on the flora and fauna, with notable risks for agriculture (Krupa and Jäger 1996) and biodiversity with, for example, the decline in amphibian species due to genetic malformations caused by increased UV radiation levels (Londero et al. 2019).

Unlike other regions of Brazil, the weather conditions in southern Brazil are strongly influenced by transient meteorological systems (Reboita et al. 2010). Examples of such systems are cold and hot fronts, which carry strong westerly winds at high tropospheric levels. Moreover, the upper troposphere–lower stratosphere (UT–LS) region in southern Brazil seems to be the home of many dynamical processes, such as stratosphere-troposphere exchanges and isentropic transport between the tropical stratosphere reservoir, polar vortex, and middle latitude. Indeed, understanding the patterns of the UT–LS is important in understanding transport and exchange processes and the links with tropospheric meteorology (Ohring et al. 2010).

2.2 Stratospheric \(O_3\) forecast with numerical models

Traditional models for \(O_3\) or other atmospheric constituents are often based on the combination of satellite/ground observations and numerical weather prediction (NWP) models. Indeed, the former is used to set demand initial and boundary conditions, while the latter relies on equations aiming to represent different physical phenomena. This integration (also known as assimilation) involves complex schemes and parameter tuning. Parameters are chosen to best reflect a physical (and chemical) understanding of the characteristics of the studied phenomenon, often requiring large computational systems to solve the forecast models (Godin-Beekmann 2010).

Nowadays, sophisticated data assimilation schemes for weather forecasts are used in models such as NCEP-CFSv2 (Kistler et al. 2001) and ERA5 (Malardel et al. 2016). In the case of \(O_3\), assimilation is also a way to provide a “complete” view of the globe despite the less frequent coverage from the satellites, as illustrated in Fig. 2.

Eskes et al. (2002) present one of the firsts results from ECMWF traditional NWP data assimilation system, using a three-dimensional ozone advection and near real-time \(O_3\) satellite observations as input. They showed that it was possible to forecast \(O_3\) concentrations for 6 days in advance in the extratropics and just for 2 days in the tropical region, opening the possibility to the use of NWP to forecast the South Pole Ozone hole dynamics in a 4–5 days time range.

The tropospheric chemistry model in the integrated forecast system of the European Center for Medium-Range Weather Forecasts (ECMWF) includes a stratospheric chemistry module (Huijnen et al. 2016) to improve stratospheric composition compared with the old chemistry module. The stratospheric \(O_3\) partial columns (10–100 hPa) show biases smaller than ±20DU when compared to the Aura MSL observations (Errera et al. 2019) and keeps the performance event the Antarctic hole season. However, Davis et al. (2017) evaluates water vapor and \(O_3\) data coming from different reanalysis models (ERA-40 and ERA-Interim from Europe, JRA-25 and JRA-55 from Japan, CFSR, MERRA and MERRA-2 from USA), concluding that the handling of ozone varies substantially among reanalyses. While the comparison is not simple, they were able to reproduce the Ozone Total Column (TCO) \(\sim\)10DU (3%) relatively to the observations. In the high stratosphere, the bias is ±20% of the \(O_3\) total column, while in the upper troposphere and lower stratosphere biases increase to ±50% of the \(O_3\) total column. Davis et al. (2017) also observes that the use of reanalysis ozone for Antarctic ozone hole studies is problematic, producing reasonable maps when satellite observations are available, and highly biased maps when observations are unavailable.

Fig. 2
figure 2

A collection of 24 h of GOME data on 30 Nov 1999, and the correspondent assimilated ozone field at 12 GMT [13]

Despite the high potential biases for \(O_3\) estimation, several web sites propose Ozone (\(O_3\)) global coverage information, including NASA OzoneWatchFootnote 2 and the Tropospheric Emission Monitoring Internet Service (TEMIS) from ESAFootnote 3. OzoneWatch relies on the MERRA-2 assimilation model (Gelaro et al. 2017) and the GEOS model, but it does not provide \(O_3\) forecasts on the website. TEMIS relies on the TM3-DAM model (Eskes et al. 2002; Elbern et al. 2015), and its website also proposes an 8-days forecast generated from the assimilation model.

Assimilation is therfore a computing intensive activity, requiring not only the integration of satellite (and ground/sounding balloons) data but also the execution of numerical models for both physical (transport) and chemical parameters. While precise details about the operation (execution time, number and type of nodes) of TEMIS or OzoneWatch are not available, the assimilation workflow described in Eskes et al. (2002) let us infer an important computational demand:

Every day two forecast runs are performed. Directly after completion of the 10-day ECMWF forecast (started at 12:00 UTC) the meteorological fields are extracted from the archive. The wind fields are converted into mass fluxes in a preprocessing step, and the data is sent to KNMI. Upon arrival an analysis and forecast run is started at KNMI, based on the latest near-real time GOME ozone data. Twelve hours later a new forecast run is performed, based on the same meteorological fields, but with an additional 12h of GOME measurements. (Eskes et al. 2002)

Besides global models, regional models are also an alternative for numerical models forecast. WRF (Skamarock et al. 2019) is a well-known simulation tool, which includes a chemical model plugin, WRF-Chem. WRF-Chem can be used to represent \(O_3\) transport (Thomas et al. 2019, 40), but is essentially a tropospheric model that relies on fixed values derived from climatology standards to represent the \(O_3\) concentration in higher altitudes.

2.3 Stratospheric \(O_3\) forecast with deep learning

Contrarily to traditional assimilation models, Artificial Intelligence models do not rely on numerical models but try to extract patterns from raw input data, using sophisticate statistical methods. This approach also has the advantage of being computing-intensive only during the “train” phase, where the patterns are extracted. Once consolidated, AI models are often fast to deploy, producing forecasts in a small amount of time, even with reduced computing resources. The recent breakthroughs in hardware and programming methods for artificial intelligence encouraged the development of machine learning strategies to complement (if not to concurrence) existing numerical models.

As stated by Shi et al. (2015), recent advances in deep learning such as recurrent neural network (RNN) and long short-term memory (LSTM) provide useful tools to address this problem. LSTM has been widely used to predict wind regimes (zhi Wang et al. 2017), pollution concentration (Zhang et al. 2020), weather conditions (Miao et al. 2020) or flooding risks (Ding et al. 2020) on specific zones. Recently, Mbatha and Bencherif (2020) developed a hybrid data-driven forecasting model, based on LSTM applied to the total column of ozone time-series recorded at Buenos-Aires, Argentina from 1966 to 2017. However, pure LSTM is not enough to deal with multidimensional data like the spatial distribution of air masses.

Predicting the shape and movement of air masses (and other meteorological phenomena) is a real challenge for predictive learning. Contrarily to time-series forecasting or object trajectory tracking problems, the speed of air masses are not regular, their trajectories are not always periodical, and their shapes may accumulate, dissipate or change rapidly due to the complex atmospheric environment. Hence, modeling spatial deformation is significant for the prediction of this data.

These technical issues may be addressed by viewing the problem from the machine learning perspective. In essence, the meteorological forecast is a spatiotemporal sequence problem where a set of past maps (radar, satellite) are used as input, and a sequence of maps is produced as output. However, due to the complex structure of atmospheric maps, capturing the spatiotemporal structure of the data would be a hard task with traditional machine learning techniques, especially those based on supervised learning.

For this reason, Shi et al. (2015) formulated a spatiotemporal sequence forecasting problem and proposed the Convolutional Long Short-Term Memory (ConvLSTM) model, which extends the LSTM (Hochreiter and Schmidhuber 1997) to tackle problems such as the precipitation nowcast problem by using radar echo sequences for model training. The ConvLSTM model was further developed in Shi et al. (2017), where GRU (Gated Recurrent Units) are used instead of LSTMs.

In Shi et al. (2015), the radar echo maps are first transformed to grayscale images before being fed to the prediction algorithm. Thus, precipitation forecast can be considered as a type of video prediction problem with a fixed “camera”, which is the weather radar. For this reason, methods proposed for predicting future frames in videos are also applicable to weather forecast. Ranzato et al. (2014) proposed the first RNN based model for video prediction, which uses a convolutional RNN to encode the observed frames. Srivastava et al. (2015) proposed the use of a LSTM encoder-decoder network to predict multiple frames ahead; this model was generalized in Shi et al. (2015) by replacing the fully connected LSTM with ConvLSTM to better represent spatiotemporal correlations. Also Finn et al. (2016) and Jia et al. (2016) extended the ConvLSTM model by making the network predict the input frame instead of the raw pixels. More recent works try to increase the recurrence depth of the networks, all while improving spatial correlations and short-term dynamics. Among such works, we can cite the VPN—Video Pixel Network (Kalchbrenner et al. 2017), the PredRNN, PredRNN++, and Eidetic 3D LSTM models from Wang et al. (2017), Wang et al. (20185), Wang et al. (2019), as well as the Cubic LSTM model (Fan et al. 2019).

Some works try to break free from LSTM, and we can cite (Agrawal et al. 2019), which relies on the U-Net CNN model and (Villegas et al. 2017), which proposed to use both an RNN that captures the motion and a CNN that captures the content to generate the prediction. Along with RNN based models, 2D and 3D CNN based models are also present in the literature (Mathieu et al. 2016; Vondrick et al. 2016). A recent example is the work from Zheng et al. (2020), who create a model based on cascading CNN layers to forecast sea surface temperatures based on satellite data.

Please note that while precipitation nowcast is one of the major target subjects from the literature, other applications may also profit from the same neural networks, as the prediction of air pollution dissemination (Fan et al. 2017), the transportation of volcanic plumes (du Preez et al. 2020) or the estimation of sea surface temperature (Jonnakuti et al. 2020). These models can also complement existing NWP models (Wiegerinck et al. 2019) or, in our case, be applied to forecast the stratospheric ozone circulation.

3 Implementing an OSE forecast model

3.1 Model description and configuration

For this work, we chose to rely on the PredRNN++ model (Wang et al. 20185). PredRNN++ is a recurrent network model for video predictive learning. Its predecessor, PredRNN, was already used in a precipitation nowcast scenario (Wang et al. 2017), so we adopted PredRNN++ as its performance and stability fits our needs for meteorological forecasting. Also, the source code for the model is freely available at Github (https://github.com/Yunbo426/predrnn-pp), which allows us to concentrate on data preparation and parameter tuning.

PredRNN++ is structured as a network of CausalLSTM and Gradient Highway Unit (GHU) modules. The first ones work as a cascaded mechanism, where the spatial memory is a function of the temporal memory structures, while the GHU organizes the LSTM architecture, preventing the gradients of the objective function from vanishing during back-propagation. Details of the internal organization of PredRNN++ can be seen in Fig. 3.

Fig. 3
figure 3

PredRNN++ model scheme, associating CausalLSTM and gradient highway units (GHU) (Wang et al. 20185)

The PredRNN++ model we used consists of two layers with 128 hidden states each. The convolution filters are set to \(1 \times 1\), in order to avoid losing information. For training, the dataset was sliced in consecutive images with a 9-days sliding window. Hence, each sequence consists of 36 frames, 20 for the input (5 days), and 16 for forecasting (4 days).

The choice of the hyperparameters (number of layers and hidden states, filter grids) was deduced empirically from several parameter combinations during a prototyping phase (for example, we considered 64–64, 64–64–64, 128–128, 128–128–128, and 128–64 layers, as well as 1, 2, 4, 16, 64 filters). For instance, adding more layers tended to smooth the resulting images, losing resolution on long-term forecasts. Similarly, using 64 hidden states accelerates the training phase but result in less sharp structures, and more than \(1 \times 1\) filter grids result in “pixelated” images that don’t fit our requirements.

3.2 Data preparation and training

We first collect the Total Column Ozone (TCO\(_3\)) dataset from ERA5 reanalysisFootnote 4, which covers the Earth on a 30 km grid. This parameter is the total amount of ozone in a column of air extending from the surface of the Earth to the top of the atmosphere. The ERA5 units for total ozone are kilograms per square meter (\(\text{kg}/\text{m}^{2}\)). Another common unit for Ozone total column is the Dobson Unit (DU), where \(1\,DU = 2.1415\times 10^{-5} \text{kg}/\text{m}^{2}\).

For this work, we delimited the study on the area between \(70^{\circ }S,100^{\circ }W\) and 20\(^{\circ }\)S,30\(^{\circ }\)W, which includes to the southernmost part of South America and parts of Antarctica. This area allows a good coverage for the air masses coming from the polar vortex, who eventually reach the Southern Spatial Observatory (SSO), located at 29.443752° S, 53.823084° W (Fig. 4).

The dataset used in this experiment consists of about 58500 observations for each coordinate, covering the period from 1980 to 2019 and recorded every 6 h. It was split into two sets: the train dataset, which corresponds to the period between 1980 and 2009, and a test dataset for the period 2010–2019.

As the PredRNN++ model was conceived to accept images as input, we decided to automatize the generation of TCO\(_3\) maps using the GrADS application. GrADS is used to generate maps of the O\(_3\) concentration from ERA5 NetCDF files, converting them to pixel values which are stored as \(128\times 128\) gray-scale images 5a. As GrADS maps are presented as layered zones (i.e., discretized data), we defined 8 levels covering the usual O\(_3\) concentration range in the atmosphere (\(5\times 10^{-3}\) to \(8\times 10^{-3}\; \mathrm{kg/m}^2\)).

Fig. 4
figure 4

Delimited area for O\(_3\) forecasting

During the training phase, the mini-batch size is set to 16, and the training process is stopped after 300,000 iterations. The training phase required about 22 h on a node from the ROMEO Computing Center (each node has \(2\times\) Intel® Xeon\(^{\mathrm{TM}}\) Gold “Skylake” 6132 - \(2\times 14\) cores 2.60 GHz, 96GB DDR4 RAM, and 4x NVidia Tesla P100/16GB SXM2).

While the training phase is computing intensive even in a high-end machine such as the ROMEO supercomputer, the trained model can be easily deployed and executed, even in machines without GPUs. For instance, making a forecast takes about 1 min in a CPU-only machine, with most of the time dedicated to the load of the model.

The resulting forecast is presented as a set of images with gradient values (Fig. 5b). To improve the readability and the analysis of the structure of the air masses, we also provide the output images as contour layered maps (using Matplotlib), as presented in Fig. 5c.

Fig. 5
figure 5

Example of input (a) and output (b, c) images

4 Results

At first, we notice how complex is the atmospheric motion: contrarily to most examples in the literature [moving MNIST Srivastava et al. 2015, KTH action Schuldt et al. 2004], atmospheric motion is variating due to interactions in the atmosphere. As a result, forecasting more than a few days is still very difficult. The length of the input dataset is also a problem, as too many frames favor large-scale events (and require a lot of memory), while too few frames don’t provide enough information for the learning model. In this work, we decided to provide 20 input frames (equivalent to 5 days of measures) and generate frames for the next 4 days (16 frames).

In this section, we concentrate our analysis on a list of 24 OSE compiled by Bittencourt et al. (2019). These events occur during the Austral Spring and are tightly related to the “opening” of the South Pole Ozone hole and the breaking of the polar vortex. As a consequence, such an event may present unusual atmosphere interactions that make forecasts more complex. Although the model can be applied to any period in the year, we concentrate in these cases as the prediction and quantification of OSE events is the main goal of our project MESOFootnote 5. For this reason, the following examples concern predictions with data up to 3 days before the events. In our understanding, this gives a reasonable margin to raise alerts to the population. Furthermore, many satellite datasets are not provided in real-time, so predictions must also take into account this delay.

In order to evaluate the accuracy of the machine learning model, we present 3 different modes of evaluations that will be developed in the next sections:

  1. 1.

    Graphical similarity analysis of graphical metrics between ground truth images and predicted ones;

  2. 2.

    Frame structures analysis of the frame structures from the meteorological point of view;

  3. 3.

    Scalar predictions capability of the framework to estimate TCO\(_3\) values at a given coordinate.

4.1 Graphical similarity

Before conducting a domain-wise analysis, i.e., the evaluation of the quality of the forecasts from the meteorological point of view, we present a generic analysis of the results based on image similarity metrics. These metrics are often used to quantify the noise or the divergence between images and can be applied to our results in order to verify how close are the predictions concerning validation data.

In Table 1 we list the per-frame mean square error (MSE) and the mean absolute error (MAE), the peak noise-signal ratio (PSNR), and the structural similarity index measure (SSIM) (Wang et al., 2004). In the case of MSE, MAE, low values indicate less difference or “noise” with respect to the reference image. In the case of PSNR and SSIM, higher values are better.

Table 1 Average metric values for different OSE forecasts (3 days before). These metrics are averaged over the 16 predicted frames

For each metric, Table 1 emphasizes the maximum and minimum values. From this list, we selected some OSE for further analysis (shaded cells in Table 1). The selected events include most of the maximum and minimum values cited above (or are sufficiently close to these extremes), as well as an average case.

To better evaluate these events, Fig.  6 shows the frame-wise evolution of each metric. For instance, the OSE cases from September 2016 and August 2017 shows a rapid degradation. The other examples are much more stable, keeping high structural similarity (SSIM) and low levels of MSE and MAE.

Fig. 6
figure 6

Frame-wise comparison of different predictions with metrics MSE and MAE (lower values are better), PSNR and SSIM (higher values are better)

When generalizing the analysis to the whole validation period (2010–2019), we obtain the summary statistics from Table 2. Please note that the case from November 2014 cited above stands very close to the average.

In addition, Fig.  7 shows this distribution by year. The boxplots indicate that MSE and MAE have too many outliers, which may compromise the search for reference parameters for an eventual automatic quality evaluator. On the opposite side, PSNR and SSIM are more stable metrics, with more balanced quartiles that may help the setup of an automatic quality assessor.

Table 2 Image metrics distribution from 2010–2019
Fig. 7
figure 7

Example of input (a) and output (b, c) images

4.2 Frame structural analysis

In the previous section, we compared forecasts using standard image similarity metrics. While such comparison provides useful insights into the raw efficiency of the deep learning algorithms, it doesn’t include domain-specific knowledge about the observed events. For this reason, this section provides a “qualitative analysis” based on visual insights from the resulting forecasts.

Indeed, we try here to evaluate the ability of the model to provide convincing (and precise) forecasts based on a set of historical spatiotemporal data. Therefore, Figs. 8 to 11 present a visual summary from the predictions for the OSE studied in the previous section.

The average atmospheric flow on the Southern Hemisphere mid-latitudes goes from West to East, or also called westerly winds. In Fig. 8, \(t= 1\), it is possible to observe that an area of high TCO\(_3\) values is present at windward of the Andes is embedded downstream of a trough that will soon cross the mountain chain. This high TCO\(_3\) concentration air mass is related to the convergence produced by the trough and do not cross at once, but fractionally due to the resistance offered by the mountain range. In the following frames, it is possible to observe that this mass crosses towards the leeward side of the mountain and presents itself as a high concentration TCO\(_3\), probably due to convergence produced by a cyclonic vortex circulation at the trough axis over Argentina and South Brazil.

As in the training phase, 20 frames from the past are used to predict 16 frames in the future. In the case from Fig.  8, which represents the OSE from September 26th, 2012, the forecast renders the air masses with enough similarities in both shape and intensity to the real data (ground truth), even when considering several frames in the future. Please remember that the deep learning model only has access to the “Inputs” part of the ground truth frames, the “Targets and Predictions” frames on the right are presented only for comparison purposes.

A similar result is presented in Fig. 9. Not only the model is able to render the TCO\(_3\) concentration and the trough pattern visible at \(t=25\) and \(t=29\) but is also predicts a new ozone-rich mass arriving at \(t=36\) from the west.

Fig. 8
figure 8

TCO\(_3\) prediction 3 days prior the September 26th, 2012 OSE. a Ground truth; b prediction output; c layered output

Fig. 9
figure 9

TCO\(_3\) prediction 3 days prior the November 7th, 2014 OSE. a Ground truth; b prediction output; c layered output

Although the prediction represents valuable information to short term weather analysis, not all long-term forecasts perform that well. In some cases, the model degrades too fast or diverges after a few frames, which is mostly due to the non-linear nature of the atmospheric physical phenomena. In other cases, complex atmospheric structures such as ridges and troughs, wave trains, and vortexes are clearly present, tending to be smoothed, losing sharpness, although still presenting valuable information for weather forecasters.

We can observe this in Fig. 10, which represents TCO\(_3\) predictions for November 2014. While the model provides frames with similar forms up to \(t=25\) and \(t=29\), the remainder frames keep a close TCO\(_3\) concentration shape but present a faster displacement eastward relative to the ground truth, as a new front with reduced Ozone concentration comes from the west. Also, the input frames present filaments and complex structures that are too specific to be followed by the model, at least in the current level of training.

The case of Fig. 11 is even more evident, with the model being unable to represent the evolution of the air masses. The succession of a low TCO\(_3\) air mass in the south followed by a high TCO\(_3\) from the west confused the model, which finally tends towards “average” values that minimize the overall distance.

From this analysis, we can affirm that the framework provides accurate frames for many OSE, but there are still some situations that require a model improvement. Nevertheless, the qualitative information provided may be a reliable resource to weather analysis operational services. Among the strategies to improve the model (also discussed later, in the Sect. 5), we can cite additional training rounds, larger coverage zones, data enrichment with other parameters besides TCO\(_3\) and perhaps the use of GANs (Generative adversarial networks) (Goodfellow et al. 2014).

Fig. 10
figure 10

TCO\(_3\) prediction 3 days prior the September 16th, 2016 OSE. a Ground truth; b prediction output; c layered output

Fig. 11
figure 11

TCO\(_3\) prediction 3 days prior the August 30st, 2017. a Ground truth; b prediction output; c layered output

Besides the comparison with ground truth (i.e., the observed satellite images), we also tried to compare our predictions against forecasts from a numerical model in order to establish a comparison baseline with existing forecast methods.

This task proved to be hard for several reasons. First, only TEMIS publishes forecasts on its web site, but these forecasts are not archived for further analysis. Hence, we could not access archived data for the same events presented above. Nonetheless, we started a collecting campaign, storing daily forecast since the beginning of September 2020.

We selected a scenario observed from September 16th, 2020 and, using PredRNN++, we produced forecasts for the same period. As a result, Fig. 12 presents the resulting forecasts side by side, together with the satellite ground truth. The main elements from the satellite images are present in all forecasts. First, the OSE ribbon is seen crossing Argentina on September 16, dissipating the next day (this is better seen on TEMIS in grayscale). The arrival of an \(O_3\) air mass is better predicted by the global model (TEMIS) than by our model based only on regional images, but the subsequent concentration of \(O_3\) at the Atlantic coast on September 19th is found in both TEMIS and PredRNN++ outputs.

While this single example does not constitute a comparison baseline, it emphasizes the potential of our approach. In truth, OSE forecasting (and stratospheric \(O_3\), in general) is an evolving target as both \(O_3\) production, destruction, and transport are influenced by external factors from both human and natural sources. Historical averages can’t show the trends of \(O_3\) over the years, neither explains the increasing number of OSE in recent years. However, identifying recurrent atmospheric transport patterns, something that Artificial Intelligence models excel at, may help to predict future events.

Fig. 12
figure 12

TCO\(_3\) predictions from September 16,2020. Ground truth, our forecasts and TEMIS forecasts (in color and grayscale)

4.3 Scalar prediction analysis

This last part is dedicated to the evaluation of the scalar precision of the forecast model. While the previous sections demonstrate that a good compromise can be achieved when regarding the structural similarity between the forecast frames and the observed data, it is also important to evaluate the accuracy of the predicted \(0_3\) column values.

To perform this evaluation, we compare the TCO\(_3\) raw values from ERA5 over one interest location—the Southern Space Observatory in Brazil—and compared with data extracted from the forecast frames. This last step is achieved by reversing the input process, which translates the input matrix from ERA5. For recall, TCO\(_3\) data is normalized and transformed into layered grayscale images by the GrADS software, then feed to the deep learning algorithm. The output is presented as gradient values, that once denormalized give values in similar ranges from ERA5.

Taking again the same four OSE examples from the previous sections, we obtain the results presented in Fig. 13. Here, the blue line represents the daily measures from ERA5, the vertical line indicates the beginning of the forecast and the orange line represents the predicted values. As before, we start predicting 3 days before the registered OSE (the small arrow in the lower right). In the case of the September 20th, 2012 event (Fig. 13a), predictions fit really well the observed data. The remainder cases (Fig. 13b–d) show a relative error in the TCO\(_3\) estimation but in most cases (except Fig. 13b), the forecasts tend towards the same values levels as the observed data.

We believe that the differences are mostly caused by the use of layered images from GrADS. Indeed, values are discretized in a limited number of layers, which introduce bias from the real values and fool the prediction algorithm. Using plain normalized data from ERA5 may help improve the precision of the forecasts.

Fig. 13
figure 13

Comparison between predicted TCO\(_3\) values and raw data from ERA5 for the location of the Southern Space Observatory—SSO (\(29.443752^{\circ }S\),\(53.823084^{\circ }W\))

5 Discussion and future works

The analyses from the previous section allow us to identify several strategies to improve the quality of the forecasts. Indeed, predictions are not only expected to keep a high similarity level of the structures (air masses) over time but also to allow the estimation of TCO \(_3\) over a locality with a reduced error.

Finding a good combination of input data and hyperparameters is the first step to do. The experiments presented here consider 20 frames (5 days) input dataset, which is a minimum threshold to capture the atmospheric circulation patterns. Adding extra input frames can help the model to understand additional interactions, at the expense of more memory. Also, as discussed earlier, this may reinforce the weight of large scale events, at the expense of rare or episodic patterns. Extra training epochs may help improve this resolution, but we must be extremely attentive to overfitting during the training phase.

Since we perform predictions on independent geographical tiles (i.e., a regional coverage), border effects are also a problem. When air masses originate outside the boundaries of a tile, the model can only guess their arrival based on recurrent historical patterns. Figure  11 shows an instance of this where the model is unable to predict a mass rich in Ozone coming from the west. By enlarging the tile towards the west would include more information about air masses coming to a given interest point. Another possible approach is to provide images containing a South Pole stereo-polar view, where an circular continuous flow can be observed, giving therefore a synoptic observational field to the PredRNN++ training system.

The analysis of graphical similarity metrics also shows that Structural Similarity (SSIM) gives much more valuable information about the quality of the forecasts. Adapting SSIM as a loss function on PredRNN++ may improve the accuracy of the model, especially for long-term forecasts.

Improvements may also be made concerning the input data quality. As explained before, the current framework uses data transformed by the GrADS application, which creates layered maps with flatten values categories. While this seems to accelerate the learning process, it induces a bias on the output values for a given location.

Data quality can also be enriched through the addition of other parameters of interest. While the current experiment uses only TCO\(_3\), at least two other atmospheric components are known for their correlation with the Ozone circulation: the temperature and the potential vorticity (PV). Instead of a simple channel in a grayscale image, these other values could be added as additional image channels, in a similar way as RGB images can be handled.

Another direction that may be explored to refine the topological structure and hyperparameters of the neural network is the use of Generative Adversarial Networks (GANs) (Goodfellow et al. 2014), which have shown excellent results in problems in which the output have quality requirements to match.

Despite such improvements, we strongly believe that AI-based methods are not intended to replace existing models but to complement them. Forecasts generated by deep-learning methods may represent an asset for traditional assimilation methods, highlighting patterns that are hard to identify and model otherwise. Besides, deep-learning models offer a fast and non-expensive way to forecast scalar advection, providing first-hand elements for decision making while numerical-based models forecasts are not yet available.

Our team also plans to expand the usage of PredRNN++ and future algorithms towards other usage cases. Indeed, scalar advection can be applied to other subjects, if properly trained. Wildfire smoke and volcanic plums forecasting are among the subjects currently studied by our laboratories, and we believe that the algorithms used in this paper may be easily transposed to solve these problems.

6 Conclusion

In this paper, we explore the usage of deep learning to handle stratospheric Ozone (\(O_3\)) spatiotemporal forecast. The motivation is that the Ozone layer lies in a zone in the middle atmosphere (stratosphere) with peculiar characteristics that are prone to deep learning extrapolation methods, instead of traditional numerical simulation methods. Indeed, the stratosphere is a barotropic portion of the atmosphere whose scalar fields advection (heat, gases, and particulate materials) are constricted between in isentropic layers, in an almost horizontal displacement. Data comes essentially satellite observations, which can easily be handled as matrix or images.

We focused our study on the forecast of Ozone Secondary Events (OSE), episodes of exchange between the polar vortex and the mid-latitudes and the tropics in which the polar vortex is deformed and ozone-poor polar air masses move towards mid-latitudes, resulting in drastic increases in the UV radiation. We aim to provide prediction tools to early identify such events.

In this paper, we leveraged a spatiotemporal algorithm initially developed for video frame predictions, adapting it to accept satellite data. We evaluate the performance of the algorithm output under both deep learning metrics (image similarity, for example) and domain-specific requirements (interest of the results from the meteorological point of view), showing interesting results and directions for improvement. To our knowledge, this is one of the first attempts to forecast the circulation of the Ozone layer using deep learning.

Another interest of deep learning-based forecasts is the amount of computing resources they require. If the training phase of a model may be long and computing-intensive, performing a prediction is very fast and requires almost no computing resources. By comparison, forecasts based in traditional numerical models often require dedicated infrastructures and several computing hours to process the data assimilation and the forecast models. Therefore, deep learning models allow almost immediate updates of the forecast with good short-term prediction quality, which is especially interesting for new satellite sources such as GOES-16 that produce hourly updates. If the total transition from traditional forecast methods in favor of deep learning remains an open question, the combined usage of both methods in an ensemble forecast approach is a promising choice to consider.