Reconstructing Missing and Anomalous Data Collected from High-Frequency In-Situ Sensors in Fresh Waters

Kermorvant, Claire; Liquet, Benoit; Litt, Guy; Jones, Jeremy B.; Mengersen, Kerrie; Peterson, Erin E.; Hyndman, Rob J.; Leigh, Catherine

doi:10.3390/ijerph182312803

¹ Laboratoire de Mathématiques et de Leurs Applications de Pau Fédération MIRA, UMR CNRS 5142, Université de Pau et des Pays de l’Adour, 64600 Anglet, France
² Department of Mathematics and Statistics, Macquarie University, Sydney, NSW 2109, Australia
³ National Ecological Observatory Network, Battelle Boulder, Boulder, CO 80301, USA
⁴ Institute of Arctic Biology and Department of Biology and Wildlife, University of Alaska Fairbanks, Fairbanks, AK 99775, USA
⁵ School of Mathematical Sciences, Queensland University of Technology, Brisbane, QLD 4000, Australia
⁶ ARC Centre of Excellence for Mathematics and Statistical Frontiers, Melbourne, VIC 3000, Australia
⁷ Peterson Consulting, Brisbane, QLD 4000, Australia
⁸ Department of Econometrics and Business Statistics, Monash University, Clayton, VIC 3800, Australia
⁹ Biosciences and Food Technology Discipline, School of Science, RMIT University, Bundoora, VIC 3083, Australia
* Author to whom correspondence should be addressed.

Int. J. Environ. Res. Public Health 2021, 18(23), 12803; https://doi.org/10.3390/ijerph182312803

Submission received: 27 October 2021 / Revised: 26 November 2021 / Accepted: 2 December 2021 / Published: 4 December 2021

(This article belongs to the Special Issue Statistical Advances in Environmental Sciences)

Abstract

In situ sensors that collect high-frequency data are used increasingly to monitor aquatic environments. These sensors are prone to technical errors, resulting in unrecorded observations and/or anomalous values that are subsequently removed and create gaps in time series data. We present a framework based on generalized additive and auto-regressive models to recover these missing data. To mimic sporadically missing (i) single observations and (ii) periods of contiguous observations, we randomly removed (i) point data and (ii) day- and week-long sequences of data from a two-year time series of nitrate concentration data collected from Arikaree River, USA, where synoptically collected water temperature, turbidity, conductance, elevation, and dissolved oxygen data were available. In 72% of cases with missing point data, predicted values were within the sensor precision interval of the original value, although predictive ability declined when sequences of missing data occurred. Precision also depended on the availability of other water quality covariates. When covariates were available, even a sudden, event-based peak in nitrate concentration was reconstructed well. By providing a promising method for accurate prediction of missing data, the utility and confidence in summary statistics and statistical trends will increase, thereby assisting the effective monitoring and management of fresh waters and other at-risk ecosystems.

Keywords:

anomaly correction; generalised additive model (GAM); missing data reconstruction; remote sensing; water quality

1. Introduction

Water quality sampling and analysis commonly relies on manual approaches, such as grab sampling and laboratory analyses, often conducted at monthly or longer intervals for variables such as sediment and nutrient concentration [1]. As such, the ability to capture water quality events or determine patterns and trends at fine spatial and temporal resolution is often limited [2]. Advances in the development of in situ, high-frequency environmental sensors have led to their expanded use in environmental monitoring, including for fresh waters [3]. As the cost-effectiveness and telecommunications capability of these sensors increase, their ability to provide high-frequency data in near real time likewise increases, allowing managers and decision makers to act in a timelier and more spatially specific fashion. The large datasets generated by high-frequency in situ sensors also present new opportunities for scientists when analysing, modelling, and reporting water quality data [4,5]. Consequently, high-frequency datasets collected from in situ sensors can provide a more thorough understanding of water quality dynamics at multiple time scales and help to improve data quality assurance and quality control [6].

In situ sensors, despite their benefits, are prone to technical errors due to biofouling, power failures, and other issues. These errors can lead to technical anomalies in water quality data and potentially confound the assessment or identification of true changes in water chemistry [7]. Given that the high frequency and large size of these datasets precludes the use of manual anomaly detection methods (one part of the entire data quality assurance and quality control process), various automated approaches have been proposed. For example, Shi et al. (2018) [8] integrated a wavelet artificial neural network with surrogate measurements for rapid warning of water quality anomalies, Liu et al. (2020) [9] integrated a Bayesian autoregressive model with an Isolation Forest algorithm for combined prediction and detection, while Rodriguez-Perez et al. (2020) [10] developed a semi-supervised Bayesian artificial neural network approach. To assist both developers and end users, Leigh et al. (2019) [7] developed a ten-step anomaly detection framework to systematically implement and compare suites of anomaly detection methods based on end-user needs.

Regardless of the method used to detect water quality anomalies from in situ sensors, observations that are labelled as anomalous are often subsequently removed from the time series, rendering them missing. Furthermore, given the variety of types of technical anomalies, such as sudden spikes, unrealistic values, drift, or periods of anomalously high or low variability [7], the resultant time series may contain missing point observations and/or sequences of contiguously missing observations after the data are passed through an anomaly detection algorithm. Failure to replace anomalies with corrected data may occur because methods to confidently reconstruct (accurately predict) the true values of the missing water quality observations are not available. Missing data then create data quality issues [11] and can lead to biased estimates of parameters, increased standard errors, decreased statistical power, and lost information [12], which may hinder the calculation of summary statistics [13] and affect statistical trend detection [14]. Slater et al. (2017) [15], for example, demonstrated via simulation that the loss in trend detection tends to increase with an increasing size of missing data ‘gaps’ and decreasing length of time series.

Many of the methods commonly used to reconstruct missing water quality data were developed before the proliferation of high-frequency sensors, such as infilling based on surrounding data [16,17], regression analysis [18,19], state–space models with an expectation–maximization algorithm [20], or artificial neural networks [21,22,23], and, therefore, were targeted at data with lower-frequency time steps, such as daily data. More recently, but also based on daily data, various infilling techniques, such as regression, scaling, and equi-percentile approaches [24], along with dynamic regression models [25], have been used to reconstruct missing streamflow data. Methods developed in other domains, such as computer science, have reconstructed missing sensor data based on temporal or spatial correlation, interpolation, and sparse theory [26]. In sensor networks, linear and non-linear regression methods have been developed that use the non-missing data adjacent to the missing data [27], along with algorithms based on combining K-means algorithms and neural networks with particle swarm optimization [28]. Similar methods have been developed in other domains, such as those used for power systems within computer science [29], while in the engineering domain, bidirectional recurrent neural networks have been developed to reconstruct sensor data used to monitor bridge construction [30]. As can be seen, these various methods of data reconstruction have been developed to be fit for purpose, as solutions for domain-specific problems. Hence, we aimed to develop a suitable method to reconstruct high-frequency nutrient data collected from in situ sensors in rivers, a problem that, to our knowledge, is yet to be addressed.

In the environmental domain, and in river management in particular, the monitoring of nutrients, especially nitrate concentration, is particularly important. In its bio-available form, nitrate is assimilated for growth and metabolism by riverine biota (e.g., algae, macrophytes, and some bacteria) that form the basal components of aquatic food webs [31]. However, an excess of nitrate can lead to problems such as eutrophication, which decreases light infiltration and dissolved oxygen concentration [32,33] and, in turn, can negatively affect the health of aquatic biota such as fishes and invertebrates [34,35,36], as well as increasing costs for water treatment and complicating management of river ecosystems spanning catchment headwaters to receiving waters downstream, including oceans [37]. Furthermore, nitrate concentration can vary substantially in space and time in river ecosystems due to instream processes and external inputs [38,39]. This high spatial and temporal variation has increased the interest in and use of high-frequency, in situ nitrate sensors in river monitoring programs and, thus, the need to develop appropriate methods to reconstruct missing nitrate concentration data from the resulting time series.

While it is important to develop a sound method to confidently reconstruct missing nitrate data for use in environmental management, the use of nitrate data can also serve as a case study to demonstrate the potential for the method to be applied more broadly. As such, the objective of this study was to develop and test a data reconstruction method using both a real time series of high-frequency nitrate concentrations and a simulation study.

2. Materials and Methods

2.1. Reconstruction Method

Let Y be the response (i.e., dependent) variable of interest and Y_t the value taken by Y at time t. For covariates X ∈ ℝ^m (i.e., m explanatory variables or predictors), we denote by X_kt the kth covariate observed at time t. We then identified two possible cases: (i) all X_kt are available at the same time step as the variable of interest Y_t, and (ii) at least one covariate is not available at time t. For the first case, when all X_kt were available, we used a generalised additive model (GAM) to predict Y at each time t that was missing, following the equation:

Y_t = \beta_0 + \sum_{k=1}^{m} s_k(X_{kt}) + \varepsilon_t    (1)

where X_kt is the value of the kth covariate measured at the tth sample. Here, β₀ is an intercept and ε_t is an error term following the usual i.i.d. assumptions made about regression errors, ε_t ~ N(0, σ²). The smooth function s_k(·) associated with each water quality variable X_k was defined using thin plate spline regression [40]. A forwards and backwards stepwise variable selection procedure was implemented and the ‘best’ GAM (in terms of variables selected and penalisation of smooth splines) was identified based on the Akaike Information Criterion (AIC) [41].
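For readers less familiar with the selection criterion, the standard definition of the AIC is recalled below. This is the textbook form, not a formula taken from the paper. Treating the parameter count p of a penalised GAM as the effective degrees of freedom of the fit is a common convention and is assumed here, not confirmed by the text.

```latex
% Textbook AIC; \hat{L} is the maximised likelihood of the fitted model and
% p the number of parameters (for a penalised GAM, commonly the effective
% degrees of freedom; an assumption here, not a detail stated in the paper).
\mathrm{AIC} = -2 \ln \hat{L} + 2p
% Among candidate models fitted to the same data, the one with the smallest
% AIC is retained at each step of the stepwise search.
```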

For the case when at least one covariate was not available at time t, such that Y_t could not be predicted with the GAM, we used an autoregressive integrated moving average (ARIMA) model (Figure 1) [42]. For each missing Y_t that could not be predicted with a GAM, we used the 500 previous Y observations (Y_{t−500}, Y_{t−499}, …, Y_{t−2}, Y_{t−1}) and selected the best ARIMA model by AIC comparisons. Prediction intervals (95%) associated with the reconstructed values were then calculated according to the model used.
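The two-case rule described above can be sketched as a small dispatch routine. This is only an illustration in Python (the authors' analysis was done in R): the function names, the use of `None` to mark a missing value, and the toy data are all assumptions made for this sketch, and no model is actually fitted.

```python
from typing import Optional, Sequence

WINDOW = 500  # number of preceding observations used for the ARIMA fit (from the text)


def choose_model(covariates_at_t: Sequence[Optional[float]]) -> str:
    """Return 'GAM' when every covariate is observed at time t, else 'ARIMA'."""
    return "GAM" if all(x is not None for x in covariates_at_t) else "ARIMA"


def arima_history(y: Sequence[Optional[float]], t: int, window: int = WINDOW) -> list:
    """Return the (up to) `window` values of y immediately preceding index t."""
    start = max(0, t - window)
    return list(y[start:t])


def plan_reconstruction(y, covariates) -> dict:
    """Map each missing index of y to the model that would be used to fill it."""
    plan = {}
    for t, value in enumerate(y):
        if value is None:  # only missing responses need reconstruction
            plan[t] = choose_model(covariates[t])
    return plan


# Hypothetical toy series: two missing nitrate values, two covariates per step.
y = [4.1, None, 4.3, None]
covariates = [(12.0, 0.5), (12.1, 0.6), (12.2, None), (12.3, None)]
plan = plan_reconstruction(y, covariates)
```

In this toy example the first gap would be filled by the GAM, because both covariates are present, and the second by ARIMA, because one covariate is missing.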

2.2. Arikaree River Case Study: Applying the Reconstruction Method

We applied the missing-data reconstruction method to time series of water-quality data collected from Arikaree River, a small wadeable stream in the semi-arid eastern Colorado plains of the United States of America. The Arikaree River site has a catchment of 2632 km², comprising mainly grasslands and irrigated agricultural land, and is part of the National Ecological Observatory Network (NEON). NEON collects and provides open data from aquatic and terrestrial sites across the United States of America (USA), including data from high-frequency, in situ sensors. NEON conducts standardised configuration, calibration, and preventive maintenance procedures on all their sensors [43,44] and follows in situ measurement and sample analysis protocols as outlined in [45]. As such, the Arikaree River site provided us with a suitable time series of water quality data for the purposes of this study.

Several water quality variables are available from each NEON aquatic site (Table A1). Nitrate concentration [46] is measured in µmol/L using a 10 mm path length SUNA V2 UV light spectrum sensor. The SUNA V2 collects data reported as a mean value from 20 measurements made during a sampling burst every 15 min. The published nitrate resolution is 0.1 µmol/L and the manufacturer’s stated sensor accuracy is approximately 2 µmol/L or 10% of the reading above 20 µmol/L. We, therefore, report units of measurement for nitrate in µmol/L (1 µmol nitrate/L = 0.062 mg nitrate/L). Other co-located sensors report specific conductance (µS/cm), dissolved oxygen (mg/L), water temperature (°C), and turbidity (Formazin Nephelometric Units, FNU) data as one-minute instantaneous measurements [47,48]. Water elevation data (i.e., water level as meters above sea level) are published as five-minute averaged measurements from data sampled at 1 min intervals [49].
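The unit conversion quoted above can be checked with a short calculation. It assumes that mass is expressed as the nitrate ion (NO₃⁻) and not as nitrate-nitrogen, which is consistent with the stated factor of 0.062, and it uses standard atomic masses.

```latex
% Molar mass of the nitrate ion from standard atomic masses (N: 14.007, O: 15.999)
M(\mathrm{NO_3^-}) \approx 14.007 + 3 \times 15.999 = 62.004\ \mathrm{g\,mol^{-1}}
% Converting a concentration of 1 micromole per litre to mass per litre
1\ \mu\mathrm{mol\,L^{-1}} \times 62.004\ \mu\mathrm{g}\,\mu\mathrm{mol}^{-1}
  \approx 62.0\ \mu\mathrm{g\,L^{-1}} \approx 0.062\ \mathrm{mg\,L^{-1}}
% The stated sensor accuracy of 2 micromoles per litre therefore corresponds to
2\ \mu\mathrm{mol\,L^{-1}} \approx 0.124\ \mathrm{mg\,L^{-1}}
```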

For this study, we used a two-year period of nitrate concentration data from one of two sites (the downstream site) on the Arikaree River, from October 2018 to October 2020 (n = 73,056 nitrate concentration observations), in which there were already missing point data and missing periods of data (14,283 missing observations of nitrate concentration = 20% of the nitrate data in total) (Figure 2). These data had all been removed from the time series as part of the NEON data quality assurance and quality control process. For example, there was a technical issue with the nitrate sensor during winter 2019 so no nitrate concentration measurements were available for the first three months of 2019. Missing data were also present in the time series of the covariates as a result of quality control and assurance processing: 29% of the temperature time series, 35% of the specific conductance time series, 14% of the dissolved oxygen time series, 47% of the turbidity time series, and 11% of the elevation time series.

We considered nitrate concentration as our Y and the other water quality variables (specific conductance, dissolved oxygen, water temperature, turbidity, and water elevation) as the covariates X that could be related to Y [50]. Visual examination of the distributions of the response and covariates indicated that turbidity had a strongly right-skewed distribution and was, therefore, log-transformed (i.e., log (turbidity + 1)) prior to analysis [51]. We also included two additional covariates to account for temporal autocorrelation in the time series, as determined by the AIC. The first additional covariate was nitrate concentration at one time step before time t (i.e., Y_{t−1}) and the second was nitrate concentration at two time steps before time t (i.e., Y_{t−2}).
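The derived covariates described above (log-transformed turbidity and the two lagged nitrate values) could be built along the following lines. This is a hedged Python sketch and not the authors' R code: the function name, the dictionary keys, the use of `None` for missing values, and the choice of the natural logarithm are assumptions.

```python
import math


def build_features(nitrate, turbidity):
    """Build per-time-step derived covariates: log(turbidity + 1) and two nitrate lags.

    Missing inputs are represented by None and propagate to the outputs.
    """
    rows = []
    for t in range(len(nitrate)):
        lag1 = nitrate[t - 1] if t >= 1 else None  # Y_{t-1}
        lag2 = nitrate[t - 2] if t >= 2 else None  # Y_{t-2}
        turb = turbidity[t]
        # Natural log assumed; the +1 offset keeps zero turbidity finite.
        log_turb = math.log(turb + 1.0) if turb is not None else None
        rows.append(
            {"log_turbidity": log_turb, "nitrate_lag1": lag1, "nitrate_lag2": lag2}
        )
    return rows


# Hypothetical toy values (nitrate in µmol/L, turbidity in FNU).
rows = build_features([5.0, 6.0, 7.0], [0.0, None, math.e - 1.0])
```

One consequence of using lagged nitrate as covariates is that a missing value at t − 1 or t − 2 makes a covariate unavailable at time t, which under the rule in Section 2.1 would route that time step to ARIMA.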

2.3. Simulation Study: Performance Evaluation

To evaluate the performance of our reconstruction method, we then repeatedly and randomly removed different combinations of data (both point observations and sequences of contiguous observations) from the two-year time series of nitrate concentration from Arikaree River. For the missing point data, we randomly removed 20%, 30%, and 40% of the observations from the nitrate concentration time series and repeated this process 100 times each (Simulations 1, 2, and 3). For the missing sequences of data, we randomly removed ten individual days (10 × 24 h worth) of observations, repeating the process 100 times (Simulation 4), as well as ten individual weeks of observations, again repeating the process 100 times (Simulation 5).
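The removal schemes used in the five simulations can be sketched as follows. This Python illustration is not the authors' script. It assumes that the removal fraction applies to the values that are actually observed, and it aligns removed sequences to fixed block boundaries so that they cannot overlap; both are simplifications chosen for this sketch. The figure of 96 observations per day follows from the 15 min sampling interval.

```python
import random

OBS_PER_DAY = 96  # 15-minute sampling gives 4 x 24 observations per day


def remove_points(series, fraction, rng):
    """Return a copy with `fraction` of the observed values set to None."""
    observed = [i for i, v in enumerate(series) if v is not None]
    n_remove = round(fraction * len(observed))
    out = list(series)
    for i in rng.sample(observed, n_remove):
        out[i] = None
    return out


def remove_sequences(series, n_sequences, length, rng):
    """Return a copy with `n_sequences` non-overlapping runs of `length` values set to None.

    Runs are aligned to block boundaries (a simplification), so they never overlap.
    """
    out = list(series)
    n_blocks = len(series) // length
    for b in rng.sample(range(n_blocks), n_sequences):
        for i in range(b * length, (b + 1) * length):
            out[i] = None
    return out


# Hypothetical 30-day toy series; a seeded generator makes the run repeatable.
rng = random.Random(42)
series = [float(i) for i in range(OBS_PER_DAY * 30)]
points_removed = remove_points(series, 0.20, rng)                 # as in Simulation 1
days_removed = remove_sequences(series, 10, OBS_PER_DAY, rng)     # as in Simulation 4
```

A week-long removal, as in Simulation 5, would use a run length of `7 * OBS_PER_DAY`, and each scheme would be repeated 100 times with fresh random draws.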

For each simulation, we then calculated the root-mean-square error (RMSE) and the proportion of reconstructed data within the precision interval of the nitrate sensor (PWPI), i.e., ±10% for readings > 20 µmol/L and ±2 µmol/L for readings < 20 µmol/L.
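The two performance measures can be written out as a short Python sketch. The function names are hypothetical, and one detail is an assumption because the text does not cover it: a reading of exactly 20 µmol/L is given the ±2 µmol/L tolerance. The tolerance is computed from the original (observed) value, following the description of the precision interval.

```python
import math


def rmse(observed, predicted):
    """Root-mean-square error between observed and reconstructed values."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def within_precision(observed, predicted):
    """True if `predicted` lies inside the sensor precision interval around `observed`.

    ±10% of the reading above 20 µmol/L, otherwise ±2 µmol/L
    (the handling of exactly 20 is an assumption).
    """
    tolerance = 0.10 * observed if observed > 20.0 else 2.0
    return abs(predicted - observed) <= tolerance


def pwpi(observed, predicted):
    """Proportion of reconstructed values within the precision interval."""
    hits = sum(within_precision(o, p) for o, p in zip(observed, predicted))
    return hits / len(observed)


# Hypothetical toy values in µmol/L.
observed = [5.0, 10.0, 30.0, 50.0]
predicted = [6.0, 13.0, 32.0, 56.0]
example_rmse = rmse(observed, predicted)
example_pwpi = pwpi(observed, predicted)
```

In the toy data, the first and third predictions fall inside the interval (errors of 1 and 2 against tolerances of 2 and 3), while the second and fourth do not, giving a PWPI of 0.5.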

2.4. Implementation

Simulation and imputation were performed with the base packages within the R statistical software [52]. Modelling was undertaken using the car [53], gam [54], mgcv [39], and forecast [55] packages. The R script used to implement the analyses is provided in the GitHub repository available online at https://github.com/Claire-K/nitrate_time_serie_reconstruction (accessed on 4 December 2021) and the Arikaree data are available from NEON [46,47,48,49] (see Table A1 for data product numbers).

3. Results

3.1. Arikaree River Case Study

Model performance varied according to the characteristics of the missing data. For demonstration purposes, we focus here on four different cases: (a) a 12-day sequence in which data were sporadically missing, (b) a one-day peak flow event containing sporadically missing data, (c) a full day of missing data, and (d) a three-month sequence of missing data (Figure 3). Our method performed well at predicting values of nitrate concentration where point observations were sporadically missing from the time series. In other words, the predicted values followed the pattern of the surrounding data closely and prediction intervals were narrow compared to the sensor precision interval (e.g., Figure 3a,b).

However, the method performed less well when periods of contiguously missing observations were reconstructed. For the single day of missing data, the daily pattern in nitrate concentration present in the surrounding data was not reconstructed, and the prediction interval of the reconstructed nitrate values increased with the number of missing observations (Figure 3c). This was also the case for the reconstruction of the three-month period of missing data (Figure 3d). However, some extremely high nitrate concentrations (~80 µmol/L) were predicted to occur during this period, based on the values of the covariates at the time, which had not been detected as anomalous by the data quality assurance and control process [56]. This demonstrated that the quality of the reconstructed data can depend heavily on the covariates, when available, and therefore, reliable performance of any anomaly detection method implemented prior to reconstruction is crucial.

We also found that reconstructed values of nitrate had much larger prediction intervals when ARIMA, rather than GAM, was used due to the presence of missing data in the covariate(s), which simply by chance would more often occur during contiguous sequences of missing nitrate data than during periods of similar length with sporadically missing nitrate data. Overall, the prediction interval for the 14,283 missing values in the nitrate time series ranged from 0.01 to 56.03 µmol/L, with a median of 1.34 µmol/L.

3.2. Simulation Study: Performance Evaluation

3.2.1. Simulations 1, 2, and 3: Missing Point Data

In terms of the reconstruction performance of our method, the RMSE values from simulations 1, 2, and 3 (20%, 30% and 40% of randomly missing point data in the nitrate time series, respectively) were all similar and rarely >0.2 µmol/L, even with 40% of the data having been removed (Figure 4a). Furthermore, the method predicted more than 95% of the missing nitrate values with an RMSE of 0.2 µmol/L. Nevertheless, as the proportion of missing data increased, so did the maximum RMSE. Overall, 72% of the reconstructed nitrate values were within the precision interval of the sensors (Figure 4b).

The performance of our method in reconstructing missing point data can also be demonstrated by looking more closely at different periods of the simulated Arikaree River time series, including typical baseflow and storm event behaviours of nitrate concentration. In all cases, the predicted data followed the pattern of nitrate concentration closely, including a peak event that occurred over a period of less than 24 h (Figure 5).

3.2.2. Simulations 4 and 5: Missing Sequences of Data

When reconstructing missing sequences of data in the simulated time series, performance declined as sequence duration increased (i.e., the RMSE increased and PWPI decreased; Figure 6). The median and third quartile of RMSE for simulations where 10 day-long sequences of data were randomly removed were 0.25 µmol/L and 0.44 µmol/L, respectively, compared with 0.75 µmol/L and 1.16 µmol/L, respectively, for simulations where 10 week-long sequences were randomly removed. For the median and third quartile PWPI, the one-day vs. one-week comparisons were 0.70 and 0.74 vs. 0.66 and 0.70, respectively.

We also observed that performance depended on whether GAM and ARIMA, GAM alone, or ARIMA alone was used for the reconstruction. When ARIMA was used, the amount of missing data present in the preceding period also impacted performance. For example, both GAM and ARIMA were required for a week-long reconstruction in early March 2019 (Figure 7a), but this week occurred just after a three-month period of missing data, such that the ARIMA (based on the previous 500 observations) was unable to perform well. The ARIMA always predicted a nitrate concentration of 4.5 µmol/L for missing data in the week-long sequence, whereas the GAM predictions followed the actual concentrations closely. This was also the case when GAM was used alone due to all covariates being available throughout the week-long sequence (Figure 7c).

In the case where ARIMA was used after a period with little to no missing values (Figure 7b), almost all real values of nitrate concentration were within the prediction intervals of the reconstructed data. However, the nitrate prediction interval increased as the number of timestamps into the future increased.
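The widening of the prediction interval with forecast horizon is a general property of autoregressive forecasts. It can be illustrated with a stationary AR(1) process, which is chosen here purely for simplicity; the orders of the ARIMA models actually selected in the study are not stated, so the derivation below is an illustration under that assumption and not a description of the fitted models.

```latex
% Stationary AR(1) with mean \mu, coefficient |\phi| < 1 and noise variance \sigma^2
Y_t - \mu = \phi\,(Y_{t-1} - \mu) + \varepsilon_t,
  \qquad \varepsilon_t \sim N(0, \sigma^2)
% h-step-ahead point forecast from the last observed value Y_T
\hat{Y}_{T+h \mid T} = \mu + \phi^{h}\,(Y_T - \mu)
% Forecast error variance grows monotonically with the horizon h
\operatorname{Var}\!\left(Y_{T+h} - \hat{Y}_{T+h \mid T}\right)
  = \sigma^2 \sum_{j=0}^{h-1} \phi^{2j}
  = \sigma^2\,\frac{1 - \phi^{2h}}{1 - \phi^{2}}
% Approximate 95% prediction interval
\hat{Y}_{T+h \mid T} \pm 1.96\,\sigma \sqrt{\frac{1 - \phi^{2h}}{1 - \phi^{2}}}
```

Under this simplified model the interval widens with h and the point forecast decays towards the mean μ, so long gaps are reconstructed with both greater uncertainty and a flatter predicted series.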

4. Discussion

Data from low-cost, in situ water quality sensors provide unprecedented opportunities to better understand spatial and temporal water quality dynamics. However, in situ sensors are prone to technical issues, which presents a challenge for the processing and analysis of environmental data. The study presented here demonstrates that it is possible to predict these missing data for reconstruction of high-frequency environmental time series using appropriate statistical methods. To our knowledge, our study is among the first to reconstruct missing nitrate data from high-frequency data collected by in situ sensors. This may be in part due to the relatively recent, standard use of such sensors for measuring nitrate concentrations in river networks and makes comparison of our findings with other studies and methods difficult. Reconstruction of high-frequency runoff data showed that a new machine learning method, “nu-support vector machines,” outperformed other machine learning methods [57]. Performance of the new method was evaluated in terms of the correlation (R²) between observed and simulated data, with their method achieving values between 0.75 and 0.95. Applying the same R² coefficient to our simulations, we achieved R² values between 0.976 and 0.997 when 40% of the dataset was removed and between 0.119 and 0.994 (first quartile = 0.84) when sequences of 10 days were removed, indicating that our method attains a comparatively good performance, particularly for point and short periods of missing data. Blending ARIMA forecasts and backcasts has also shown promise for reconstruction of sensor-based water quality data, including temperature, pH, specific conductance, and dissolved oxygen [58]. However, we are unable to compare our results with this work given that performance of the correction method was assessed by comparing the ARIMA-based results with corrections done manually by technicians.
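For reference, the R² comparison described above can be computed as the squared Pearson correlation between observed and reconstructed values. Reading R² this way is an assumption based on the phrase “correlation (R²)”; the Python function below is a hypothetical sketch and not the authors' code.

```python
def r_squared(observed, predicted):
    """Squared Pearson correlation between observed and reconstructed values."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return (cov * cov) / (var_o * var_p)


# Hypothetical toy values: a perfectly linear case and a partly shuffled one.
r2_linear = r_squared([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
r2_shuffled = r_squared([1.0, 2.0, 3.0, 4.0], [1.0, 3.0, 2.0, 4.0])
```

Because this measure is insensitive to constant offsets and scaling, a high R² does not by itself guarantee that reconstructed values fall within the sensor precision interval, which is why RMSE and PWPI are reported alongside it.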

Our method was able to predict all missing values present in a real time series of nitrate concentration data. Although the prediction intervals for some predicted values were relatively wide, the median prediction interval was narrow (1.34 µmol/L nitrate), indicating that many missing data had a 95% prediction interval narrower than the sensor accuracy (i.e., at least 2 µmol/L) and, therefore, were precise enough for the intended use of the data. We also showed via simulation that even when 40% of the initial dataset (point observations) was missing, our method was able to accurately recover approximately 70% of the data. When day-long sequences of contiguously missing data were simulated, mimicking, for example, a persistent sensor outage or prolonged periods of quality-flagged data, performance of the method was similarly efficient. However, for week-long periods of missing data, the percentage of accurately recovered data decreased, indicating that data reconstruction is more impacted by long sequences of missing data in a row than by multiple but sporadically missing data.

Consideration of different periods of the nitrate concentration time series provided insight into the overall utility of the method and why the method may not accurately reconstruct all missing data. For example, excessive nitrate can create eutrophication issues in aquatic systems and, therefore, for the purposes of environmental management, it is important to know (i) whether the presence of high nitrate concentrations is real or anomalous, and (ii) that accurate reconstruction of the real concentrations can be achieved in a timely fashion, particularly during floods. This appears possible with the method we have developed, given that missing values during periods of sudden rises and falls in nitrate concentration were predicted accurately (Figure 3b and Figure 5c). However, when environmental covariates from co-located sensors were not available, then reconstruction relied on ARIMA, for which prediction performance was inferior to that of GAM. This finding indicates the importance of having high-frequency sensors that can collect other environmental and water quality data besides nitrate concentration at collection sites and is in accordance with results from a study of daily streamflow data [25] that found increased accuracy in the reconstruction of missing data when multiple input variables were included.

The method presented here was developed with the objective of being able to reconstruct data that are missing, for example, due to their removal after being determined as technical anomalies from environmental data collected by high-frequency in situ sensors, using nitrate concentration data from Arikaree River. The method has currently only been applied in a binary fashion depending on the existence of covariates (other environmental data that can be used as predictors). For any one missing nitrate observation, GAM was used to predict the observation when data for all covariates were available, and ARIMA was used when data for at least one of the covariates were missing. Several avenues of study are envisaged from this work. First, future work could aim to develop a method whereby one or more environmental variables could be used as the covariate(s) according to their availability, such that ARIMA is only used when no covariate data are available. A second avenue would be to use different types of models in the framework, bearing in mind that the characteristics of time series data can influence the forecasting method that should be run. Other methods that may be suitable for the particular characteristics of water quality time series include seasonal autoregressive integrated moving average (SARIMA) for seasonal data or deep learning methods such as long short-term memory (LSTM) networks. Finally, future research could seek to confirm the applicability of our method to other sites and environmental data in order to generalize the framework.

5. Conclusions

Measurement errors or missing observations are recurrent and, in some cases, may reduce user perception of data quality, thereby preventing data from underpinning management actions. Here, we developed a method to successfully reconstruct missing nitrate concentration data from high-frequency in situ sensors in fresh waters, thereby adding value to the literature on anomaly detection and fulfilling a critical management need in the environmental domain. To mimic sporadically missing observations, both point data and sequences of data were removed from a two-year time series of nitrate concentration data. In 72% of cases with missing point data, predicted values were within the sensor precision interval of the original value, although the predictive ability declined when sequences of missing data occurred. The models also had stronger predictive ability when other water variables (covariates) were available. This suggests there may be advantages to deploying co-located sensors to measure covariates, even when there is a single constituent of concern, such as nitrate, by enabling a more reliable reconstruction of the nitrate time series. Our study is an important first step towards environmental data reconstruction in the information age and sets a benchmark against which future datasets and methodological developments can be compared. While we believe the general methodology presented here is generalizable to rivers in other ecosystems [59], the relationships between other water quality variables of interest may differ. Thus, future research should also focus on understanding these relationships so that co-located sensors can be optimally deployed. This will ensure that near real-time water quality data produced by low-cost in situ sensors are trustworthy and reliable enough to underpin data-enabled management decisions.

Author Contributions

Conceptualization, E.E.P. and K.M.; methodology, C.K., R.J.H., K.M. and B.L.; formal analysis, C.K. and R.J.H.; resources, G.L. and J.B.J.; writing—original draft preparation, C.K., E.E.P., G.L., J.B.J. and C.L.; writing—review and editing, all authors; visualization, C.K. and C.L.; supervision, C.L., E.E.P. and R.J.H.; funding acquisition, B.L. and K.M. All authors have read and agreed to the published version of the manuscript.

Funding

Funding was provided by the Energy Environment Solutions (E2S-UPPA) consortium through an international research chair. This study was part of a project funded by an Australian Research Council (ARC) Linkage grant (LP180101151) “Revolutionising water-quality monitoring in the information age”.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data are available through the NEON website https://www.neonscience.org/ (accessed on 4 December 2021). The National Ecological Observatory Network is a program sponsored by the National Science Foundation and operated under cooperative agreement by Battelle Memorial Institute. This material is based, in part, upon work supported by the National Science Foundation through the NEON Program.

Acknowledgments

The authors acknowledge and thank the Queensland Department of Environment and Science, and, in particular, the Water Quality and Investigations team for their valuable discussions regarding the ARC Linkage project LP180101151. We extend our thanks to all those involved across the project as a whole.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

Table A1. NEON data. Details on sensors, variables collected, units of measurement, associated data collection intervals, and the NEON data product number for data used in this study.

Sensor	Water-Quality Variable	Unit	Published Data Resolution	Published Interval (min)	Product Number
SUNA V2	Nitrate	µmol/L	0.1	15	DP1.20033.001
YSI EXO Optical Dissolved Oxygen	Dissolved oxygen	mg/L	0.01	1	DP1.20288.001
Level TROLL 500	Water elevation	masl	0.01	5	DP1.20016.001
YSI EXO Turbidity	Turbidity	FNU	0.01	1	DP1.20288.001
YSI EXO Conductivity and Temperature	Specific conductance	µS/cm	0.01	1	DP1.20288.001
Platinum Resistance Thermometer	Water temperature	°C	0.01	1	DP1.20053.001

References

Adu-Manu, K.S.; Tapparello, C.; Heinzelman, W.; Katsriku, F.; Abdulai, J.-D. Water Quality Monitoring Using Wireless Sensor Networks. ACM Trans. Sens. Netw. 2017, 13, 1–41.
Katsriku, F.A.; Wilson, M.; Yamoah, G.G.; Abdulai, J.-D.; Rahman, B.M.A.; Grattan, K.T.V. Framework for Time Relevant Water Monitoring System; Springer: Singapore, 2015; pp. 3–19.
Jones, A.S.; Horsburgh, J.; Reeder, S.L.; Ramírez, M.; Caraballo, J. A data management and publication workflow for a large-scale, heterogeneous sensor network. Environ. Monit. Assess. 2015, 187, 1–19.
Park, J.; Kim, K.T.; Lee, W.H. Recent Advances in Information and Communications Technology (ICT) and Sensor Technology for Monitoring Water Quality. Water 2020, 12, 510.
Cawley, K.M. NEON Algorithm Theoretical Basis Document (ATBD); Technical Report; National Ecological Observatory Network: Boulder, CO, USA, 2018.
Pellerin, B.A.; Stauffer, B.A.; Young, D.A.; Sullivan, D.J.; Bricker, S.B.; Walbridge, M.R.; Clyde, G.A., Jr.; Shaw, D.M. Emerging tools for continuous nutrient monitoring networks: Sensors advancing science and water resources protection. J. Am. Water Resour. Assoc. 2016, 52, 993–1008.
Leigh, C.; Alsibai, O.; Hyndman, R.J.; Kandanaarachchi, S.; King, O.C.; McGree, J.M.; Neelamraju, C.; Strauss, J.; Talagala, P.D.; Turner, R.D.; et al. A framework for automated anomaly detection in high frequency water-quality data from in situ sensors. Sci. Total Environ. 2019, 664, 885–898.
Rodriguez-Perez, J.; Leigh, C.; Liquet, B.; Kermorvant, C.; Peterson, E.; Sous, D.; Mengersen, K. Detecting Technical Anomalies in High-Frequency Water-Quality Data Using Artificial Neural Networks. Environ. Sci. Technol. 2020, 54, 13719–13730.
Liu, J.; Wang, P.; Jiang, D.; Nan, J.; Zhu, W. An integrated data-driven framework for surface water quality anomaly detection and early warning. J. Clean. Prod. 2020, 251, 119145.
Shi, B.; Wang, P.; Jiang, J.; Liu, R. Applying high-frequency surrogate measurements and a wavelet-ANN model to provide early warnings of rapid surface water quality anomalies. Sci. Total Environ. 2018, 610–611, 1390–1399.
Batista, G.; Monard, M.C. An analysis of four missing data treatment methods for supervised learning. Appl. Artif. Intell. 2003, 17, 519–533.
Dong, Y.; Peng, C.-Y.J. Principled missing data methods for researchers. SpringerPlus 2013, 2, 1–17.
Hannaford, J.; Buys, G. Trends in seasonal river flow regimes in the UK. J. Hydrol. 2012, 475, 158–174.
Helsel, D.R.; Hirsch, R.M. Statistical Methods in Water Resources; Elsevier: Amsterdam, The Netherlands, 1992; Volume 49.
Slater, L.; Villarini, G. On the impact of gaps on trend detection in extreme streamflow time series. Int. J. Clim. 2017, 37, 3976–3983.
Hirsch, R.M. An evaluation of some record reconstruction techniques. Water Resour. Res. 1979, 15, 1781–1790.
Wallis, J.R.; Lettenmaier, D.P.; Wood, E. A daily hydroclimatological data set for the continental United States. Water Resour. Res. 1991, 27, 1657–1663.
Raman, H.; Mohan, S.; Padalinathan, P. Models for extending streamflow data: A case study. Hydrol. Sci. J. 1995, 40, 381–393.
Woodhouse, C.A.; Gray, S.T.; Meko, D.M. Updated streamflow reconstructions for the Upper Colorado River Basin. Water Resour. Res. 2006, 42, W05415.
Amisigo, B.A.; Van De Giesen, N.C. Using a spatio-temporal dynamic state-space model with the EM algorithm to patch gaps in daily riverflow series. Hydrol. Earth Syst. Sci. 2005, 9, 209–224.
Khalil, M.; Panu, U.; Lennox, W. Groups and neural networks based streamflow data infilling procedures. J. Hydrol. 2001, 241, 153–176.
Elshorbagy, A.; Simonovic, S.; Panu, U. Estimation of missing streamflow data using principles of chaos theory. J. Hydrol. 2002, 255, 123–133.
Coulibaly, P.; Baldwin, C.K. Nonstationary hydrological time series forecasting using nonlinear dynamic methods. J. Hydrol. 2005, 307, 164–174.
Harvey, C.L.; Dixon, H.; Hannaford, J. An appraisal of the performance of data-infilling methods for application to daily mean river flow records in the UK. Hydrol. Res. 2012, 43, 618–636.
Tencaliec, P.; Favre, A.-C.; Prieur, C.; Mathevet, T. Reconstruction of missing daily streamflow data using dynamic regression models. Water Resour. Res. 2015, 51, 9447–9463.
Zhao, L.; Zheng, F. Missing Data Reconstruction Using Adaptively Updated Dictionary in Wireless Sensor Networks. In Proceedings of the 7th International Conference on Computer Engineering and Networks—PoS(CENet2017), Sissa Medialab, Shanghai, China, 22–23 July 2017; Volume 299, p. 40.
Pan, L.; Li, J. A multiple-regression-model-based missing values imputation algorithm in wireless sensor network. J. Comput. Res. Dev. 2009, 46, 2101.
Wu, H.; Xian, J.; Wang, J.; Khandge, S.; Mohapatra, P. Missing data recovery using reconstruction in ocean wireless sensor networks. Comput. Commun. 2018, 132, 1–9.
Lee, C.-M.; Ko, C.-N. Time series prediction using RBF neural networks with a nonlinear time-varying evolution PSO algorithm. Neurocomputing 2009, 73, 449–460.
Jeong, S.; Ferguson, M.; Hou, R.; Lynch, J.P.; Sohn, H.; Law, K.H. Sensor data reconstruction using bidirectional recurrent neural network with application to bridge monitoring. Adv. Eng. Inform. 2019, 42, 100991.
Camargo, J.A.; Alonso, A.; Salamanca, A. Nitrate toxicity to aquatic animals: A review with new data for freshwater invertebrates. Chemosphere 2005, 58, 1255–1267.
Boesch, D.; Boynton, W.R.; Crowder, L.B.; Diaz, R.J.; Howarth, R.W.; Mee, L.D.; Nixon, S.W.; Rabalais, N.N.; Rosenberg, R.; Sanders, J.G.; et al. Nutrient Enrichment Drives Gulf of Mexico Hypoxia. Eos 2009, 90, 117–118.
Bricker, S.; Longstaff, B.; Dennison, W.; Jones, A.; Boicourt, K.; Wicks, C.; Woerner, J. Effects of nutrient enrichment in the nation’s estuaries: A decade of change. Harmful Algae 2008, 8, 21–32.
Camargo, J.; Ward, J. Nitrate (NO3-N) toxicity to aquatic life: A proposal of safe concentrations for two species of nearctic freshwater invertebrates. Chemosphere 1995, 31, 3211–3216.
Davidson, J.; Good, C.; Williams, C.; Summerfelt, S.T. Evaluating the chronic effects of nitrate on the health and performance of post-smolt Atlantic salmon Salmo salar in freshwater recirculation aquaculture systems. Aquac. Eng. 2017, 79, 1–8.
Moore, A.; Bringolf, R.B. Effects of nitrate on freshwater mussel glochidia attachment and metamorphosis success to the juvenile stage. Environ. Pollut. 2018, 242, 807–813.
Leigh, C.; Burford, M.A.; Connolly, R.M.; Olley, J.M.; Saeck, E.; Sheldon, F.; Smart, J.C.; Bunn, S.E. Science to Support Management of Receiving Waters in an Event-Driven Ecosystem: From Land to River to Sea. Water 2013, 5, 780–797.
O’Brien, K.R.; Weber, T.R.; Leigh, C.; Burford, M. Sediment and nutrient budgets are inherently dynamic: Evidence from a long-term study of two subtropical reservoirs. Hydrol. Earth Syst. Sci. 2016, 20, 4881–4894.
Fisher, S.G.; Grimm, N.B.; Martí, E.; Holmes, R.M.; Jones, J.J.B. Material Spiraling in Stream Corridors: A Telescoping Ecosystem Model. Ecosystems 1998, 1, 19–34.
Wood, S.N. Generalized Additive Models: An Introduction with R; CRC Press: Boca Raton, FL, USA, 2017.
Sakamoto, Y.; Ishiguro, M.; Kitagawa, G. Akaike Information Criterion Statistics; D. Reidel: Dordrecht, The Netherlands, 1986.
Box, G.E.P.; Pierce, D.A. Distribution of Residual Autocorrelations in Autoregressive-Integrated Moving Average Time Series Models. J. Am. Stat. Assoc. 1970, 65, 1509–1526.
Vance, J.; Nance, B.; Monahan, D.; Mahal, M.; Cavileer, M. NEON Preventive Maintenance Procedure: AIS Surface Water Quality Multisonde. NEON.DOC.001569 Revision: B, Technical Report, National Ecological Observatory Network 2019. Available online: http://data.neonscience.org/documents (accessed on 4 December 2021).
Willingham, R.; Cavileer, M.; Csavina, J.; Monahan, D. NEON Preventive Maintenance Procedure: Submersible Ultraviolet Nitrate analyzer. NEON.DOC.002716 Revision: B, Technical Report, National Ecological Observatory Network 2019. Available online: http://data.neonscience.org/documents (accessed on 4 December 2021).
Cawley, K.M. NEON Aquatic Sampling Strategy, Technical Report, National Ecological Observatory Network 2016. Available online: http://data.neonscience.org/documents (accessed on 4 December 2021).
NEON, Nitrate in Surface Water (DP1.20033.001) (2021). Available online: https://data.neonscience.org/data-products/DP1.20033.001/RELEASE-2021 (accessed on 4 December 2021).
NEON, Water Quality (DP1.20288.001) (2021). Available online: https://data.neonscience.org/data-products/DP1.20288.001/RELEASE-2021 (accessed on 4 December 2021).
NEON, Temperature (PRT) in Surface Water (DP1.20053.001) (2021). Available online: https://data.neonscience.org/data-products/DP1.20053.001/RELEASE-2021 (accessed on 4 December 2021).
NEON, Elevation of Surface Water (DP1.20016.001) (2021). Available online: https://data.neonscience.org/data-products/DP1.20016.001/RELEASE-2021 (accessed on 4 December 2021).
Leigh, C.; Kandanaarachchi, S.; McGree, J.M.; Hyndman, R.J.; Alsibai, O.; Mengersen, K.; Peterson, E.E. Predicting sediment and nutrient concentrations from high-frequency water-quality data. PLoS ONE 2019, 14, e0215503.
Box, G.E.P.; Cox, D.R. An Analysis of Transformations. J. R. Stat. Soc. Ser. B 1964, 26, 211–243.
Ratnasingham, S.; Hebert, P.D. BOLD: The Barcode of Life Data System (http://www.barcodinglife.org). Mol. Ecol. Notes 2007, 7, 355–364.
Fox, J.; Weisberg, S.D.; Adler, D.; Bates, D.; Baud-Bovy, G.; Ellison, S. Package ‘car’; R Foundation for Statistical Computing: Vienna, Austria, 2012.
Hastie, T. Gam: Generalized Additive Models, R Package Version 1.20. 2020. Available online: https://CRAN.R-project.org/package=gam (accessed on 4 December 2021).
Wickham, H.; Chang, W. Ggplot2: An Implementation of the Grammar of Graphics. R Package Version 0.7. Available online: http://CRAN.R-project.org/package=ggplot2 (accessed on 10 September 2021).
Taylor, J.; Street, S.; Sturtevant, C. NEON Algorithm Theoretical Basis Document (ATBD): QA/QC Plausibility Testing. NEON.DOC.011081 Revision: C, Technical Report, National Ecological Observatory Network 2020. Available online: https://data.neonscience.org/api/v0/documents/NEON.DOC.011081vC (accessed on 4 December 2021).
Langhammer, J.; Česák, J. Applicability of a Nu-Support Vector Regression Model for the Completion of Missing Data in Hydrological Time Series. Water 2016, 8, 560.
Jones, T.L.; Jones, A.S.; Horsburgh, J. Toward automating post processing of aquatic sensor data. EarthArXiv Prepr. 2021.
Kermorvant, C.; Liquet, B.; Litt, G.; Mengersen, K.; Peterson, E.E.; Hyndman, R.J.; Jones, J.B., Jr.; Leigh, C. Understanding links between water-quality variables and nitrate concentration in freshwater streams using high-frequency sensor data. arXiv 2021, arXiv:2106.01719.

Figure 1. Reconstruction method. Flow chart of the method to predict the values of missing observations in high-frequency sensor data.

Figure 2. Arikaree River nitrate data. Black points represent the original nitrate observations, grey shading represents the precision interval of the sensor.

Figure 3. Reconstruction of Arikaree nitrate data. Green points represent the nitrate concentration values predicted by the reconstruction method, along with intervals of prediction, for (a) a 12-day sequence of sporadically missing data, (b) a one-day peak flow event containing sporadically missing data, (c) a full day of contiguously missing data, and (d) a three-month sequence of contiguously missing data. Black points represent the original nitrate observations, grey shading represents the precision interval of the sensor. Prediction intervals may not be visible when they are narrow relative to the precision interval of the sensor.

Figure 4. Performance evaluation: reconstructing missing point data. Boxplots of (a) root-mean-square error (RMSE) and (b) the proportion of reconstructed data within the precision interval (PWPI) for different amounts of randomly removed point observations, simulated 100 times.

Figure 5. Performance evaluation: missing point data examples. Examples for different periods of randomly removed point observations, simulated 100 times: (a) one week, (b) one month, and (c) a nitrate event in which concentrations rose rapidly in less than 24 h. Dark points represent the real nitrate concentration value and the grey shading around those points represents the precision interval of the sensor. Green points and shading are the predicted values along with the prediction interval. Prediction intervals may not be visible when they are narrow relative to the precision interval of the sensor.

Figure 6. Performance evaluation: reconstructing sequences of missing data. Boxplots of (a) root-mean-square error (RMSE) and (b) the proportion of reconstructed data within the precision interval (PWPI) for different amounts of randomly removed point observations, simulated 100 times. Note that the y-axis on plot (a) has been truncated at 4 µmol/L (one extreme RMSE value of 45.23 µmol/L for the one-week simulations not shown).

Figure 7. Performance evaluation: missing sequential data examples. Examples for different periods of randomly removed one-week sequences of observations, simulated 100 times, where data were reconstructed using (a) GAM and ARIMA, (b) ARIMA only, and (c) GAM only. Dark points represent the real nitrate concentration value and the shaded area around those points is the precision interval of the sensor. Green points and shading are the predicted values along with the prediction interval. Prediction intervals may not be visible when they are narrow relative to the precision interval of the sensor.

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Kermorvant, C.; Liquet, B.; Litt, G.; Jones, J.B.; Mengersen, K.; Peterson, E.E.; Hyndman, R.J.; Leigh, C. Reconstructing Missing and Anomalous Data Collected from High-Frequency In-Situ Sensors in Fresh Waters. Int. J. Environ. Res. Public Health 2021, 18, 12803. https://doi.org/10.3390/ijerph182312803

