Elsevier

Journal of Hydrology

Volume 590, November 2020, 125206

Research papers
A physical process and machine learning combined hydrological model for daily streamflow simulations of large watersheds with limited observation data

https://doi.org/10.1016/j.jhydrol.2020.125206

Highlights

  • A process-based model was used to solve the insufficient training problem.

  • A computer vision (CV) method was used to improve the robustness of data-driven models.

  • A categorization method was used to improve the peak and low flow simulations.

  • Different modeling approaches were compared to improve hydrological simulations.

Abstract

Physically based distributed hydrological models are effective in hydrological simulations of large river basins, but the complex characteristics of hydrological features limit their application. An easy-to-use and high-efficiency hydrological model is needed for efficient water resource management in practice. Machine learning (ML)-based models have the potential to provide fast mapping pathways between meteorological predictors and hydrological responses without detailed descriptions of the corresponding physical processes. However, extensive data requirements, neglect of spatial variability and poor performance for extreme flows limit the application of ML models. This study attempts to develop an ML-based hydrological model by combining a physically based distributed hydrological model with an artificial neural network (ANN), computer vision (CV) and a categorization approach (CA). To solve the insufficient training problem, we use a physically based distributed hydrological model (GBHM) together with a stochastic rainfall generator to generate a large amount of synthetic data (GBHM-ANN). To improve the extreme flow simulation, we add the categorization approach to GBHM-ANN (GBHM-ANN-CA). To capture the spatial variability of the predictors, we also use a local binary pattern-based computer vision method to form the GBHM-ANN-CA-CV model. The effectiveness of the three modeling approaches is demonstrated by synthetic case studies. We finally evaluate GBHM-ANN-CA-CV using real data from the upper Chao Phraya Basin in Thailand. The results show that the prediction accuracy of the new data-driven model is greatly improved in data-limited watersheds. Specifically, the spatial information extracted by CV can improve the robustness of the data-driven hydrological model, and the CA can greatly improve high flow simulations. The combined model yields satisfactory accuracy for long-term daily streamflow simulations. This study demonstrates the potential of ML-based hydrological models in water resource management, especially in changing environments.

Introduction

Physically based hydrological models have been widely used for streamflow simulations and forecasting (Arnold and Fohrer, 2005, Chen et al., 2011, Jia et al., 2001, Refsgaard, 1997, Yang et al., 2002). Among the various kinds of physically based models, fully distributed hydrological models, which can consider the spatial variability of the watershed landscape and its atmospheric forcing, are considered to be the gold standard for hydrological modeling (Chen et al., 2016) and have been applied in ungauged basins (Götzinger and Bárdossy, 2007, Hunukumbura et al., 2012, Wambura et al., 2018). However, with complex model structures and extensive calculation requirements, physically based distributed hydrological models often have high computational costs and require high levels of hydrological expertise for modelers and users, thus limiting their application in water resource management (Chen et al., 2016, Kratzert et al., 2018, Srivastav et al., 2007, Yaseen et al., 2019, Zhang et al., 2018a, Zhang et al., 2018b).

Machine learning (ML)-based methods can recognize patterns hidden in historical data, and they may provide quick and direct mapping pathways between predictors and hydrological responses without explicit descriptions of the underlying physical processes (Adnan et al., 2019, Kasiviswanathan et al., 2016, Sahoo et al., 2017). Many studies have demonstrated that ML models can outperform other state-of-the-art techniques in natural science fields, including hydrology (AlQuraishi, 2019, He et al., 2019, Kratzert et al., 2018, Kratzert et al., 2019a, Kratzert et al., 2019b, Zhang et al., 2020). Among the various ML models, neural networks are the most widely reported methods for streamflow simulations and forecasting. For example, Hsu et al. (1995) compared the runoff forecasting performance of a physically based model called the Sacramento Soil Moisture Accounting Model (SAC-SMA) to that of a three-layer artificial neural network (ANN) under various flow regimes, and the results showed that the ANN model provided a better representation of the rainfall-runoff relationship. Demirel et al. (2009) compared an ANN model with a process-based semi-distributed model called the Soil and Water Assessment Tool (SWAT) for streamflow forecasting one day in advance, and the comparisons showed that the ANN was more successful than SWAT in peak flow forecasting. Humphrey et al. (2016) combined a Bayesian artificial neural network with a physically based conceptual model called GR4J for monthly streamflow forecasting, and the results showed that both the hybrid model and the pure ANN model outperformed the GR4J conceptual hydrological model. Recently, deep learning (DL) neural network models, such as long short-term memory (LSTM) networks, have been reported to be suitable for rainfall-runoff process modeling. For example, Kratzert et al. (2019b) improved the standard LSTM architecture and proposed an entity-aware LSTM (EA-LSTM) for hydrological modeling that could learn catchment similarities and outperformed five physically based hydrological models.

However, current ML-based hydrological models suffer from several drawbacks. First, ML models often require a large amount of training data to obtain robust performance (Kalra and Ahmad, 2009, Kratzert et al., 2019b). This characteristic severely limits the applicability of ML in hydrological simulation and prediction because the majority of streams in the world lack long-term hydrological observations (Goswami et al., 2007, Kratzert et al., 2019a, Sivapalan, 2003). Jiang et al. (2018) developed a computer vision-based data-driven model for daily runoff simulations in a gauged basin and tested its transfer learning performance in an ungauged basin; their results showed that the overall performance of the model was satisfactory in the ungauged basin, but it could not accurately predict high flows. Kratzert et al. (2019a) added catchment characteristics as predictor variables of an LSTM model that was trained with data from a large number of gauged basins and explored the capability of this regionally trained model for predictions in ungauged basins. Although the ungauged LSTM model displayed better performance than traditional models, it still required large amounts of data from gauged basins with long-term observations for training. Karpatne et al. (2017) introduced the paradigm of physics-guided (or theory-guided) data science and showed that hydrological mechanism-guided methods are effective but need further study.

Second, traditional ML models often ignore the spatial variability in the watershed landscape and the related atmospheric forcing, and they aggregate data over the watershed as model inputs (Jiang et al., 2018). However, hydrological responses are often significantly impacted by the spatial patterns of driving forces and terrain characteristics (Chen et al., 2016, Jiang et al., 2018, Wang et al., 2015). Computer vision (CV) algorithms have made it possible to overcome this drawback. CV technology is used to extract information from images, which allows machines to understand images by processing digital signals (Wang et al., 2013). Instead of using the values associated with each pixel, CV uses features to quantitatively describe images in low dimensions. CV has been applied to some areas of Earth science and is reported to be able to effectively extract spatial information from images. For example, to classify crops in very high-resolution remote sensing images, Sun et al. (2020) proposed a method guided by hierarchical perception, a CV-based concept, and the model performed well and displayed high precision. With the help of CV, Ling et al. (2019) developed a convolutional neural network-based super-resolution mapping model that can effectively estimate the subpixel-scale details of rivers and the wetted width. By combining a convolutional neural network and an LSTM network, Miao et al. (2019) proposed a statistical downscaling method to improve the precipitation prediction of a general circulation model (GCM) in a monsoon region. Among various CV algorithms, the local binary pattern (LBP) is one of the most popular methods used in the field of pattern recognition (Gupta et al., 2020, He and Sang, 2013, Heikkilä et al., 2009, Jiang et al., 2018, Khan et al., 2020). LBP was first developed by Ojala et al. (1996) in a comparative study of texture measures; it is an image operator that is based on local pixel information and is complementary to contrast information (Gupta et al., 2020). LBP can potentially be used to capture watershed spatial patterns in hydrological modeling.
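As a rough illustration of the basic 3×3 LBP operator mentioned above, the following Python sketch thresholds the eight neighbors of each interior cell against the center value and packs the results into an 8-bit code; the normalized histogram of codes then serves as a low-dimensional descriptor of the grid. The function names, the neighbor ordering and the toy rainfall grid are assumptions made for illustration and are not the configuration used in this study.

```python
def lbp_codes(grid):
    """Return basic 3x3 LBP codes for the interior cells of a 2-D grid."""
    # Clockwise neighbor offsets starting at the top-left cell (assumed ordering).
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    rows, cols = len(grid), len(grid[0])
    codes = []
    for i in range(1, rows - 1):
        row_codes = []
        for j in range(1, cols - 1):
            center = grid[i][j]
            code = 0
            for bit, (di, dj) in enumerate(offsets):
                # Set the bit when the neighbor is at least as large as the center.
                if grid[i + di][j + dj] >= center:
                    code |= 1 << bit
            row_codes.append(code)
        codes.append(row_codes)
    return codes


def lbp_histogram(grid, bins=256):
    """Normalized histogram of LBP codes, usable as a feature vector."""
    hist = [0] * bins
    count = 0
    for row in lbp_codes(grid):
        for code in row:
            hist[code] += 1
            count += 1
    return [h / count for h in hist] if count else hist


# Hypothetical 4x4 daily rainfall grid (mm), for illustration only.
rain = [
    [0.0, 1.0, 2.0, 0.0],
    [3.0, 5.0, 1.0, 0.0],
    [0.0, 2.0, 4.0, 6.0],
    [1.0, 0.0, 0.0, 2.0],
]
codes = lbp_codes(rain)
features = lbp_histogram(rain)
```

In this sketch a local maximum (such as the 5.0 mm cell) receives code 0, and a perfectly uniform neighborhood receives code 255, so the histogram summarizes how rainfall is arranged in space rather than how much falls in total.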

Third, as statistical methods are purely based on observation data, ML-based hydrological models often provide poor simulations of high flows, which are important in practical applications (Sudheer et al., 2003, Wu et al., 2009, Yang et al., 2019). The frequency of high flows in streamflow time series is relatively low, which may lead to insufficient model training (Yang et al., 2019). Sudheer et al. (2003) proposed a data transformation method based on a statistical model to improve the peak flow estimation with an ANN. Since the underlying mechanisms of runoff generation may be quite different under various flow regimes, a single global ML model often fails to provide satisfactory predictions for both extreme high and low values and even normal values (Hsu et al., 1995, Solomatine and Ostfeld, 2008, Wu et al., 2009). Several researchers have attempted to improve the modeling performance by using categorization models, in which sub-processes are identified first and separate models are then built for each sub-process. To improve the prediction of high-magnitude flows, Sivapragasam and Liong (2005) proposed a method to classify flow into low, medium and high flow and built a support vector machine model for each type of flow. Wu et al. (2009) proposed a crisp distributed support vectors regression (CDSVR) model for monthly streamflow forecasts. They used the Fuzzy C-means clustering technique to split the flow data into three subsets and then fitted a separate support vector regression to each subset. Zemzami and Benaabidate (2016) and Tongal and Booij (2018) both tried to improve the ML model performance for peak flow simulations by separating the streamflow into baseflow and quick response flow. Chu et al. (2019) divided the streamflow time series into different flow regimes by a Fuzzy C-means method and developed data-driven models for each regime to map the nonlinear relationship between the selected predictors and streamflow. The categorization approach (CA) needs to be further studied in peak flow simulations with insufficient observation data.
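The general idea of a categorization approach, identifying regimes first and then fitting a separate model per regime, can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: regimes are assigned by fixed thresholds on a single predictor (so the same rule can be applied at prediction time), each sub-model is an ordinary least-squares line rather than an ANN or support vector machine, and all names, thresholds and data are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit y = a + b*x for one regime."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        return mean_y, 0.0  # degenerate case: constant predictor
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    return mean_y - b * mean_x, b


def regime_of(x, thresholds):
    """Index of the regime that x falls into (0 = low, 1 = medium, 2 = high)."""
    return sum(x >= t for t in thresholds)


def fit_categorized(xs, ys, thresholds):
    """Fit one sub-model per regime; regimes with no samples are left as None."""
    groups = [([], []) for _ in range(len(thresholds) + 1)]
    for x, y in zip(xs, ys):
        gx, gy = groups[regime_of(x, thresholds)]
        gx.append(x)
        gy.append(y)
    return [fit_line(gx, gy) if gx else None for gx, gy in groups]


def predict(x, models, thresholds):
    """Route the input to the sub-model of its regime and evaluate it."""
    model = models[regime_of(x, thresholds)]
    if model is None:
        raise ValueError("no sub-model was fitted for this regime")
    a, b = model
    return a + b * x


# Hypothetical data: the response steepens above a rainfall index of 10 and 30.
rain_index = [1, 4, 8, 12, 20, 28, 35, 40, 50]
flow = [2, 5, 9, 20, 36, 52, 110, 130, 170]
thresholds = [10, 30]  # assumed regime boundaries
models = fit_categorized(rain_index, flow, thresholds)
```

Because each regime gets its own parameters, the steep high-flow response is not averaged away by the far more numerous low-flow samples, which is the motivation given above for categorization models.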

As a physically based distributed hydrological model, the geomorphology-based hydrological model (GBHM) developed by Yang et al. (1998) has been successfully applied in many regions, such as the Yangtze River (Li et al., 2015), Mekong River (Wang et al., 2016) and Jiulong River (Lu et al., 2018). In this study, GBHM is used to provide sufficient training samples for ML models. With gradient-based optimization schemes, the traditional ANN model may converge prematurely and become trapped at local optima. In addition, the traditional model is very sensitive to the initial conditions, including the initial weights and biases (Yang et al., 2017, Zanchettin and Ludermir, 2007). To overcome these drawbacks, a genetic algorithm (GA) is used in this study to optimize the initial conditions of the ANN, which can help the model reach an optimal solution (Bahrami et al., 2016). Thus, the GA-ANN model is used for the hydrological simulation in this study.
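To make the idea of using a genetic algorithm (GA) to choose initial ANN weights more concrete, the sketch below evolves the weight vector of a very small network with a generic elitist GA (one-point crossover, Gaussian mutation). It is only an illustration: the network size, GA settings, function names and training pairs are all assumptions and do not reproduce the GA-ANN configuration used in this study.

```python
import math
import random


def forward(weights, x):
    """One-input network with two tanh hidden units and a linear output."""
    w1, b1, w2, b2, v1, v2, c = weights
    h1 = math.tanh(w1 * x + b1)
    h2 = math.tanh(w2 * x + b2)
    return v1 * h1 + v2 * h2 + c


def mse(weights, data):
    """Mean squared error of the network over (x, y) pairs."""
    return sum((forward(weights, x) - y) ** 2 for x, y in data) / len(data)


def ga_initial_weights(data, pop_size=30, generations=40, seed=0):
    """Search for a good starting weight vector with a simple elitist GA."""
    rng = random.Random(seed)
    n_weights = 7
    pop = [[rng.uniform(-1, 1) for _ in range(n_weights)]
           for _ in range(pop_size)]
    history = []  # best error at the start of each generation
    for _ in range(generations):
        pop.sort(key=lambda w: mse(w, data))
        history.append(mse(pop[0], data))
        elite = pop[: pop_size // 5]  # the best 20% survive unchanged
        children = []
        while len(elite) + len(children) < pop_size:
            p1, p2 = rng.sample(elite, 2)
            cut = rng.randrange(1, n_weights)  # one-point crossover
            child = p1[:cut] + p2[cut:]
            # Gaussian mutation applied to each gene with probability 0.2.
            child = [g + rng.gauss(0, 0.3) if rng.random() < 0.2 else g
                     for g in child]
            children.append(child)
        pop = elite + children
    pop.sort(key=lambda w: mse(w, data))
    history.append(mse(pop[0], data))
    return pop[0], history


# Hypothetical training pairs (scaled predictor, scaled response).
data = [(x / 10, math.sin(x / 10 * math.pi)) for x in range(-10, 11)]
best, history = ga_initial_weights(data)
```

In a GA-ANN workflow of this kind, `best` would be handed to a gradient-based trainer as its starting point; because the elite individuals are carried over unchanged, the best error in `history` cannot increase from one generation to the next.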

In this study, we develop a physical process and ML combined hydrological model for continuous daily hydrological simulations in data-limited watersheds located in northern Thailand. With rainfall data generated by a stochastic rainfall generator, a physically based distributed hydrological model is used to generate sufficient streamflow samples for training and validation in the data-insufficient basins. The spatial features are extracted from the predictor images by using CV algorithms and are mapped to the hydrological responses by categorization models. We strive to (1) verify the effectiveness of the distributed hydrological model in supporting the ML-based data-driven model in data-insufficient basins; (2) explore the use of spatial information and predictors extracted by CV; (3) investigate the effectiveness of the categorization approach in improving high flow simulations; and (4) test the applicability of the combined model using real data.

Section snippets

Study area

The study area (see Fig. 1), the Ping River Basin, is located in northern Thailand with a drainage area of 26,386 km²; this basin is upstream of the Bhumibol Reservoir, which is the largest reservoir in Thailand. The Ping River is one of the main tributaries of the Chao Phraya River and flows through Chiang Mai, which is the largest city in northern Thailand. The elevation of the basin ranges from 229 m in the south to 2572 m in the north, and the spatial heterogeneity of the topography and

Modeling approaches

To overcome the limitations of previous ML-based hydrological models, we propose three corresponding modeling approaches and then develop a physical process and machine learning combined hydrological model. In addition, we evaluate the effectiveness of the proposed modeling framework by comparing different data-driven models step by step.

Results

In this section, we present the calibration and validation results of the GBHM and compare the performance of four data-driven models to demonstrate the effectiveness of the modeling framework proposed in this study. The prediction capability of the proposed model is examined with the data observed from 1 January 2010 to 31 December 2016 in the study area.

Comparison of different modeling approaches

1) Effectiveness of the hydrological mechanism-guided modeling approach

To demonstrate the effectiveness of hydrological mechanism-guided modeling approaches, we compare the performances of the GBHM-ANN with those of a simple ANN. GBHM-ANN, which adopts sufficient synthetic data for training and validation, performs better at all three gauges. The NSE values of the simulation results at the three gauges are improved by an average of 23% during the test period. Specifically, with sufficient

Conclusion

This study proposed three approaches to improve the prediction capability of machine learning-based hydrological models and developed a physical process and ML combined hydrological model for rainfall-runoff simulation in data-insufficient watersheds. The model performance in learning the hydrological behaviors of the physically distributed hydrological model was evaluated with a synthetic dataset, and the prediction capability was investigated using real data. According to the results, we

CRediT authorship contribution statement

Shuyu Yang: Conceptualization, Methodology, Data curation, Writing - original draft, Writing - review & editing. Dawen Yang: Supervision, Conceptualization, Methodology, Writing - review & editing, Funding acquisition. Jinsong Chen: Supervision, Methodology, Writing - review & editing. Jerasorn Santisirisomboon: Resources. Weiwei Lu: Investigation. Baoxu Zhao: Investigation.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

This study was financially supported by the National Natural Science Foundation of China (project No. 41661144031). The authors would like to thank the Royal Irrigation Department of Thailand and Department of Land Development of Thailand for providing the historical observations, including the daily runoff data at three gauges (P.1, P73 and P12), the daily precipitation, the daily air temperature (mean, maxima and minima), the daily mean relative humidity, the daily sunshine duration and the

References (86)

  • S. Jiang et al. A computer vision-based approach to fusing spatiotemporal data for hydrological modeling. J. Hydrol. (2018)
  • K. Kasiviswanathan et al. Potential application of wavelet neural network ensemble to forecast streamflow for flood management. J. Hydrol. (2016)
  • Y. Kaya et al. 1D-local binary pattern based feature extraction for classification of epileptic EEG signals. Appl. Math. Comput. (2014)
  • W. Lu et al. Quantifying the impacts of small dam construction on hydrological alterations in the Jiulong River basin of Southeast China. J. Hydrol. (2018)
  • L. Mediero et al. Detection and attribution of trends in magnitude, frequency and timing of floods in Spain. J. Hydrol. (2014)
  • Q. Miao et al. Establishing a rainfall threshold for flash flood warnings in China’s mountainous areas based on a distributed hydrological model. J. Hydrol. (2016)
  • J.E. Nash et al. River flow forecasting through conceptual models part I - A discussion of principles. J. Hydrol. (1970)
  • T. Ojala et al. A comparative study of texture measures with classification based on featured distributions. Pattern Recogn. (1996)
  • Y. Qin. Impacts of climate warming on the frozen ground and eco-hydrology in the Yellow River source region, China. Sci. Total Environ. (2017)
  • J.C. Refsgaard. Parameterisation, calibration and validation of distributed hydrological models. J. Hydrol. (1997)
  • H. Tongal et al. Simulation and forecasting of streamflows using machine learning models coupled with base flow separation. J. Hydrol. (2018)
  • C. Wang et al. Markov random field modeling, inference & learning in computer vision & image understanding: a survey. Comput. Vis. Image Underst. (2013)
  • Q. Yang et al. Multi-label classification models for sustainable flood retention basins. Environ. Modell. Softw. (2012)
  • Q. Yang et al. Feature selection methods for characterizing and classifying adaptive Sustainable Flood Retention Basins. Water Res. (2011)
  • T. Yang. An enhanced artificial neural network with a shuffled complex evolutionary global optimization with principal component analysis. Inf. Sci. (2017)
  • Z.M. Yaseen et al. An enhanced extreme learning machine model for river flow forecasting: State-of-the-art, practical applications in water resource engineering area and future research direction. J. Hydrol. (2019)
  • J. Zhang. Large-scale baseflow index prediction using hydrological modelling, linear and multilevel regression approaches. J. Hydrol. (2020)
  • R.M. Adnan. Daily streamflow prediction using optimally pruned extreme learning machine. J. Hydrol. (2019)
  • D.W. Aha et al. Instance-based learning algorithms. Mach. Learn. (1991)
  • J.G. Arnold et al. SWAT2000: current capabilities and research opportunities in applied watershed modelling. Hydrol. Process. (2005)
  • M.H. Beale et al. Neural Network Toolbox™ user's guide. (1992)
  • Y. Chen et al. Improving flood forecasting capability of physically based distributed hydrological models by parameter optimization. Hydrol. Earth Syst. Sci. (2016)
  • Y. Chen et al. Liuxihe model and its modeling to river basin flood. J. Hydrol. Eng. (2011)
  • H. Chu et al. Streamflow prediction using LASSO-FCM-DBN approach based on hydro-meteorological condition classification. J. Hydrol. (2019)
  • FAO, IIASA, ISRIC, ISSCAS, JRC, 2012. Harmonized world soil database (version 1.2). FAO, Rome, Italy and IIASA,...
  • H.V. Gupta et al. Status of automatic calibration for hydrologic models: comparison with multilevel expert calibration. J. Hydrol. Eng. (1999)
  • S. He. Learning to predict the cosmological structure formation. Proc. Natl. Acad. Sci. U.S.A. (2019)
  • Y. He et al. Multi-ring local binary patterns for rotation invariant texture classification. Neural Comput. Appl. (2013)
  • K.L. Hsu et al. Artificial neural network modeling of the rainfall-runoff process. Water Resour. Res. (1995)
  • P.B. Hunukumbura et al. Distributed hydrological model transferability across basins with different hydro-climatic characteristics. Hydrol. Process. (2012)
  • A. Jarvis et al. Hole-Filled Seamless SRTM data V4. (2008)
  • Y. Jia et al. Development of WEP model and its application to an urban watershed. Hydrol. Process. (2001)
  • A. Kalra et al. Using oceanic-atmospheric oscillations for long lead time streamflow forecasting. Water Resour. Res. (2009)