Research papers
A physical process and machine learning combined hydrological model for daily streamflow simulations of large watersheds with limited observation data
Introduction
Physically based hydrological models have been widely used for streamflow simulations and forecasting (Arnold and Fohrer, 2005, Chen et al., 2011, Jia et al., 2001, Refsgaard, 1997, Yang et al., 2002). Among the various kinds of physically based models, fully distributed hydrological models, which can consider the spatial variability of the watershed landscape and its atmospheric forcing, are considered to be the gold standard for hydrological modeling (Chen et al., 2016) and have been applied in ungauged basins (Götzinger and Bárdossy, 2007, Hunukumbura et al., 2012, Wambura et al., 2018). However, with complex model structures and extensive calculation requirements, physically based distributed hydrological models often have high computational costs and require high levels of hydrological expertise for modelers and users, thus limiting their application in water resource management (Chen et al., 2016, Kratzert et al., 2018, Srivastav et al., 2007, Yaseen et al., 2019, Zhang et al., 2018a, Zhang et al., 2018b).
Machine learning (ML)-based methods can recognize patterns hidden in historical data, and they may provide quick and direct mapping pathways between predictors and hydrological responses without explicit descriptions of the underlying physical processes (Adnan et al., 2019, Kasiviswanathan et al., 2016, Sahoo et al., 2017). Many studies have demonstrated that ML models can outperform other state-of-the-art techniques in natural science fields, including hydrology (AlQuraishi, 2019, He et al., 2019, Kratzert et al., 2018, Kratzert et al., 2019a, Kratzert et al., 2019b, Zhang et al., 2020). Among the various ML models, neural networks are the most widely reported methods for streamflow simulations and forecasting. For example, Hsu et al. (1995) compared the runoff forecasting performance of a physically based model called the Sacramento Soil Moisture Accounting Model (SAC-SMA) to that of a three-layer artificial neural network (ANN) under various flow regimes, and the results showed that the ANN model provided a better representation of the rainfall-runoff relationship. Demirel et al. (2009) compared an ANN model with a process-based semi-distributed model called the Soil and Water Assessment Tool (SWAT) for streamflow forecasting one day in advance, and the comparisons showed that the ANN was more successful than SWAT in peak flow forecasting. Humphrey et al. (2016) combined a Bayesian artificial neural network with a physically based conceptual model called GR4J for monthly streamflow forecasting, and the results showed that both the hybrid model and the pure ANN model outperformed the GR4J conceptual hydrological model. Recently, deep learning (DL) neural network models, such as long short-term memory (LSTM) networks, have been reported to be suitable for rainfall-runoff process modeling. For example, Kratzert et al. (2019b) improved the standard LSTM architecture and proposed an entity-aware LSTM (EA-LSTM) for hydrological modeling that could learn catchment similarities and outperformed five physically based hydrological models.
However, current ML-based hydrological models suffer from several drawbacks. First, ML models often require a large amount of training data to obtain robust performance (Kalra and Ahmad, 2009, Kratzert et al., 2019b). This characteristic severely limits the applicability of ML in hydrological simulation and prediction because the majority of streams in the world lack long-term hydrological observations (Goswami et al., 2007, Kratzert et al., 2019a, Sivapalan, 2003). Jiang et al. (2018) developed a computer vision-based data-driven model for daily runoff simulations in a gauged basin and tested its transfer learning performance in an ungauged basin; their results showed that the overall performance of the model was satisfactory in the ungauged basin, but it could not accurately predict high flows. Kratzert et al. (2019a) added catchment characteristics as predictor variables of an LSTM model that was trained with data from a large number of gauged basins and explored the capability of this regionally trained model for predictions in ungauged basins. Although the ungauged LSTM model displayed better performance than traditional models, it still required large amounts of data from gauged basins with long-term observations for training. Karpatne et al. (2017) introduced physics-guided (or theory-guided) data science and showed that hydrological mechanism-guided methods are effective but need further study.
Second, traditional ML models often ignore the spatial variability in the watershed landscape and the related atmospheric forcing, and they aggregate data over the watershed as model inputs (Jiang et al., 2018). However, hydrological responses are often significantly impacted by the spatial patterns of driving forces and terrain characteristics (Chen et al., 2016, Jiang et al., 2018, Wang et al., 2015). Computer vision (CV) algorithms have made it possible to overcome this drawback. CV technology is used to extract information from images, which allows machines to understand images by processing digital signals (Wang et al., 2013). Instead of using the values associated with each pixel, CV uses features to quantitatively describe images in low dimensions. CV has been applied to some areas of Earth science and is reported to be able to effectively extract spatial information from images. For example, to classify crops in very high-resolution remote sensing images, Sun et al. (2020) proposed a method guided by hierarchical perception, a CV-based concept, and the model performed well and displayed high precision. With the help of CV, Ling et al. (2019) developed a convolutional neural network-based super-resolution mapping model that can effectively estimate the subpixel-scale details of rivers and the wetted width. By combining a convolutional neural network and an LSTM network, Miao et al. (2019) proposed a statistical downscaling method to improve the precipitation prediction of a general circulation model (GCM) in a monsoon region. Among various CV algorithms, the local binary pattern (LBP) is one of the most popular methods used in the field of pattern recognition (Gupta et al., 2020, He and Sang, 2013, Heikkilä et al., 2009, Jiang et al., 2018, Khan et al., 2020). LBP, first developed by Ojala et al. (1996) in a comparative study of texture measures, is an image operator based on local pixel information and is complementary to contrast information (Gupta et al., 2020). The LBP can potentially be used to capture the watershed spatial pattern in hydrological modeling.
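To make the LBP operator concrete, the following is a minimal Python sketch of the basic 3x3 form, in which each of the eight neighbours is thresholded against the centre pixel and the resulting bits are packed into one code. The function names, the clockwise bit order and the sample grid are illustrative assumptions; the excerpt does not state which LBP variant or neighbourhood the study uses.

```python
def lbp_code(img, r, c):
    """Basic 8-neighbour LBP code for the interior pixel at row r, column c."""
    center = img[r][c]
    # Clockwise neighbour offsets starting at the top-left pixel
    # (the bit order is an assumption made for this sketch).
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    code = 0
    for bit, (dr, dc) in enumerate(offsets):
        # A neighbour contributes a 1 when it is at least as large as the centre.
        if img[r + dr][c + dc] >= center:
            code |= 1 << bit
    return code


def lbp_histogram(img):
    """Normalised 256-bin histogram of LBP codes over all interior pixels."""
    hist = [0] * 256
    rows, cols = len(img), len(img[0])
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            hist[lbp_code(img, r, c)] += 1
    total = sum(hist)
    return [h / total for h in hist] if total else hist


# Hypothetical 3x3 gridded predictor (for example, daily precipitation in mm).
grid = [[5, 9, 1],
        [4, 6, 7],
        [2, 8, 3]]
print(lbp_code(grid, 1, 1))  # neighbours 9, 7 and 8 exceed the centre 6 -> 42
```

The histogram summarises a whole predictor image in a fixed-length vector, which is the kind of low-dimensional feature description the paragraph above refers to.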
Third, because they are statistical methods based purely on observation data, ML-based hydrological models often provide poor simulations of high flows, which are important in practical applications (Sudheer et al., 2003, Wu et al., 2009, Yang et al., 2019). The frequency of high flows in streamflow time series is relatively low, which may lead to insufficient model training (Yang et al., 2019). Sudheer et al. (2003) proposed a data transformation method based on a statistical model to improve the peak flow estimation with an ANN. Since the underlying mechanisms of runoff generation may be quite different under various flow regimes, a single global ML model often fails to provide satisfactory predictions for both extreme high and low values and even normal values (Hsu et al., 1995, Solomatine and Ostfeld, 2008, Wu et al., 2009). Several researchers have attempted to improve the modeling performance by using categorization models, in which sub-processes are identified first and separate models are built for each sub-process. To improve the prediction of high-magnitude flows, Sivapragasam and Liong (2005) proposed a method that classifies flow into low, medium and high flow and builds a support vector machine model for each type of flow. Wu et al. (2009) proposed a crisp distributed support vectors regression (CDSVR) model for monthly streamflow forecasts. They used the Fuzzy C-means clustering technique to split the flow data into three subsets and then fitted a separate support vector regression to each subset. Zemzami and Benaabidate (2016) and Tongal and Booij (2018) both tried to improve the ML model performance for peak flow simulations by separating the streamflow into a baseflow and a quick response flow. Chu et al. (2019) divided the streamflow time series into different flow regimes by a Fuzzy C-means method and developed data-driven models for each regime to map the nonlinear relationship between the selected predictors and streamflow. The categorization approach (CA) needs to be further studied in peak flow simulations with insufficient observation data.
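The categorization idea can be sketched in a few lines of Python: samples are first assigned to a flow regime, and a separate model is then fitted for each regime. This sketch makes several simplifying assumptions that are not taken from the cited studies: regimes are defined by two fixed thresholds on a single predictor (rather than Fuzzy C-means clustering), and each regime model is an ordinary least-squares line (rather than a support vector machine or ANN). All names and values are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a * x + b; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:
        return 0.0, my  # degenerate group: fall back to the mean
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return a, my - a * mx


def regime(x, low_cut, high_cut):
    """Assign a predictor value to the low, medium or high regime."""
    if x < low_cut:
        return "low"
    if x < high_cut:
        return "medium"
    return "high"


def fit_by_regime(xs, ys, low_cut, high_cut):
    """Fit one linear model per regime and return them keyed by regime name."""
    groups = {}
    for x, y in zip(xs, ys):
        gx, gy = groups.setdefault(regime(x, low_cut, high_cut), ([], []))
        gx.append(x)
        gy.append(y)
    return {name: fit_line(gx, gy) for name, (gx, gy) in groups.items()}


def predict(models, x, low_cut, high_cut):
    """Route the input to the model of its regime."""
    a, b = models[regime(x, low_cut, high_cut)]
    return a * x + b


def true_flow(x):
    """Hypothetical piecewise response used only to generate toy data."""
    if x < 10:
        return 1.0 * x
    if x < 50:
        return 2.0 * x - 10.0
    return 4.0 * x - 110.0


xs = [2, 4, 6, 8, 10, 20, 30, 40, 50, 60, 70, 80]
ys = [true_flow(x) for x in xs]
models = fit_by_regime(xs, ys, low_cut=10, high_cut=50)
print(predict(models, 65, 10, 50))
```

Because the toy response has a different slope in each regime, a single global line would fit none of them well, whereas the three regime models each recover their own segment. Note that `predict` raises a `KeyError` for a regime that had no training samples, which mirrors the data-scarcity concern for high flows raised above.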
As a physically based distributed hydrological model, the geomorphology-based hydrological model (GBHM) developed by Yang et al. (1998) has been successfully applied in many regions, such as the Yangtze River (Li et al., 2015), Mekong River (Wang et al., 2016) and Jiulong River (Lu et al., 2018). In this study, GBHM is used to provide sufficient training samples for ML models. With gradient-based optimization schemes, the traditional ANN model may converge prematurely and become trapped at local optima. In addition, the traditional model is very sensitive to the initial conditions, including the initial weights and biases (Yang et al., 2017, Zanchettin and Ludermir, 2007). To overcome these drawbacks, a genetic algorithm (GA) is used in this study to optimize the initial conditions of the ANN, an approach that can achieve an optimal solution (Bahrami et al., 2016). Thus, the GA-ANN model is used for the hydrological simulation in this study.
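The GA-ANN idea described above can be illustrated with a small Python sketch: a genetic algorithm searches for a good set of initial weights and biases, and gradient descent then refines that starting point. Everything specific here is an assumption made for illustration, not the configuration used in the study: the network size, the truncation selection, uniform crossover, Gaussian mutation, elitism, the finite-difference gradients and the toy data.

```python
import math
import random

HIDDEN = 3            # hypothetical hidden-layer size
N_W = 3 * HIDDEN + 1  # input weights, hidden biases, output weights, output bias


def forward(w, x):
    """One-input, one-output network with a tanh hidden layer."""
    out = w[-1]
    for j in range(HIDDEN):
        out += w[2 * HIDDEN + j] * math.tanh(w[j] * x + w[HIDDEN + j])
    return out


def mse(w, data):
    """Mean squared error of the network over (x, y) pairs."""
    return sum((forward(w, x) - y) ** 2 for x, y in data) / len(data)


def ga_initial_weights(data, rng, pop_size=30, generations=40, sigma=0.3):
    """Evolve weight vectors; the best individual always survives (elitism)."""
    pop = [[rng.uniform(-1, 1) for _ in range(N_W)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda w: mse(w, data))
        parents = pop[: pop_size // 2]   # truncation selection
        children = [pop[0][:]]           # keep the current best unchanged
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            child = [rng.choice(pair) for pair in zip(a, b)]  # uniform crossover
            child = [g + rng.gauss(0, sigma) if rng.random() < 0.2 else g
                     for g in child]                          # Gaussian mutation
            children.append(child)
        pop = children
    return min(pop, key=lambda w: mse(w, data))


def fine_tune(w, data, lr=0.05, steps=200, eps=1e-5):
    """Gradient descent from the GA start; returns the best weights seen."""
    w = w[:]
    best, best_loss = w[:], mse(w, data)
    for _ in range(steps):
        base = mse(w, data)
        grad = []
        for i in range(N_W):
            w[i] += eps
            grad.append((mse(w, data) - base) / eps)  # forward difference
            w[i] -= eps
        w = [wi - lr * gi for wi, gi in zip(w, grad)]
        loss = mse(w, data)
        if loss < best_loss:
            best, best_loss = w[:], loss
    return best


# Toy nonlinear target, y = x^2 on [0, 1].
data = [(i / 10, (i / 10) ** 2) for i in range(11)]
w0 = ga_initial_weights(data, random.Random(0))
w1 = fine_tune(w0, data)
print(mse(w0, data), mse(w1, data))
```

The division of labour is the point of the design: the population-based search explores many starting points without using gradients, so it is less sensitive to any single random initialization, while the gradient step does the local refinement.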
In this study, we develop a physical process and ML combined hydrological model for continuous daily hydrological simulations in data-limited watersheds located in northern Thailand. With rainfall data generated by a stochastic rainfall generator, a physically based distributed hydrological model is used to generate sufficient streamflow samples for training and validation in the data-insufficient basins. The spatial features are extracted from the predictor images by using CV algorithms and are mapped to the hydrological responses by categorization models. We strive to (1) verify the effectiveness of the distributed hydrological model in supporting the ML-based data-driven model in data-insufficient basins; (2) explore the use of spatial information and predictors extracted by CV; (3) investigate the effectiveness of the categorization approach in improving high flow simulations; and (4) test the applicability of the combined model using real data.
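The data-generation step of this workflow can be sketched as follows in Python. The sketch is only a stand-in under loudly stated assumptions: the rainfall generator is a two-state Markov chain with exponential wet-day amounts (the study's generator may differ), and a single linear reservoir replaces the distributed GBHM purely so that the example runs. All function names and parameter values are hypothetical.

```python
import random


def synthetic_rainfall(n_days, rng, p_wet_after_dry=0.3,
                       p_wet_after_wet=0.6, mean_mm=8.0):
    """Two-state Markov chain for wet/dry days, exponential amounts on wet days."""
    rain, wet = [], False
    for _ in range(n_days):
        p = p_wet_after_wet if wet else p_wet_after_dry
        wet = rng.random() < p
        rain.append(rng.expovariate(1.0 / mean_mm) if wet else 0.0)
    return rain


def linear_reservoir(rain, k=0.2, storage=0.0):
    """Stand-in process model: daily outflow is a fixed fraction k of storage."""
    flow = []
    for p in rain:
        storage += p
        q = k * storage
        storage -= q
        flow.append(q)
    return flow


def training_pairs(rain, flow, lag=3):
    """Pair the last `lag` days of rainfall with the current day's flow."""
    return [(rain[t - lag + 1: t + 1], flow[t])
            for t in range(lag - 1, len(rain))]


rng = random.Random(1)
rain = synthetic_rainfall(365, rng)
flow = linear_reservoir(rain)
pairs = training_pairs(rain, flow)
print(len(pairs))  # 365 days with a 3-day lag window -> 363 samples
```

Running the process model on arbitrarily long synthetic rainfall series is what allows the number of training samples to exceed what the short observation record alone could provide.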
Study area
The study area (see Fig. 1), the Ping River Basin, is located in northern Thailand with a drainage area of 26,386 km²; this basin is upstream of the Bhumibol Reservoir, which is the largest reservoir in Thailand. The Ping River is one of the main tributaries of the Chao Phraya River and flows through Chiang Mai, which is the largest city in northern Thailand. The elevation of the basin ranges from 229 m in the south to 2572 m in the north, and the spatial heterogeneity of the topography and
Modeling approaches
To overcome the limitations of previous ML-based hydrological models, we propose three corresponding modeling approaches and then develop a physical process and machine learning combined hydrological model. In addition, we evaluate the effectiveness of the proposed modeling framework by comparing different data-driven models step by step.
Results
In this section, we present the calibration and validation results of the GBHM and compare the performance of four data-driven models to demonstrate the effectiveness of the modeling framework proposed in this study. The prediction capability of the proposed model is examined with the data observed from 1 January 2010 to 31 December 2016 in the study area.
Comparison of different modeling approaches
1) Effectiveness of the hydrological mechanism-guided modeling approach
To demonstrate the effectiveness of hydrological mechanism-guided modeling approaches, we compare the performances of the GBHM-ANN with those of a simple ANN. GBHM-ANN, which adopts sufficient synthetic data for training and validation, performs better at all three gauges. The NSE values of the simulation results at the three gauges are improved by an average of 23% during the test period. Specifically, with sufficient
Conclusion
This study proposed three approaches to improve the prediction capability of machine learning-based hydrological models and developed a physical process and ML combined hydrological model for rainfall-runoff simulation in data-insufficient watersheds. The model performance in learning the hydrological behaviors of the physically based distributed hydrological model was evaluated with a synthetic dataset, and the prediction capability was investigated using real data. According to the results, we
CRediT authorship contribution statement
Shuyu Yang: Conceptualization, Methodology, Data curation, Writing - original draft, Writing - review & editing. Dawen Yang: Supervision, Conceptualization, Methodology, Writing - review & editing, Funding acquisition. Jinsong Chen: Supervision, Methodology, Writing - review & editing. Jerasorn Santisirisomboon: Resources. Weiwei Lu: Investigation. Baoxu Zhao: Investigation.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgements
This study was financially supported by the National Natural Science Foundation of China (project No. 41661144031). The authors would like to thank the Royal Irrigation Department of Thailand and Department of Land Development of Thailand for providing the historical observations, including the daily runoff data at three gauges (P.1, P73 and P12), the daily precipitation, the daily air temperature (mean, maxima and minima), the daily mean relative humidity, the daily sunshine duration and the
References (86)
End-to-end differentiable learning of protein structure
Cell Syst.
(2019)
Application of artificial neural network coupled with genetic algorithm and simulated annealing to solve groundwater inflow problem to an advancing open pit mine
J. Hydrol.
(2016)
A rainwater harvesting system reliability model based on nonparametric stochastic rainfall generator
J. Hydrol.
(2010)
Flood susceptibility modelling using novel hybrid approach of reduced-error pruning trees with bagging and random subspace ensembles
J. Hydrol.
(2019)
Flow forecast by SWAT model and ANN in Pracana basin, Portugal
Adv. Eng. Softw.
(2009)
Comparison of four regionalisation methods for a distributed hydrological model
J. Hydrol.
(2007)
Development of regionalisation procedures using a multi-model approach for flow simulation in an ungauged catchment
J. Hydrol.
(2007)
Automatic recognition of SEM microstructure and phases of steel using LBP and random decision forest operator
Measurement
(2020)
Description of interest regions with local binary patterns
Pattern Recogn.
(2009)
A hybrid approach to monthly streamflow forecasting: Integrating hydrological model outputs into a Bayesian artificial neural network
J. Hydrol.
(2016)
A computer vision-based approach to fusing spatiotemporal data for hydrological modeling
J. Hydrol.
Potential application of wavelet neural network ensemble to forecast streamflow for flood management
J. Hydrol.
1D-local binary pattern based feature extraction for classification of epileptic EEG signals
Appl. Math. Comput.
Quantifying the impacts of small dam construction on hydrological alterations in the Jiulong River basin of Southeast China
J. Hydrol.
Detection and attribution of trends in magnitude, frequency and timing of floods in Spain
J. Hydrol.
Establishing a rainfall threshold for flash flood warnings in China’s mountainous areas based on a distributed hydrological model
J. Hydrol.
River flow forecasting through conceptual models part I—A discussion of principles
J. Hydrol.
A comparative study of texture measures with classification based on featured distributions
Pattern Recogn.
Impacts of climate warming on the frozen ground and eco-hydrology in the Yellow River source region, China
Sci. Total Environ.
Parameterisation, calibration and validation of distributed hydrological models
J. Hydrol.
Simulation and forecasting of streamflows using machine learning models coupled with base flow separation
J. Hydrol.
Markov random field modeling, inference & learning in computer vision & image understanding: a survey
Comput. Vis. Image Underst.
Multi-label classification models for sustainable flood retention basins
Environ. Modell. Softw.
Feature selection methods for characterizing and classifying adaptive Sustainable Flood Retention Basins
Water Res.
An enhanced artificial neural network with a shuffled complex evolutionary global optimization with principal component analysis
Inf. Sci.
An enhanced extreme learning machine model for river flow forecasting: State-of-the-art, practical applications in water resource engineering area and future research direction
J. Hydrol.
Large-scale baseflow index prediction using hydrological modelling, linear and multilevel regression approaches
J. Hydrol.
Daily streamflow prediction using optimally pruned extreme learning machine
J. Hydrol.
Instance-based learning algorithms
Mach. Learn.
SWAT2000: current capabilities and research opportunities in applied watershed modelling
Hydrol. Process.
Neural Network Toolbox™ user's guide
Improving flood forecasting capability of physically based distributed hydrological models by parameter optimization
Hydrol. Earth Syst. Sci.
Liuxihe model and its modeling to river basin flood
J. Hydrol. Eng.
Streamflow prediction using LASSO-FCM-DBN approach based on hydro-meteorological condition classification
J. Hydrol.
Status of automatic calibration for hydrologic models: comparison with multilevel expert calibration
J. Hydrol. Eng.
Learning to predict the cosmological structure formation
Proc. Natl. Acad. Sci. U.S.A.
Multi-ring local binary patterns for rotation invariant texture classification
Neural Comput. Appl.
Artificial neural network modeling of the rainfall-runoff process
Water Resour. Res.
Distributed hydrological model transferability across basins with different hydro-climatic characteristics
Hydrol. Process.
Hole-Filled Seamless SRTM data V4
Development of WEP model and its application to an urban watershed
Hydrol. Process.
Using oceanic-atmospheric oscillations for long lead time streamflow forecasting
Water Resour. Res.