1 Introduction

Precipitation is usually measured using rain gauges, ground-based weather radars and satellite sensors. For continental-scale estimates, precipitation reanalysis is generated by assimilating observations into weather prediction or climate models, and precipitation reanalysis products have been widely used for hydrological modelling (Michaelides et al. 2009). Several such products have been developed with different objectives, temporal ranges, data sources and resolutions, as summarized in Table 1. Precipitation is also one of the most important forcing variables for hydrological modelling. Thus, it is crucial to evaluate how hydrological models respond to different precipitation products, both to gain insight into model uncertainty and to characterize the biases of the individual reanalysis products. Although rainfall measurements are assimilated into the reanalysis, different data sources and assimilation systems lead to different accuracies, which are difficult to assess. Therefore, hydrological modellers are often confronted with the question of how their hydrological simulations depend on the choice of precipitation product and which product performs best for a given region and time (Bitew et al. 2012).

Table 1 Overview of seven (3 or 6-) hourly gridded precipitation products evaluated in this study. Abbreviations in the data source column are defined as follows: G: gauge; S: satellite; and R: reanalysis

Numerous studies have evaluated precipitation products to understand their respective advantages and limitations (Beck et al. 2019, 2020; Maggioni et al. 2016). Some studies (Dinku et al. 2008; Gao and Liu 2013; Hersbach et al. 2020; Hirpa et al. 2010) used statistical methods and rain gauge observations for the evaluation, while others (Bitew et al. 2012; Qi et al. 2016; Stisen and Sandholt 2010; Tang et al. 2016) employed hydrological modelling to compare the simulated streamflow or soil moisture with the corresponding observations. Although hydrological models offer the possibility of using observed water discharge to evaluate precipitation products, previous studies have focused primarily on wet headwater areas and small basins when using hydrological models (Bitew et al. 2012; Tang et al. 2016; Wang et al. 2023), because quantifying streamflow in large watersheds is difficult, especially in arid and semiarid areas (Pilgrim et al. 1988; Vorosmarty and Sahagian 2000).

This study aims to evaluate precipitation products in large-scale, arid and semiarid basins using a sub-continental scale hydrological model that accounts for the significant impacts of human activities on the water cycle. Specifically, we evaluate the precipitation reanalysis products for the Yellow River Basin using the offline Atmospheric and Hydrological Modelling System (AHMS) integrated with an irrigation module (AHMS-IRR hereafter, Jiang et al. (2022)). The Yellow River Basin offers a compelling case study for assessing the accuracy of these precipitation products because of its diverse range of climates and geomorphic features. We first run the hydrological model with the various precipitation products and then analyse the model’s response. Our evaluation covers widely used near-surface precipitation reanalysis products, namely NCEP/NCAR, ERA5, GLDAS 2.0, WFDE5/CRU, WFDE5/(CRU+GPCC) and CMFD. We directly compare these products with rain gauge observations and indirectly assess their reliability by comparing the model-simulated streamflow with streamflow observations.

2 Study area and data

2.1 Study area

As shown in Fig. 1, the Yellow River Basin (795,000 km2) is the largest watershed in northern China, and the Yellow River is the second-longest river in the country (5464 km). The Yellow River flows through the Qinghai-Tibetan Plateau, the Inner Mongolia and Loess Plateaus and the Huang-Huai-Hai (Yellow, Huai and Hai Rivers) Plain. Much of the Loess Plateau is an arid and semiarid region. The average annual temperature of the Yellow River Basin is −4 °C, and the annual precipitation is approximately 450 mm; precipitation is heavily affected by the East Asian monsoon and unevenly distributed. According to the Yellow River Water Resources Bulletin, the middle and upper reaches up to the Huayuankou station cover an area of 730,036 km2 and account for 91.82% of the total basin area. The annual runoff at the Huayuankou station is 56.7 billion m3. The Yellow River downstream from the Huayuankou station is an above-ground hanging river with a small catchment area covering 3% of the basin. Therefore, the present study evaluates precipitation data for the part of the Yellow River Basin upstream of the Huayuankou station.

Fig. 1

Location and topography of the Yellow River Basin. The map includes the Yellow River network, the Loess Plateau and the four main hydrological stations referred to in the main text, i.e. Tangnaihe (TNH), Lanzhou (LZ), Toudaoguai (TDG) and Huayuankou (HYK)

2.2 Data

Table 1 presents seven gridded precipitation products, including one gridded rain gauge product and six precipitation reanalysis products. These products were classified into three types: the first was based exclusively on gauge data, the second on reanalysis data that are not bias-corrected by gauge data and the third on the assimilation of reanalysis data and gauge data. Amongst these, we consider one first-type (OBS), two second-type (NCEP/NCAR and ERA5) and four third-type (WFDE5/CRU, WFDE5/(CRU+GPCC), GLDAS 2.0 and CMFD) products. These seven gridded precipitation products were further interpolated to the AHMS-IRR simulation grids with a resolution of 20 km in this study. They were used as forcing data for the model runs and for comparative analysis. Monthly, seasonal and annual precipitation data were generated by averaging the gridded daily precipitation data.

2.2.1 Non-gauge-corrected reanalysis precipitation products

NCEP/NCAR (National Centers for Environmental Prediction and the National Center for Atmospheric Research) Reanalysis is a global dataset of atmospheric variables to support the needs of the climate research communities (Kalnay et al. 1996). A frozen state-of-the-art global data assimilation system and a complete database were used in the NCEP/NCAR Reanalysis project.

ERA5 is the fifth-generation reanalysis product created by the European Center for Medium-Range Weather Forecasts (ECMWF) for global weather and climate from 1979 to near real-time (Hersbach et al. 2020). This reanalysis is carried out with the 4D-Var data assimilation and model forecasts in CY41R2 of the Integrated Forecast System (IFS). ERA5 hourly data on single levels are used in the present work. ERA5 has the following main advantages compared to other reanalysis products: higher spatial and temporal resolutions, advanced modelling and data assimilation systems (4D-Var), and a more considerable amount of integrated historical observations and parameters.

2.2.2 Observed precipitation products

An observed hourly gridded precipitation product across China was developed by Yang et al. (2010) based on the observed daily precipitation data provided by the China Meteorological Administration (CMA). For this product, daily rain gauge data from 833 weather stations across China, spanning the period from 1951 to 2006, were gridded to a 1° grid resolution using the method described in Milly and Dunne (2002) and Xia (2008), which considers the topographic effects on rainfall. This spatial resolution was conditioned by the distribution of weather stations in western China, shown in Fig. 17 for the Yellow River Basin. Furthermore, the daily gridded precipitation was downscaled to hourly values by using a random statistical function that is the same over a calendar month (Waichler and Wigmosta 2003). This gridded precipitation observation product was selected because it covers data from the majority of meteorological stations during the time period considered and provides an hourly temporal resolution. These features are essential for conducting large-scale hydrological model simulations and facilitating climate model diagnostics.

2.2.3 Gauge-corrected reanalysis precipitation products

WATCH Forcing Data methodology applied to ERA5 (WFDE5) is derived from ERA5 with the methodology used to derive the widely employed water, energy and climate change (WATCH) forcing data. It is often used as a meteorological forcing dataset for land surface and hydrological models (Cucchi et al. 2020). The referenced reanalysis consists of a bias-corrected reconstruction of near-surface meteorological variables based on ERA5 (Hersbach et al. 2020).

Precipitation fields in the WFDE5 have been adjusted using monthly-scale bias corrections based on Climatic Research Unit (CRU) and Global Precipitation Climatology Centre (GPCC) data. Two different precipitation products (WFDE5/CRU and WFDE5/(CRU+GPCC)) come from WFDE5, one corrected using the CRU TS product and the other corrected using both the CRU TS and the GPCC observational products.

The China Meteorological Forcing Dataset (CMFD) is a high spatiotemporal resolution gridded near-surface meteorological dataset that was developed for studies on land surface processes (He et al. 2020). This dataset combines remote sensing and reanalysis products with in situ observations from weather stations. Precipitation fields in the CMFD are produced based on the assimilation of 753 weather stations from the CMA and gridded background data (base reanalysis) GLDAS NOAH10SUBP 3H for the research period of this work, i.e. from 1979 to 1989.

The goal of the Global Land Data Assimilation System (GLDAS) is to integrate satellite and ground-based observational data products using advanced land surface modelling and data assimilation techniques to generate optimal fields of land surface states and fluxes (Rodell et al. 2004). GLDAS 2.0 is one of the two components of the GLDAS Version 2 (GLDAS-2) dataset, which was produced by GLDAS. Moreover, GLDAS 2.0 was derived from the global meteorological forcing dataset from Princeton University (Sheffield et al. 2006). Specifically, precipitation in GLDAS 2.0 was constructed by combining global observation-based products from the Global Precipitation Climatology Project (GPCP) daily product, the Tropical Rainfall Measuring Mission (TRMM, in operation from 1997 to 2015) 3-hourly real-time dataset and the National Centers for Environmental Prediction–National Center for Atmospheric Research (NCEP/NCAR) reanalysis.

2.2.4 Streamflow dataset

To calibrate and validate streamflow, observed daily streamflow (m3 s−1) for the period 1979–1987 was acquired from the National Science and Technology Infrastructure of China (http://gre.geodata.cn), and data came from the four main gauging stations in the basin, i.e. Tangnaihe (TNH), Lanzhou (LZ), Toudaoguai (TDG) and Huayuankou (HYK).

2.2.5 Evapotranspiration dataset

The Global Land Evaporation Amsterdam Model (GLEAM) v3.5 dataset (Martens et al. 2017) was applied to validate the model prediction of evapotranspiration. The choice of using the GLEAM data for evaluating simulated evapotranspiration is based on its robust foundation. GLEAM employs a water balance equation and assimilates multiple observations, including surface net radiation, near-surface air temperature and surface soil moisture, which collectively provide a comprehensive picture of the land surface dynamics (Martens et al. 2016, 2017). GLEAM’s evapotranspiration estimates offer insights into real-world evapotranspiration, including the effects of irrigation.

3 Methods

3.1 The offline mode of the coupled Atmospheric and Hydrological Modelling System integrated IRRigation module (AHMS-IRR)

The Atmospheric and Hydrological Modelling System (AHMS) is a fully coupled atmospheric and hydrological modelling system (Jiang et al. 2022, 2020; Xia 2019; Xia et al. 2022) that combines the Weather Research and Forecasting Model, WRF (Skamarock and Klemp 2008), with a physically-based distributed regional hydrological model, HMS (Yu et al. 2006), through the land surface model NoahMP-LSM (Chen and Dudhia 2001; Niu et al. 2011). The coupling method developed in WRF-Hydro (Gochis et al. 2020) for downscaling and upscaling the variables associated with the land surface and hydrological models was adopted. AHMS can either be run offline by using prescribed near-surface atmospheric forcing variables or coupled with the WRF model. A schematic illustration of the online and offline AHMS versions is shown in Fig. 2. The near-surface atmospheric forcing data required to run the offline AHMS include incoming shortwave and longwave radiation, near-surface specific humidity, air temperature and wind, precipitation and surface pressure (see Table S1 in the Supplement).

Fig. 2

Schematic diagram of the AHMS-IRR which integrates the Atmospheric and Hydrological Modelling System (AHMS) with a new irrigation model (Jiang et al. 2022)

The AHMS-IRR used here was developed by Jiang et al. (2022) and incorporates a new irrigation module within the land surface, flow routing and groundwater models of the AHMS, allowing for the explicit representation of irrigation processes in the water cycle. The irrigation amount was calculated based on the soil moisture deficit method, the water availability in rivers or lakes and an irrigation fraction dataset (see Section S1 in the Supplement for more details). Figure 3 shows a schematic representation of the hydrological cycle in AHMS-IRR. In previous work, we have shown that simulations using the offline AHMS-IRR yield good agreement between predicted and observed streamflow in the Yellow River Basin for the time period considered (see Fig. S1 in the Supplemental Material of the present manuscript). In contrast, the agreement is not satisfactory if irrigation is neglected in the simulations (Jiang et al. 2022), as also shown in Fig. S1 of the Supplemental Material. By incorporating irrigation into the simulation, a more realistic hydrological response was obtained near the outlet of the Yellow River Basin compared to previous studies (Cong et al. 2009). The model results thus demonstrate the ability of AHMS-IRR to reproduce long-term hydrological processes in the Yellow River Basin, provided that water taken for irrigation is included in the simulation.

Fig. 3

Sketch of the hydrologic cycle simulated in AHMS-IRR, adopted from Jiang et al. (2022)

3.2 Hydrological model setup

The AHMS-IRR model was configured for the Yellow River Basin with a temporal resolution of one hour and a spatial resolution of 20 km (Jiang et al. 2022). The key static input parameters, including vegetation type and soil texture data, are aggregated from the 30 arc-second data of the hybrid State Soil Geographic Database Food and Agriculture Organization (STATSGO/FAO) soil texture dataset and the USGS 24-category vegetation (land use) datasets. The performance of AHMS-IRR was assessed using a combination of ground-based (in situ) and remote-sensing data sources. This evaluation encompassed observed streamflow, terrestrial water storage data derived from the Gravity Recovery and Climate Experiment (GRACE) and assimilated evaporation data provided by the Global Land Evaporation Amsterdam Model (GLEAM). Based on the sensitivity analysis conducted in Jiang et al. (2022), the most crucial parameters for calibrating the model against observations were identified as the soil saturated hydraulic conductivity in the rainfall-runoff model, Manning’s roughness coefficient in the flow routing model, and the irrigation-triggered soil moisture availability and leaf area index in the irrigation model. Because the objective of this study is to evaluate precipitation products using hydrological modelling, we calibrated only the rainfall-runoff model parameter, i.e. the soil saturated hydraulic conductivity. The calibrated values of Manning’s roughness coefficient, irrigation-triggered soil moisture availability and leaf area index were taken from the study of Jiang et al. (2022).

Furthermore, following the approach of Bitew et al. (2012) and Tang et al. (2016), we designed two scenarios to distinguish between uncertainties related to rainfall products and those related to models. For Scenario 1, the rainfall-runoff model parameters were calibrated using gauged precipitation and remained constant for the other precipitation products. For Scenario 2, the rainfall-runoff model parameters were recalibrated for each precipitation product.

Table 2 lists the two hydrological settings used for model calibration or validation. In Scenario 1, the offline AHMS-IRR was calibrated for gauged precipitation from 1 Jan 1979 to 31 Dec 1983. The model was then validated using the same parameters (Para 1) for each of the six (non-) gauge-corrected precipitation reanalysis products described above from 1 Jan 1984 to 31 Dec 1987.

Table 2 Two scenarios were designed to evaluate different precipitation products using the offline AHMS-IRR

In Scenario 2, the offline AHMS-IRR was recalibrated for each precipitation product, and the set of parameters was assigned corresponding names (Para 1 to Para 7). For scenarios 1 and 2, the performance of six (non-) gauge-corrected precipitation reanalysis products was assessed by comparing simulated and observed streamflow from major hydrological stations during the validation period. Moreover, it should be noted that the other required near-surface atmospheric forcing variables (as shown in Table S1), except for precipitation, were the same for each setup and were derived from ERA5 in Scenarios 1 and 2.

Furthermore, Table S3 in Section S5 of the Supplement shows the calibrated parameters (soil saturated hydraulic conductivity) of the rainfall-runoff model in the four subbasins, including Headwater-Tangnaihe, Tangnaihe-Lanzhou, Lanzhou-Toudaoguai and Toudaoguai-Huayuankou. Specifically, we calibrated the rainfall-runoff model against the mean annual observed runoff in the calibration period (from 1979 to 1983). The parameters in the rainfall-runoff model were calibrated here to keep the BIAS (see next section) of runoff for each reanalysis product as small as possible. Moreover, Table S4 in Section S5 of the Supplement presents the simulated and observed mean annual area runoff from 1979 to 1987, including the calibration and validation periods in the Yellow River Basin. Our choice for the simulation period from 1979 to 1987, with the first five years dedicated to calibration and the last four years for validation, was determined in view of the following constraints:

(1) Meteorological Data Availability: Many near-surface meteorological products begin in 1979, making it a practical starting point for our simulation and analysis.

(2) Human Activities: The construction of the Longyangxia reservoir (with a storage capacity of 27.6 × 10^9 m3) in the 1980s significantly altered the Yellow River’s hydrology. Since the AHMS-IRR does not currently account for reservoirs and dams, and given the absence of comprehensive reservoir management data, we intentionally focused on the pre-1990 period to limit these anthropogenic influences.

(3) Limited Streamflow Data: The availability of streamflow observation data along the Yellow River is constrained, which further restricted our choice of the simulation period.
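The two-scenario design described in this section can be illustrated with a deliberately simplified sketch. The function `toy_runoff` and its parameter `k` are hypothetical stand-ins for the AHMS-IRR rainfall-runoff model and its soil saturated hydraulic conductivity, and the bisection search and all numbers are assumptions made for illustration only; they are not the calibration procedure or data of this study. Only the calibration target, a relative BIAS of runoff close to zero, follows the text. For brevity, the sketch evaluates BIAS on the same made-up data used for calibration instead of a separate validation period.

```python
def toy_runoff(precip, k):
    """Hypothetical rainfall-runoff stand-in: larger k means less runoff."""
    return [p / (1.0 + k) for p in precip]


def bias_percent(sim, obs):
    """Relative bias in percent, as in Eq. (6)."""
    return (sum(sim) - sum(obs)) / sum(obs) * 100.0


def calibrate(precip, obs_runoff, k_lo=0.0, k_hi=100.0, tol=1e-6):
    """Bisection on k until |BIAS| < tol (BIAS decreases monotonically with k)."""
    k_mid = 0.5 * (k_lo + k_hi)
    for _ in range(200):
        k_mid = 0.5 * (k_lo + k_hi)
        b = bias_percent(toy_runoff(precip, k_mid), obs_runoff)
        if abs(b) < tol:
            break
        if b > 0:      # too much runoff: increase k
            k_lo = k_mid
        else:          # too little runoff: decrease k
            k_hi = k_mid
    return k_mid


# Made-up data in arbitrary units (not values from this study).
obs_runoff = [2.0, 4.0, 6.0]
products = {
    "OBS": [10.0, 20.0, 30.0],   # gauge-based precipitation
    "WET": [15.0, 30.0, 45.0],   # hypothetical product that is 50% too wet
}

# Scenario 1: calibrate once with gauge precipitation, reuse for every product.
k_gauge = calibrate(products["OBS"], obs_runoff)
scenario1 = {name: bias_percent(toy_runoff(p, k_gauge), obs_runoff)
             for name, p in products.items()}

# Scenario 2: recalibrate the parameter separately for each product.
scenario2_k = {name: calibrate(p, obs_runoff) for name, p in products.items()}
scenario2 = {name: bias_percent(toy_runoff(p, scenario2_k[name]), obs_runoff)
             for name, p in products.items()}
```

In this toy setting, the precipitation bias of the wet product passes directly into the runoff under Scenario 1, whereas under Scenario 2 it is absorbed by the recalibrated parameter.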

3.3 Evaluation metrics

The agreement between the model-predicted and observed values was quantified by employing the following performance metrics: the Nash–Sutcliffe model efficiency coefficient (NSE) (Nash and Sutcliffe 1970), the root-mean-square error (RMSE), the Pearson correlation coefficient (PCC), the mean absolute error (MAE), the mean absolute percentage error (MAPE), the relative bias in percent (BIAS) and the modified Kling-Gupta efficiency (mKGE) (Gupta et al. 2009; Kling et al. 2012). These performance metrics are defined as follows; their value ranges and the values corresponding to perfect agreement are summarized in Table S2 in the Supplement.

$$\textrm{NSE}=1.0-\frac{\sum_{i=1}^N{\left({P}_s^i-{P}_o^i\right)}^2}{\sum_{i=1}^N{\left({P}_o^i-\overline{P_o}\right)}^2}$$
(1)
$$\textrm{RMSE}=\sqrt{\frac{1}{N}\sum_{i=1}^N{\left({P}_s^i-{P}_o^i\right)}^2}$$
(2)
$$\textrm{PCC}=\frac{\sum_{i=1}^N\left({P}_o^i-\overline{P_o}\right)\left({P}_s^i-\overline{P_s}\right)}{\sqrt{\sum_{i=1}^N{\left({P}_s^i-\overline{P_s}\right)}^2\sum_{i=1}^N{\left({P}_o^i-\overline{P_o}\right)}^2}}$$
(3)
$$\textrm{MAE}=\frac{1}{N}\sum_{i=1}^N\left|{P}_o^i-{P}_s^i\right|$$
(4)
$$\textrm{MAPE}=\frac{1}{N}\sum_{i=1}^N\left|\frac{P_o^i-{P}_s^i}{P_o^i}\right|\times 100\%$$
(5)
$$\textrm{BIAS}=\frac{\sum_{i=1}^N\left({P}_s^i-{P}_o^i\right)}{\sum_{i=1}^N{P}_o^i}\times 100\%$$
(6)
$${\displaystyle \begin{array}{c}\textrm{mKGE}=1-\sqrt{{\left( PCC-1\right)}^2+{\left(\beta -1\right)}^2+{\left(\gamma -1\right)}^2}\\ {}\beta =\textrm{bias}\ \textrm{ratio}=\frac{\mu_s}{\mu_o}\\ {}\gamma =\textrm{variability}\ \textrm{ratio}=\frac{\sigma_s/{\mu}_s}{\sigma_o/{\mu}_o}\end{array}}$$
(7)

where \(P_s^i\) and \(P_o^i\) are the ith simulated and observed values of the variable P, respectively, N is the number of values, and \(\overline{P_s}\) and \(\overline{P_o}\) are the mean values of the simulated and observed series, respectively. Moreover, \(\mu_s\) and \(\mu_o\) represent the averages of the simulated and observed values, respectively, while \(\sigma_s\) and \(\sigma_o\) correspond to the standard deviations of the simulated and observed values, respectively.
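As a minimal sketch, the seven metrics can be written in plain Python as follows. The function names are our own, population standard deviations (division by N) are assumed for the variability ratio, and the PCC uses the standard Pearson definition, with the product of the two sums of squares under the square root. MAPE is undefined when an observed value is zero, which the sketch does not handle.

```python
import math


def _mean(values):
    return sum(values) / len(values)


def _std(values):
    # Population standard deviation (division by N); an assumption of this sketch.
    m = _mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / len(values))


def nse(sim, obs):
    mo = _mean(obs)
    num = sum((s - o) ** 2 for s, o in zip(sim, obs))
    den = sum((o - mo) ** 2 for o in obs)
    return 1.0 - num / den


def rmse(sim, obs):
    return math.sqrt(sum((s - o) ** 2 for s, o in zip(sim, obs)) / len(obs))


def pcc(sim, obs):
    ms, mo = _mean(sim), _mean(obs)
    cov = sum((o - mo) * (s - ms) for s, o in zip(sim, obs))
    ss_s = sum((s - ms) ** 2 for s in sim)
    ss_o = sum((o - mo) ** 2 for o in obs)
    return cov / math.sqrt(ss_s * ss_o)


def mae(sim, obs):
    return sum(abs(o - s) for s, o in zip(sim, obs)) / len(obs)


def mape(sim, obs):
    # Raises ZeroDivisionError if any observed value is zero.
    return sum(abs((o - s) / o) for s, o in zip(sim, obs)) / len(obs) * 100.0


def bias(sim, obs):
    return sum(s - o for s, o in zip(sim, obs)) / sum(obs) * 100.0


def mkge(sim, obs):
    r = pcc(sim, obs)
    beta = _mean(sim) / _mean(obs)                                    # bias ratio
    gamma = (_std(sim) / _mean(sim)) / (_std(obs) / _mean(obs))       # variability ratio
    return 1.0 - math.sqrt((r - 1) ** 2 + (beta - 1) ** 2 + (gamma - 1) ** 2)
```

For a simulated series that is exactly twice the observed one, the correlation and the variability ratio are both one while the bias ratio is two, so mKGE is 0 and BIAS is 100%. This shows how a pure volume error is penalized even when the timing is perfect.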

3.4 Standard deviation of elevation

The standard deviation of elevation is used to depict the complexity of the terrain and is defined as follows:

$$\textrm{SDE}=\sqrt{\frac{\sum_{i=1}^N{\left({x}_i-\mu \right)}^2}{N}}$$
(8)

where SDE is the standard deviation of elevation of the larger grid cell, which has a 20 km resolution in this study, \(x_i\) is the elevation of the ith sub-grid at a 1 km resolution in the cell, N is the number of sub-grids at a 1 km resolution in the larger cell and μ is the average elevation of the cell. Here, a moving window of 20 × 20 was used to calculate the standard deviation of elevations. Values close to zero indicate no variation (i.e. flat areas), while high standard deviations indicate areas with complex and steep terrain.
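A minimal sketch of Eq. (8) is given below. It assumes that the fine-resolution elevations are stored as a nested list and that the window is applied as non-overlapping blocks aligned with the coarse cells; the function name `sde_grid` and the `block` argument are hypothetical.

```python
import math


def sde_grid(elevation, block=20):
    """Population standard deviation of elevation for each block x block tile.

    elevation: 2D list of fine-grid elevations (e.g. 1 km sub-grids).
    block:     number of sub-grids per coarse cell side (20 for 20 km cells).
    Rows and columns that do not fill a complete tile are ignored.
    """
    n_rows = len(elevation) // block
    n_cols = len(elevation[0]) // block
    result = []
    for bi in range(n_rows):
        row = []
        for bj in range(n_cols):
            cells = [elevation[i][j]
                     for i in range(bi * block, (bi + 1) * block)
                     for j in range(bj * block, (bj + 1) * block)]
            mu = sum(cells) / len(cells)
            row.append(math.sqrt(sum((x - mu) ** 2 for x in cells) / len(cells)))
        result.append(row)
    return result
```

With `block=2`, a flat tile such as `[[100, 100], [100, 100]]` gives an SDE of zero, while a tile alternating between 0 and 10 m gives 5 m.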

4 Results

4.1 Performance evaluation on the monthly, seasonal and annual scales using gauge observations

In this section, we evaluated the performance of six reanalysis precipitation products, including NCEP/NCAR, GLDAS 2.0, ERA5, WFDE5/CRU, WFDE5/(CRU+GPCC) and CMFD, across annual, seasonal and monthly timeframes by comparing them with gridded gauge precipitation (OBS). For this comparative analysis, we obtained monthly, seasonal and annual precipitation data by resampling the gridded daily precipitation data at a 20 km resolution using the mean method.
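The temporal aggregation by the mean method can be sketched for a single daily series (one grid cell or a basin average) as follows. The helper names are hypothetical, and assigning December to the winter of the following year is an assumption of this sketch, since the text does not state how seasons spanning two calendar years were handled.

```python
from collections import defaultdict
from datetime import date

SEASON_OF_MONTH = {12: "DJF", 1: "DJF", 2: "DJF",
                   3: "MAM", 4: "MAM", 5: "MAM",
                   6: "JJA", 7: "JJA", 8: "JJA",
                   9: "SON", 10: "SON", 11: "SON"}


def resample_mean(days, values, key):
    """Average daily values within the groups defined by key(day)."""
    groups = defaultdict(list)
    for day, value in zip(days, values):
        groups[key(day)].append(value)
    return {k: sum(v) / len(v) for k, v in groups.items()}


def monthly_key(day):
    return (day.year, day.month)


def seasonal_key(day):
    # Assumption: December belongs to the DJF season of the following year.
    year = day.year + 1 if day.month == 12 else day.year
    return (year, SEASON_OF_MONTH[day.month])


def annual_key(day):
    return day.year
```

The result is a mean daily precipitation per period (mm per day); multiplying by the number of days in the period converts it to a period total.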

Figure 4 displays the average yearly precipitation of each product along with the observed annual precipitation. As shown, the non-gauge-corrected reanalysis precipitation products (NCEP/NCAR and ERA5) significantly overestimated precipitation by 50% and 45%, respectively. Furthermore, the CMFD results were almost identical to the observed precipitation on an annual scale.

Fig. 4

Annual precipitation in the Yellow River Basin from 1979 to 1989: NCEP/NCAR (brown dashed line), OBS (orange solid line), GLDAS 2.0 (orange dashed line), ERA5 (green dashed line), WFDE5/CRU (blue dashed line), WFDE5/(CRU+GPCC) (pink dashed line) and CMFD (green dotted line)

Moreover, Table 3 lists the annual precipitation statistics of the seven products from 1979 to 1989. As shown, NCEP/NCAR yielded the highest average annual precipitation (664 mm), had the widest range in the annual mean precipitation (514–772 mm) and was associated with the largest deviation (82.3 mm) over the Yellow River Basin. In comparison, WFDE5/CRU had the narrowest range in the annual mean precipitation (331–479 mm) as well as the smallest associated deviation (46.3 mm). As shown in Fig. 5, the annual precipitation density from the CMFD yielded the best agreement with the gauge data (OBS), followed by the WFDE5/(CRU+GPCC). Moreover, WFDE5/CRU and GLDAS 2.0 yielded an overall low precipitation level with a peak of approximately 450 mm. Furthermore, the non-gauge-corrected NCEP/NCAR and ERA5 products significantly overestimated the precipitation levels in the study period.

Table 3 Annual precipitation statistics for the Yellow River Basin from 1979 to 1989 (unit: mm), associated with the different products considered in this study
Fig. 5

Density curves of the annual precipitation of the products over the Yellow River Basin from 1979 to 1989, obtained by kernel density estimation (KDE), i.e. the estimation of the probability density function of a random variable using kernels as weights
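A density curve of this kind can be sketched with a Gaussian kernel as shown below. The kernel type and the normal-reference bandwidth rule (1.06 σ n^(−1/5)) are assumptions of this sketch, since the kernel and bandwidth used for Fig. 5 are not stated, and the sample values are made up for illustration.

```python
import math


def gaussian_kde(samples, bandwidth=None):
    """Return a function x -> estimated probability density at x."""
    n = len(samples)
    if bandwidth is None:
        mean = sum(samples) / n
        std = math.sqrt(sum((x - mean) ** 2 for x in samples) / (n - 1))
        bandwidth = 1.06 * std * n ** (-1 / 5)   # normal-reference rule of thumb
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2)
                          for s in samples)

    return density


# Made-up annual precipitation values in mm (not taken from Table 3).
annual_precip = [400.0, 450.0, 430.0, 480.0, 410.0]
density = gaussian_kde(annual_precip)
```

Evaluating `density` on a regular grid of precipitation values gives the curve to plot; with only 11 annual values per product, the shape of such a curve depends strongly on the bandwidth.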

Figure 6 displays the spatial distributions of the precipitation from all products and the Pearson correlation coefficients (PCCs) for the six reanalysis products in the Yellow River Basin (upland of Huayuankou) from 1979 to 1989. This figure shows that the average annual precipitation gradually increased from northwest to southeast. The areas with the highest annual precipitation were located in the southern Yellow River Basin, including the upper reaches of this basin, the Weihe Basin and the Sanmenxia to Huayuankou area (see Fig. 1). In contrast, the lowest annual precipitation occurred at Ningxia and the Hetao Plateau in the northwest of the basin. According to the PCC and RMSE, precipitation data from WFDE5/(CRU+GPCC) (PCC=0.99 and RMSE=0.08) and CMFD (PCC=0.98 and RMSE=0.09) were the most consistent with OBS. GLDAS 2.0 and WFDE5/CRU slightly underestimated precipitation in the upper reaches of the river, while the non-gauge-corrected precipitation products (NCEP/NCAR and ERA5) significantly overestimated precipitation, especially in the mountainous area of the basin. For further details and interpretations, please refer to the discussion in Sect. 5a.

Fig. 6

Spatial distribution of the daily mean precipitation of the seven precipitation products considered in the present study, i.e. (a) OBS, (b) NCEP/NCAR, (c) ERA5, (d) GLDAS 2.0, (e) WFDE5/CRU, (f) WFDE5/(CRU+GPCC) and (g) CMFD, from 1979 to 1989 over the Yellow River Basin (upland of Huayuankou), along with the root-mean-square error (RMSE, mm) and the Pearson correlation coefficient (PCC) between the reanalysis product and OBS for each grid in the basin

To evaluate extreme precipitation, we calculated the 99th percentile of the daily precipitation for each precipitation product. The results are presented in Fig. 7. Based on the RMSE and PCC values calculated over the research area, the spatial distribution of extreme precipitation (i.e. 99th percentile precipitation) predicted from WFDE5/(CRU+GPCC) (PCC=0.97 and RMSE=1.81) and CMFD (PCC=0.96 and RMSE=2.13) exhibited the best agreement with the observations (OBS). Remarkably, although NCEP/NCAR and ERA5 exhibited similar annual precipitation, their spatial patterns of extreme precipitation showed significant differences, as indicated by the RMSE values of 8.09 and 4.13 and the PCC values of 0.21 and 0.93 for NCEP/NCAR and ERA5, respectively. As illustrated in Fig. 7, the statistical comparison suggests that all gauge-corrected reanalysis precipitation products underestimate extreme rainfall to various degrees, whereas non-gauge-corrected reanalysis precipitation products (ERA5) tend to overestimate extreme rainfall.
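The 99th percentile calculation can be sketched as follows. The text does not state whether the percentile was taken over all days or only over wet days, nor which interpolation rule was used; this sketch assumes all days and linear interpolation between order statistics. The function names and the dictionary layout are hypothetical.

```python
def percentile(values, q):
    """q-th percentile (0 <= q <= 100) of a non-empty sequence,
    using linear interpolation between order statistics."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lower = int(pos)                         # floor, since pos >= 0
    upper = min(lower + 1, len(ordered) - 1)
    frac = pos - lower
    return ordered[lower] + frac * (ordered[upper] - ordered[lower])


def extreme_precip_map(daily_by_cell, q=99.0):
    """Apply the percentile to the daily series of every grid cell.

    daily_by_cell: dict mapping a grid-cell identifier to its daily series.
    """
    return {cell: percentile(series, q)
            for cell, series in daily_by_cell.items()}
```

The resulting per-cell values for a product and for OBS can then be compared across all cells with the RMSE and PCC to obtain spatial agreement scores of the kind reported above.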

Fig. 7

Spatial distribution of the 99th percentile precipitation of the seven precipitation products considered in the present study, i.e. (a) OBS, (b) NCEP/NCAR, (c) ERA5, (d) GLDAS 2.0, (e) WFDE5/CRU, (f) WFDE5/(CRU+GPCC) and (g) CMFD, from 1979 to 1989 over the Yellow River Basin (upland of Huayuankou), along with the root mean square error (RMSE, mm) and the Pearson correlation coefficient (PCC) between the reanalysis product and OBS for each grid in the basin

Figure 8 shows the monthly precipitation of the products in the Yellow River Basin from 1979 to 1989 and the average annual cycle of monthly precipitation for this period. As shown in Fig. 8, NCEP/NCAR and ERA5 overestimated precipitation in all months, while all other products did not differ much in terms of monthly precipitation. Table 4 lists the monthly precipitation statistics of the six reanalysis precipitation products. These monthly statistics showed that CMFD yielded the highest correlation coefficient. Similarly, CMFD attained the lowest MAE, MAPE and RMSE.

Fig. 8

(a) Monthly precipitation of the products over the Yellow River Basin from 1979 to 1989. (b) The annual cycle of monthly precipitation averaged over 1979–1989

Table 4 Statistics of the monthly precipitation of the products over the Yellow River Basin from 1979 to 1989 (unit: mm)

Figure 9 shows the average monthly evaluation metrics of the precipitation products in the Yellow River Basin from 1979 to 1989. CMFD had the lowest RMSE, MAE and MAPE, while NCEP/NCAR performed the worst. All products showed their largest errors in August. In contrast, Fig. 9d shows that the percentage error in precipitation for these products was significantly higher in December, January and February.

Fig. 9

Averaged monthly evaluation metrics of the precipitation products over the Yellow River Basin from 1979 to 1989: (a) RMSE (mm), (b) PCC, (c) MAE (mm) and (d) MAPE (%)

Similarly, Fig. 10a, c, d shows that CMFD (NCEP/NCAR) produced the lowest (largest) RMSE, MAE and MAPE for seasonal precipitation, while all products led to an overestimation of this precipitation in the summer. Figure 10d shows that the percentage error in precipitation for these products increased significantly in the winter, especially for NCEP/NCAR and ERA5. This finding is consistent with Beck et al. (2017) and Jiang et al. (2021): atmospheric models produce spurious drizzle and tend to overestimate low rainfall because of deficiencies in their rainfall generation parameterization schemes. For further insights and interpretations, please refer to Sect. 5a.

Fig. 10

Seasonal evaluation metrics of the precipitation products over the Yellow River Basin from 1979 to 1989: DJF (winter, December to February), MAM (spring, March to May), JJA (summer, June to August), SON (autumn, September to November), (a) RMSE (mm), (b) PCC, (c) MAE (mm) and (d) MAPE (%)

4.2 Performance evaluation using the offline AHMS-IRR

In this section, we evaluated the performance of the six reanalysis precipitation products by comparing the monthly streamflow simulated by the offline AHMS-IRR with the corresponding observations. Figure 11 compares the monthly streamflow simulated with gauge-calibrated parameters against the observed monthly streamflow in both the calibration and the validation periods at all four main hydrological stations (Tangnaihe, Lanzhou, Toudaoguai and Huayuankou). The monthly streamflow was significantly overestimated when using the NCEP/NCAR and ERA5 products and slightly underestimated when using the gauge-corrected reanalysis precipitation products. The Nash–Sutcliffe model efficiency coefficient (NSE) and the Kling-Gupta efficiency (KGE) for the validation period are displayed in Fig. 12, and additional results for both the calibration and validation periods are provided in Table S5 in the Supplement. As shown in Fig. 12, the model performed much better in the upper reaches than in the middle reaches. The disparities between observed and simulated streamflow may be linked to inadequate representations of precipitation patterns in these products. Notably, the model’s performance appears to degrade downstream, implying that errors accumulate along the river’s course.

Fig. 11

Comparison of monthly streamflow simulated by offline AHMS-IRR with gauge-calibrated parameters (Scenario 1). The comparison includes seven different precipitation products against observed monthly streamflow at all four main hydrological stations, i.e. Tangnaihe (a), Lanzhou (b), Toudaoguai (c) and Huayuankou (d). The calibration period ranges from January 1, 1979, to December 31, 1983, while the validation period spans from January 1, 1984, to December 31, 1987. Note that OBS represents the observed streamflow, while OBS_P represents the streamflow simulated using the gauged precipitation

Fig. 12

Performance of streamflow simulations forced by seven different precipitation products using the static rainfall-runoff parameters during the validation period (Scenario 1)

In particular, the highest values of NSE (0.82) and KGE (0.82) were observed at the Tangnaihe station in the validation period (from 1984 to 1987). Moreover, WFDE5/(CRU+GPCC) and CMFD outperformed the other products during the validation period, with the highest mean NSE and KGE values over the four stations, consistent with the statistical comparison in Sect. 4.1.

Figure 13 compares simulated and observed monthly streamflow, where the simulated values were obtained from the hydrological model run with recalibrated parameters. Furthermore, Fig. 14 and Table S6 in the Supplement show the model performance (NSE and KGE) associated with the different precipitation products in Scenario 2. The recalibration of the rainfall-runoff model substantially improved the model performance for all precipitation products compared to the results of Scenario 1 (Fig. 12). The improvement in model performance was particularly notable for models using gauge-corrected reanalysis precipitation products. For the validation period considered in the present work, the gauge-corrected reanalysis precipitation products, including WFDE5/(CRU+GPCC) and CMFD, demonstrated slightly better performance than other gauge-corrected products. However, non-gauge-corrected reanalysis precipitation products such as NCEP/NCAR and ERA5 exhibited poorer performance than gauge-corrected products. Overall, the performance of all gauge-corrected reanalysis precipitation products was similar.

Fig. 13

Same as in Fig. 11 but with product-specific calibrated parameters for each of the seven precipitation products (Scenario 2)

Fig. 14

Performance of offline AHMS-IRR simulations of streamflow forced by seven different precipitation products using the dynamic rainfall-runoff parameters during the validation period (Scenario 2)

Furthermore, we found that discrepancies in the pattern of extreme rainfall events drive the variations in calibrated soil parameters among the different precipitation products. By comparing Table S2 and Fig. 7, we observed a clear correlation between extreme rainfall events and the calibrated soil parameters. Specifically, in areas where extreme rainfall was underestimated by the precipitation product, the calibration process led to a reduction in soil saturated hydraulic conductivity. This reduction compensated for the underestimated extreme rainfall rates, aligning the model’s surface runoff with observed data. This dependency illustrates that the accuracy of rainfall products in depicting extreme rainfall events is critical for hydrological modelling.

Figure 15 displays the performance of the offline AHMS-IRR (RMSE and NSE) in predicting evapotranspiration by comparing it with the GLEAM results (Martens et al. 2017) for the different precipitation products used in Scenarios 1 and 2. The results suggest that simulations driven by the non-gauge-corrected precipitation products significantly overestimated evapotranspiration. Moreover, the recalibration of the rainfall-runoff model led to an increase in runoff and an improvement in the evapotranspiration simulation for the gauge-corrected precipitation products, whereas it did not produce similar effects for the non-gauge-corrected precipitation products. Therefore, both Figs. 14 and 15 reveal that when the precipitation bias exceeded a specific threshold (e.g. ERA5 and NCEP/NCAR with BIAS≥45%), calibrating the hydrological model was no longer effective in improving the model’s performance.

Fig. 15

Performance of offline AHMS-IRR simulations in terms of evapotranspiration forced by seven different precipitation products using the static (Scenario 1 in blue) and dynamic (Scenario 2 in yellow) rainfall-runoff parameters for the period 1980–1987

In Scenario 2, as shown in Fig. 15 and Table S7, the simulated evapotranspiration for selected precipitation products (CMFD, GLDAS 2.0, WFDE5/CRU and WFDE5/(CRU+GPCC)) agrees better with GLEAM than the simulation driven by observed precipitation. The reasons for this divergence between the AHMS-IRR simulation and the GLEAM data are examined below.

As shown in Fig. S2, the AHMS-IRR simulation driven by observed precipitation tends to slightly overestimate evapotranspiration during the summer, especially in response to extreme rainfall events. It is worth noting that, as indicated in Fig. S3, this overestimation is, in part, offset by the underestimation of precipitation during extreme rainfall events (see Fig. 7), resulting in a closer agreement with GLEAM data.

Our current model calibration focused primarily on soil parameters, specifically against runoff. We acknowledge that incorporating additional parameters and targets, such as vegetation characteristics and evapotranspiration, may further enhance model performance. Future model calibration efforts will consider these factors to achieve a more comprehensive representation of land surface processes.

5 Discussion

5.1 Why do the non-gauge-corrected reanalysis products overestimate the precipitation in the Yellow River Basin?

Many previous studies concluded that reanalysis products overestimate the observed precipitation in areas of complex terrain, which is consistent with the present study (see Fig. 6). Amjad et al. (2020) and Izadi et al. (2021) found that ERA5 consistently overestimated the observed precipitation over the mountainous areas of Turkey and Iran, while Jiang et al. (2021) reported that ERA5 tended to overestimate light precipitation events, especially in mountainous areas.

There are two possible reasons for the overestimation of precipitation in areas with complex terrain. On the one hand, this overestimation may be due to the scarcity of weather stations in such areas, where the limited observations do not capture precipitation patterns in sufficient detail (Jiao et al. 2021). On the other hand, precipitation simulations in complex terrain are inaccurate because their spatial resolution is relatively coarse. In addition, as discussed in previous work (Amjad et al. 2020; Izadi et al. 2021), the mechanisms of humidity injection and turbulence through land surface-atmosphere interactions in complex terrain are poorly understood. Furthermore, the lack of snow removal processes in the ERA5 simulation could lead to excessive snowfall and precipitation in the Tibetan Plateau region, according to Orsolini et al. (2019). These findings are consistent with the fact that the percentage errors of non-gauge-corrected precipitation products (ERA5 and NCEP/NCAR) were much larger in winter than in summer (see Figs. 9, 10 and Sect. 4.1).

To analyse the effect of the complex topography of the Yellow River Basin on the quality of the reanalysis precipitation data, we calculated Pearson correlation coefficients (PCCs) between the mean annual observed precipitation and the mean annual precipitation of each reanalysis product as a function of the standard deviation of elevation over the Yellow River Basin. According to Eq. 8, the standard deviation of elevation was calculated at a 20 km resolution from a digital elevation model at a 1 km resolution (USGS 1K). As shown in Fig. 16a, the PCC between the average annual observed precipitation and NCEP/NCAR reanalysis precipitation decreased rapidly with increasing standard deviation of elevation, while ERA5 precipitation maintained a highly stable correlation with the observed data. This behaviour was most likely due to the higher resolution of ERA5 (0.25°, compared with 2.5° for NCEP/NCAR) and its advanced 4D-Var data assimilation system. Surprisingly, at high standard deviations of elevation (>230 m), the PCC for ERA5 increased slightly and fluctuated, which was also observed for the gauge-corrected reanalysis precipitation products (see Fig. 16b). As shown in Fig. 16c, this behaviour may be explained by the scarcity of grid cells with a high standard deviation of elevation. Furthermore, Fig. 16b shows that the gauge-corrected reanalysis precipitation data maintained a more stable correlation with the observed data across the range of standard deviations of elevation than the non-gauge-corrected reanalysis precipitation products. Specifically, WFDE5/(CRU+GPCC) and CMFD performed better than GLDAS 2.0 and WFDE5/CRU.
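The terrain analysis described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: Eq. 8 is not reproduced here, so the use of the population standard deviation, the non-overlapping block aggregation (e.g. 20 × 20 cells of a 1 km grid) and the grouping of grid cells into bins of standard deviation of elevation are assumptions, and the function names `block_std`, `pearson` and `pcc_by_relief` are hypothetical.

```python
import math

def block_std(dem, block):
    """Population standard deviation of elevation in each non-overlapping
    block x block window of a 2-D elevation grid (list of rows).
    Assumption: Eq. 8 of the paper may use a different estimator."""
    rows, cols = len(dem), len(dem[0])
    out = []
    for r0 in range(0, rows - block + 1, block):
        out_row = []
        for c0 in range(0, cols - block + 1, block):
            cells = [dem[r][c]
                     for r in range(r0, r0 + block)
                     for c in range(c0, c0 + block)]
            mean = sum(cells) / len(cells)
            var = sum((z - mean) ** 2 for z in cells) / len(cells)
            out_row.append(math.sqrt(var))
        out.append(out_row)
    return out

def pearson(x, y):
    """Pearson correlation coefficient; None if either series is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None
    return sxy / math.sqrt(sxx * syy)

def pcc_by_relief(sigma, obs, product, edges, min_cells=3):
    """PCC between observed and product mean annual precipitation for the
    grid cells whose standard deviation of elevation falls in [lo, hi).
    Returns (lo, hi, number of cells, PCC or None) for each bin."""
    result = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        idx = [i for i, s in enumerate(sigma) if lo <= s < hi]
        if len(idx) < min_cells:
            result.append((lo, hi, len(idx), None))  # too few cells
            continue
        r = pearson([obs[i] for i in idx], [product[i] for i in idx])
        result.append((lo, hi, len(idx), r))
    return result
```

Reporting the number of cells per bin alongside the PCC makes it visible when a bin contains too few grid cells for the correlation to be meaningful, which is the situation the text describes for the highest standard deviations of elevation.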

Fig. 16

Pearson correlation coefficient (PCC) between the average annual observed precipitation and six reanalysis precipitation products as a function of the standard deviation of elevation calculated in each 20 km grid cell over the Yellow River Basin (a) and (b). (c) The number of 20 km grid cells as a function of the standard deviation of elevation over the Yellow River Basin

Considering the influence of complex terrain on precipitation simulations, the simulation grid for future climate modelling should be refined in areas of complex terrain to reach a pre-defined threshold of the standard deviation of elevation (Bacon et al. 2000; Dietachmayer 1992). This refinement process improves the simulation of precipitation in areas with complex terrain without wasting computing capacity (Skamarock and Klemp 2008).

5.2 Why do CMFD and WFDE5/(CRU+GPCC) agree the best with the precipitation observations?

The strong agreement of the CMFD and WFDE5/(CRU+GPCC) products with precipitation observations can be attributed to several key factors discussed below.

As indicated by the comparison of annual and monthly precipitation (Figs. 4, 6, 7 and 9) and monthly streamflow (Figs. 12 and 14), CMFD performs exceptionally well and is nearly on par with the gridded rain gauge product and WFDE5/(CRU+GPCC). Its superior performance can be attributed to a combination of factors: its high spatial resolution (0.1°), a substantial volume of rain gauge data and the incorporation of remote sensing data. As indicated by He et al. (2020), CMFD integrates a more extensive set of weather station observations from the China Meteorological Administration (CMA), which began sharing data through the China Meteorological Data Service Center (CMDC) in recent years.

However, when comparing the performance of CMFD and WFDE5/(CRU+GPCC), it is clear that the high spatial resolution of CMFD (0.1°) does not yield a clear advantage over WFDE5/(CRU+GPCC) (0.5°). Therefore, we attribute the better performance of CMFD and WFDE5/(CRU+GPCC) to the incorporation of a significantly larger number of weather stations, as explained below.

The disparities in the performance of gauge-corrected reanalysis precipitation products are closely linked to the number of assimilated rain gauges. During the study period (1979–1989), only a few weather stations from CMA were shared globally through the Global Telecommunications System (GTS). Figure 17 shows the distribution of a total of 82 CMA weather stations in the Yellow River Basin, including 22 national benchmark climate stations shared worldwide through the GTS and the other 60 national ordinary stations not shared via the GTS. This is an important reason why the gauge-corrected reanalysis precipitation products (WFDE5/CRU and GLDAS 2.0) underestimated the annual precipitation compared with gauge observations, and it explains why CMFD and WFDE5/(CRU+GPCC) agreed the best with the precipitation observations.

Fig. 17

Distribution of weather stations of the China Meteorological Administration in the Yellow River Basin (upstream of HYK station). Twenty-two national benchmark climate stations are shared worldwide through the Global Telecommunication System (GTS) (red points), and the other 60 national ordinary stations (green points) are not shared through the GTS

Specifically, CMFD and WFDE5/(CRU+GPCC) employed a larger number of stations than WFDE5/CRU and GLDAS 2.0. According to He et al. (2020) and Yu et al. (2020), the CMFD uses approximately 753 stations across China, while WFDE5/(CRU+GPCC) essentially assimilates the same stations. In contrast, only approximately 300 stations across China, shared through the Global Telecommunications System (GTS), have been incorporated to derive the WFDE5/CRU and GLDAS 2.0 products. In this study, we found that WFDE5/CRU and GLDAS 2.0 slightly underestimated the observed annual precipitation, which is consistent with the findings of Shi et al. (2017). These authors also concluded that the CRU TS product underestimated precipitation over the three-river headwaters region of China.

5.3 What challenges do hydrological model uncertainties and anthropogenic influences pose in this study?

In this study, we employed an offline AHMS-IRR and measured water discharge to assess (non-) gauge-corrected precipitation reanalysis. Although the use of offline AHMS-IRR and measured water discharge provides us with an additional dimension for assessing precipitation, it is associated with additional uncertainties related to the parameters of the rainfall-runoff model and the flow routing model. Therefore, to reduce model uncertainty as much as possible, we recalibrated the rainfall-runoff model separately for each meteorological forcing dataset in the present study. In addition, we used two evaluation metrics to evaluate model performance, i.e. the Nash-Sutcliffe model efficiency coefficient (NSE) and the modified Kling-Gupta efficiency (mKGE) (Gupta et al. 2009; Kling et al. 2012). In doing so, we focused on the agreement between observed and simulated streamflow in terms of peak flows, correlation, bias and variability.

Furthermore, anthropogenic factors such as irrigation play an important role in hydrological processes and profoundly impact the water balance in arid and semiarid basins such as the Yellow River Basin. To quantify anthropogenic influences, we used the recently introduced and extended offline AHMS-IRR that applies to arid and semiarid regions by explicitly considering the effects of irrigation on hydrological processes (Jiang et al. 2022). The extended offline AHMS-IRR integrates a new irrigation module, which computes the dynamic irrigation demand, into the modified land surface, flow routing and groundwater models. This new extended offline AHMS-IRR explicitly represents anthropogenic processes, such as irrigation, which significantly impact the water balance in the arid and semiarid areas of the Yellow River Basin.

The pronounced discrepancy at Toudaoguai, also observed in companion papers (Jia et al. 2006; Yin et al. 2021), may arise from various factors, including the complex diversion and drainage process in the Hetao irrigation district of Inner Mongolia. Another potential contributing factor could be the reliance on a single set of empirical hydraulic geometry equations in the channel routing model of AHMS-IRR and the absence of bathymetry observation data, which may not comprehensively capture the intricate bathymetry of the Yellow River, especially in its middle and lower reaches. Furthermore, NSE is sensitive to peak flow values (Krause et al. 2005). As suggested by Gupta et al. (2009) and Kling et al. (2012), mKGE (Modified Kling-Gupta Efficiency) offers a more balanced metric, taking into account three independent criteria: correlation coefficient, bias ratio and variability ratio.
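To make the difference between the two metrics concrete, the following minimal Python sketch computes NSE and mKGE using the standard definitions from the cited literature (Gupta et al. 2009; Kling et al. 2012), in which the bias ratio is the ratio of simulated to observed mean and the variability ratio is the ratio of the coefficients of variation. The function names `nse` and `mkge` and the example series are hypothetical and are not taken from the study.

```python
import math

def nse(obs, sim):
    """Nash-Sutcliffe efficiency: 1 minus the ratio of the squared error
    to the variance of the observations around their mean."""
    mean_obs = sum(obs) / len(obs)
    num = sum((o - s) ** 2 for o, s in zip(obs, sim))
    den = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - num / den

def mkge(obs, sim):
    """Modified Kling-Gupta efficiency built from three components:
    r (correlation), beta (bias ratio) and gamma (variability ratio,
    i.e. the ratio of the coefficients of variation)."""
    n = len(obs)
    mo, ms = sum(obs) / n, sum(sim) / n
    so = math.sqrt(sum((o - mo) ** 2 for o in obs) / n)
    ss = math.sqrt(sum((s - ms) ** 2 for s in sim) / n)
    r = sum((o - mo) * (s - ms) for o, s in zip(obs, sim)) / (n * so * ss)
    beta = ms / mo
    gamma = (ss / ms) / (so / mo)
    return 1.0 - math.sqrt((r - 1) ** 2 + (beta - 1) ** 2 + (gamma - 1) ** 2)

# Illustrative series only: a simulation that is exactly twice the observations.
obs = [1.0, 2.0, 3.0, 4.0]
sim = [2.0, 4.0, 6.0, 8.0]
scores = (nse(obs, sim), mkge(obs, sim))
```

For this example the simulation is perfectly correlated with the observations and has the same coefficient of variation, so only the bias ratio departs from 1: mKGE is 0, whereas NSE drops to -5 because the squared errors are dominated by the largest flows.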

As seen in Figs. S3 and S4 and Tables S8 and S9, we also evaluated the model’s performance based on daily streamflow for the Tangnaihe, Lanzhou, Toudaoguai and Huayuankou stations. We indeed observed a slightly worse performance for daily streamflow than for monthly streamflow, particularly when using the NSE metric.

It is worth noting that our study area encompasses a large basin with extensive human interventions, including complex diversion and drainage processes in the irrigation district, damming and reservoir operations, which can strongly affect the behaviour of daily streamflow. Moreover, the riverbed profile, channel width and Manning roughness coefficient are poorly known, which constrains the accuracy of the routing model and limits our ability to represent daily streamflow dynamics. As mentioned earlier, NSE is sensitive to peak flows; monthly streamflow therefore offers a more robust basis for assessing model performance across the various precipitation products.

5.4 Why do the calibrated model parameters vary with the different precipitation products?

In our study, the calibration process is primarily based on matching the model-simulated surface runoff with observed data. Surface runoff is fundamentally determined by the balance between incoming water from rainfall and the soil’s infiltration capacity. Detailed calculations for infiltration capacity and infiltration-excess runoff in this study can be found in Section S2 of the Supplement.

However, the following two main factors make the calibration process challenging:

  1. Heterogeneity of rainfall and soil properties: Both rainfall and the surface soil properties exhibit spatiotemporal heterogeneity (Yu 1998). Rainfall patterns vary across time and space, and soil properties also exhibit variations across the watershed. This inherent heterogeneity complicates the calibration process, as it requires modifying the soil parameters to account for inaccuracies in the rainfall data.

  2. Impact of extreme rainfall events: Surface runoff generation and peak river runoff are strongly influenced by extreme rainfall events. These events often play a crucial role in shaping the calibrated soil parameters. The accurate representation of rainfall extremes in precipitation products significantly impacts the magnitude of the calibrated soil parameters.

In our analysis, we examined the relationship between extreme rainfall events and the calibrated soil parameters. By comparing Table S2 and Fig. 7, we observed a clear correlation. Specifically, in areas where extreme rainfall was underestimated by the precipitation dataset, the calibration process led to a reduction in soil saturated hydraulic conductivity. This reduction compensated for the underestimated extreme rainfall rates, aligning the model’s surface runoff with observed data. This dependency illustrates that the accuracy of rainfall products in depicting extreme rainfall events is critical for hydrological modelling.

In short, we found that the discrepancies in the pattern of extreme rainfall events drive variations in the calibrated soil parameters with different precipitation products. The calibration process aims to ensure that the model accurately reproduces surface runoff, and variations in calibrated soil properties are a means to achieve this goal, considering the challenges posed by heterogeneity in both rainfall and soil characteristics.

6 Summary and conclusions

In this study, we evaluated the performance of six widely used precipitation reanalysis products in the hydrology community for the Yellow River Basin, focusing on their accuracy against observed data and their influence on streamflow predictions using offline AHMS-IRR. The evaluated products included both non-gauge-corrected (NCEP/NCAR and ERA5) and gauge-corrected reanalysis products (GLDAS 2.0, WFDE5/CRU, WFDE5/(CRU+GPCC) and CMFD). We conducted a comprehensive evaluation based on annual, seasonal and monthly precipitation observation statistics, alongside a comparison of streamflow observations at four main hydrological stations along the Yellow River with corresponding predictions using a recently introduced regional hydrological model (Jiang et al. 2022). Each precipitation product served as the meteorological input for offline AHMS-IRR simulations, considering two model setups with static and dynamic parameters of the rainfall-runoff model. Our key findings can be summarized as follows:

  1. CMFD and WFDE5/(CRU+GPCC) demonstrated the best agreement with observed precipitation and streamflow in both the statistical and the hydrological comparisons in the Yellow River Basin. This superior performance can be attributed to the fact that CMFD and WFDE5/(CRU+GPCC) assimilate a significantly larger number of rain gauges than the other precipitation products, as well as to their high spatial resolutions.

  2. Non-gauge-corrected reanalysis products (NCEP/NCAR and ERA5) significantly overestimated annual precipitation, especially in mountain areas. ERA5 performed better than NCEP/NCAR due to its higher resolution and its advanced data assimilation system.

  3. The hydrological response in the Yellow River Basin is significantly influenced by the precipitation, and calibration of the rainfall-runoff model can compensate for data deficiencies. We found that the accurate representation of extreme rainfall events in the precipitation dataset plays a crucial role in shaping the calibrated soil parameters. This dependency illustrates that the accuracy of rainfall products in depicting extreme rainfall events is critical for hydrological modelling.

Hence, we recommend using CMFD or WFDE5/(CRU+GPCC) as near-surface meteorological forcing for hydrological modelling in the Yellow River Basin. To enhance reanalysis products, it is crucial to integrate the weather station observations shared by the China Meteorological Administration (CMA) through the China Meteorological Data Service Center (CMDC). Furthermore, precipitation simulations in mountainous regions can be improved through a deeper understanding and better parameterization of land surface-atmosphere interactions, coupled with higher-resolution simulations. It is important to carefully calibrate the hydrological model to reduce its dependence on the quality of the precipitation input, which has significant impacts on the components of the water balance. Our evaluation not only contributes to improved hydrological modelling practices but also holds broader implications for Earth system and climate modelling across various spatiotemporal scales.