Holistic environmental soil-landscape modeling of soil organic carbon

https://doi.org/10.1016/j.envsoft.2014.03.004

Highlights

  • A novel environmental soil-landscape modeling framework (the holistic ESLM) is presented.

  • The holistic ESLM is generalizable to other environmental domains due to its flexibility.

  • The holistic ESLM is built on the all-relevant and minimal-optimal variable selection theory.

  • All-relevant variable selection reveals insights to the soil–environment system modeled.

  • Minimal-optimal variable selection shows an objective way for parsimonious model development.

Abstract

In environmental soil-landscape modeling (ESLM), the selection of predictive variables is commonly contingent on the researchers' domain expertise on soil–environment processes. This variable selection strategy may suffer from bias or even fail in regions where the process knowledge is insufficient. To overcome this problem, this study demonstrates a holistic ESLM framework which consists of five components: model conceptualization, data compilation, process identification, parsimonious model calibration, and model validation. Based on the STEP-AWBH conceptual model, a comprehensive pool of 210 potential environmental variables that exhaustively cover pedogenic and environmental factors was constructed. This was followed by strategic variable selection and development of parsimonious prediction models using machine learning techniques. The all-relevant variable selection successfully identified the major and minor factors relevant to the variation of soil organic carbon (SOC), showing that the major factors explaining SOC variation in Florida were vegetation and the soil water gradient. Topography and climate showed moderate effects on SOC variation. Parsimonious SOC models developed using four minimal-optimal variable selection techniques and simulated annealing yielded optimal predictive performance with minimal model complexity. The holistic ESLM framework not only provides a new view of selecting and utilizing variables for predicting soil properties but can also assist in identifying the underlying processes of soil–environment systems of interest. Due to the flexibility of the framework to incorporate various types of variable selection and modeling techniques, the holistic environmental modeling strategy can be generalized to other environmental modeling domains for both prediction and process identification.

Introduction

Environmental soil-landscape modeling (ESLM) is a useful tool for predicting soil properties and classes and for understanding the relationships between soils and environmental factors (Grunwald, 2005). ESLM builds on the work of Jenny (1941) and V.V. Dokuchaev (Glinka, 1927), who conceptualized soil formation as a function of five factors, i.e., CLimate, Organism, Relief, Parent material, and Time (CLORPT model). The concept has undergone further development over the past decades. McBratney et al. (2003) first encapsulated the conceptual model into a quantitative framework with the SCORPAN model, which describes the relationships between soil and environmental factors for the purpose of spatial prediction of soils. For the past century, human activities have been influencing the environment, exerting critical impact on the pedosphere in terms of soil formation, change, and degradation (Richter et al., 2011). In response, the STEP-AWBH model (S: soil, T: topography, E: ecology, P: parent material, A: atmosphere, W: water, B: biota, H: human) was devised to explicitly model the effects induced by human activity on the soil system (Grunwald et al., 2011, Thompson et al., 2012).

It is common practice in ESLM to select environmental factors based on the researchers' domain knowledge of the soil–environment processes in the study area (Florinsky et al., 2002, Grunwald, 2009). This variable selection strategy relies heavily on the validity of the researchers' knowledge. When the process knowledge is not comprehensive, a limited set of predictor variables could lead to biased and suboptimal model performance (Grunwald, 2009). Therefore, it is necessary to adopt a less biased strategy that allows models to access a broad set of environmental variables representing the spectrum of possible soil-forming processes operating in a given landscape. The more exhaustive such a set of predictive variables is, the higher the potential to unravel complex soil–environmental interactions and identify an unbiased, optimal model to predict a soil property of interest. As Jakeman et al. (2006) argued, good practice in the development of environmental models should embrace alternative model families and structures to allow model comparison that avoids biased or even false conclusions drawn from a certain model favored by the model developer(s). Poggio et al. (2013) used stepwise methods to select the most predictive variables from a large pool of satellite-derived variables in a regional application of mapping properties; however, their covariate scope was confined to a Digital Elevation Model (DEM) and vegetation indices derived from Moderate-Resolution Imaging Spectroradiometer (MODIS) products, while other important factors (e.g., soil, atmosphere, water) were not considered.

With advances in Geographic Information Systems (GIS), the Global Positioning System (GPS), and remote and proximal sensing technologies, it is feasible to build a comprehensive pool of spatially exhaustive environmental variables to characterize a full spectrum of environmental properties. These spatially explicit environmental datasets are available in much greater abundance and at finer spatial resolutions than the more sparsely sampled soil pedon data. In that sense, digital environmental covariates serve as critical predictors for inferring soil properties, although it is usually not known which combination of environmental predictors has the highest predictive power in a given geographic region because of their scale-dependent behavior (Vasques et al., 2012). However, collecting a large set of predictive variables can itself be problematic. Key issues are redundancy and collinearity among the variables and the deleterious effects of noisy or non-informative variables. Strategic variable selection is required to identify the major ecosystem processes and parsimonious predictive models (Guyon and Elisseeff, 2003). In addition, variable selection can reduce model development and application time, increase model interpretability, and reduce overfitting (Belanche-Muñoz and Blanch, 2008, May et al., 2008). Variable selection has been an important research topic in machine learning. It involves two problems – minimal-optimal and all-relevant. The former seeks the minimum set of predictor variables yielding the best prediction accuracy (Guyon and Elisseeff, 2003, Nilsson et al., 2007), while the latter seeks all variables relevant to the target property. Therefore, the minimal-optimal set is of special interest for developing predictive models, while the all-relevant set has great value in understanding the mechanisms underlying the soil–environment relationship. Nilsson et al. (2007) gave an in-depth discussion of the relationship between the two problems and showed that the minimal-optimal set is a subset of the all-relevant set when the data follow a strictly positive distribution, which is the case for most data encountered in practical applications (Fig. 1).
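The all-relevant idea can be illustrated with a small shadow-variable test. The sketch below is only an illustration under stated assumptions, not the selection procedure used in this study: it uses absolute Pearson correlation as a stand-in importance score (a machine learning importance measure would normally take its place), and the variable names `ndvi`, `wetness`, and `noise`, the synthetic data, and the function `all_relevant` are all hypothetical. A variable is retained when its score beats the best permuted ("shadow") copy in most rounds.

```python
import random
from statistics import mean


def abs_corr(x, y):
    """Absolute Pearson correlation, used here as a stand-in importance score."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return abs(sxy) / (sxx * syy) ** 0.5


def all_relevant(features, target, rounds=50, min_hit_rate=0.9, seed=0):
    """Return names of features that beat the best shadow feature in most rounds."""
    rng = random.Random(seed)
    hits = {name: 0 for name in features}
    for _ in range(rounds):
        # Shadow features are permuted copies: any link to the target is destroyed,
        # so their best score estimates what pure chance can achieve.
        best_shadow = 0.0
        for values in features.values():
            shadow = list(values)
            rng.shuffle(shadow)
            best_shadow = max(best_shadow, abs_corr(shadow, target))
        for name, values in features.items():
            if abs_corr(values, target) > best_shadow:
                hits[name] += 1
    return sorted(n for n, h in hits.items() if h / rounds >= min_hit_rate)


# Hypothetical synthetic data: the target depends on two variables, not the third.
data_rng = random.Random(42)
n = 200
ndvi = [data_rng.gauss(0, 1) for _ in range(n)]
wetness = [data_rng.gauss(0, 1) for _ in range(n)]
noise = [data_rng.gauss(0, 1) for _ in range(n)]
soc = [2.0 * a + 1.5 * b + data_rng.gauss(0, 0.5) for a, b in zip(ndvi, wetness)]

selected = all_relevant({"ndvi": ndvi, "wetness": wetness, "noise": noise}, soc)
print(selected)
```

Unlike a minimal-optimal search, this test keeps every variable that carries signal, including redundant ones, which is why its output is better suited to process interpretation than to building a compact model.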

Environmental variables that represent soil-landscape processes may come as different data types, generally continuous and categorical (including ordinal and nominal). Categorical variables (e.g., land use and geology type) discretize observations (samples) into unbalanced groups and may pose problems for model validation and prediction. A common approach in model validation is data division, in which the observation data are split into calibration and validation sets (Bennett et al., 2013). The split may leave some classes of a categorical predictor underrepresented or not represented at all in the calibration set, which can lead to poor or failed predictions for those classes. The same issues related to modeling using categorical predictors can occur in validation mode. The occurrence of this problem increases exponentially as the number of categorical variables included in a model increases. Therefore, it is worthwhile to pay special attention to the categorical variables in ESLM and build models that strike a balance between model performance and the number of categorical variables used.
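The class-coverage problem described above can be checked directly before calibration. The following is a minimal sketch, assuming a single categorical predictor; the class names, sample counts, and the helpers `unseen_classes` and `stratified_split` are hypothetical and chosen only for illustration. It contrasts a worst-case unstratified split with a per-class split that guarantees each class at least one calibration sample.

```python
import random


def unseen_classes(calibration, validation):
    """Classes present in the validation set but absent from the calibration set."""
    return sorted(set(validation) - set(calibration))


def stratified_split(labels, calib_fraction=0.7, seed=0):
    """Split sample indices so every class contributes at least one calibration sample."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    calib, valid = [], []
    for members in by_class.values():
        rng.shuffle(members)
        k = max(1, round(calib_fraction * len(members)))
        calib.extend(members[:k])
        valid.extend(members[k:])
    return sorted(calib), sorted(valid)


# Hypothetical, unbalanced land-use predictor for 100 sampling sites.
land_use = ["pine"] * 60 + ["wetland"] * 30 + ["urban"] * 8 + ["mangrove"] * 2

# Worst case: an ordered 70/30 cut leaves the rare classes out of calibration.
naive_missing = unseen_classes(land_use[:70], land_use[70:])

# Per-class split: every class is represented in the calibration set.
calib_idx, valid_idx = stratified_split(land_use)
calib_labels = [land_use[i] for i in calib_idx]
valid_labels = [land_use[i] for i in valid_idx]
missing = unseen_classes(calib_labels, valid_labels)

print(naive_missing, missing)
```

With several categorical predictors, the same check would need to hold for each of them at once, which becomes harder to satisfy as their number grows.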

Soil organic carbon (SOC) is a key property that not only indicates soil quality but also has profound significance to the global climate system (Trumbore et al., 1996). Thus, the focus of this study is to model SOC in Florida, USA. The aim of the study was to demonstrate a new holistic ESLM strategy based on a comprehensive environmental variable pool using variable selection techniques that serve two purposes – revealing the underlying processes and making predictions of SOC. It involves five steps – model conceptualization, data compilation, process identification, predictive model calibration and model validation (Fig. 2). The specific objectives are threefold: 1) from a comprehensive set of environmental variables, identify an all-relevant set of variables of topsoil SOC in order to reveal the underlying SOC processes; 2) from the all-relevant set, identify the minimal-optimal sets that simplify models and optimize model performance for prediction; 3) explore the possibility of reducing the use of categorical variables in predictive models.
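The second objective, reducing an all-relevant set to a minimal-optimal one, amounts to a search over variable subsets. The sketch below shows one way such a search could be run with simulated annealing. It is an assumption-laden illustration rather than the procedure used in this study: the variable names, the `GAIN` values, the `PENALTY`, and `toy_score` are all hypothetical, and in practice the score would be a validated measure of prediction accuracy.

```python
import math
import random

# Hypothetical per-variable contributions to model skill (illustration only).
GAIN = {"ndvi": 0.40, "wetness": 0.30, "slope": 0.08,
        "rainfall": 0.05, "aspect": 0.01, "curvature": 0.005}
PENALTY = 0.03  # hypothetical cost per variable, rewarding parsimony


def toy_score(subset):
    """Toy objective: summed gains minus a complexity penalty (higher is better)."""
    return sum(GAIN[v] for v in sorted(subset)) - PENALTY * len(subset)


def anneal_subset(variables, score, steps=2000, t_start=1.0, t_end=1e-3, seed=0):
    """Search variable subsets by simulated annealing; return the best subset seen."""
    rng = random.Random(seed)
    current = set(variables)  # start from the full all-relevant set
    current_score = score(current)
    best, best_score = set(current), current_score
    for step in range(steps):
        # Geometric cooling from t_start down to t_end.
        temp = t_start * (t_end / t_start) ** (step / (steps - 1))
        # Propose a neighbour by flipping one variable in or out of the subset.
        candidate = current ^ {rng.choice(variables)}
        cand_score = score(candidate)
        delta = cand_score - current_score
        # Always accept improvements; accept worse moves with Boltzmann probability.
        if delta >= 0 or rng.random() < math.exp(delta / temp):
            current, current_score = candidate, cand_score
            if current_score > best_score:
                best, best_score = set(current), current_score
    return sorted(best), best_score


best_subset, best_score = anneal_subset(list(GAIN), toy_score)
print(best_subset, round(best_score, 3))
```

Because the toy objective is additive, the best subset is simply every variable whose gain exceeds the penalty; with a real accuracy measure, variables interact and the search is needed to find that trade-off.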

Study area

The study area is the state of Florida, located in the southeastern region of the United States, with latitudes from 24°27′ N to 31°00′ N and longitudes from 80°02′ W to 87°38′ W. Florida covers approximately 150,000 km² (United States Census Bureau, 2000). The climate is humid and subtropical in northern and central Florida and is humid and tropical in southern Florida. The mean annual precipitation of Florida is 1373 mm and the mean annual temperature is 22.3 °C (National Climatic Data

Characteristics of soil organic carbon measurements

The SOC in the top 20 cm of soil showed considerable variation, with a range of 33.7 kg m⁻², a mean of 5.0 kg m⁻², and a median of 3.4 kg m⁻² (Table 2). The data were strongly positively skewed, with most values lying at the low end. A high kurtosis value indicated that infrequent, large deviations of high SOC values accounted for much of the SOC variation, as opposed to frequent modest deviations. Both skewness and kurtosis indicate that the distribution of SOC values was highly

Conclusions

In this study, a new holistic ESLM strategy based on the STEP-AWBH conceptual model was demonstrated to effectively reveal soil–environmental processes and relationships while also making efficient predictions of SOC. An exhaustive set of 210 potential environmental variables was compiled to characterize Florida's soil-landscape based on state-of-the-art pedological knowledge and technical and computational capabilities. Models were developed to predict SOC (0–20 cm) using the comprehensive

Acknowledgments

This study was funded by USDA-CSREES-NRI grant award 2007-35107-18368 ‘Rapid Assessment and Trajectory Modeling of Changes in Soil Carbon across a Southeastern Landscape’ (National Institute of Food and Agriculture (NIFA) – Agriculture and Food Research Initiative (AFRI)). This project is a Core Project of the North American Carbon Program. The authors would like to thank Aja Stoppe, Christopher Wade Ross, Samiah Moustafa, Lisa Stanley, Adriana Comerford, and Anne Quidez for their hard work in

References (57)

  • J.A. Thompson et al. Digital soil mapping: Interactions with and applications for hydropedology.

  • J.H.M. Thornley et al. Soil carbon storage response to temperature: an hypothesis. Ann. Bot. (2001)

  • G. Vasques et al. Regional modelling of soil carbon at multiple depths within a subtropical watershed. Geoderma (2010)

  • F. Villa et al. Design of multi-paradigm integrating modelling tools for ecological research. Environ. Modell. Softw. (2000)

  • P.H. Bellamy et al. Carbon losses from all soils across England and Wales 1978–2003. Nature (2005)

  • L. Breiman. Bagging predictors. Mach. Learn. (1996)

  • L. Breiman. Random forests. Mach. Learn. (2001)

  • T.H. Cormen et al. Introduction to Algorithms. (1990)

  • P.M. Cox et al. Acceleration of global warming due to carbon-cycle feedbacks in a coupled climate model. Nature (2000)

  • E.A. Davidson et al. Temperature sensitivity of soil carbon decomposition and feedbacks to climate change. Nature (2006)

  • R.W. Eberhardt et al. Relationships among vegetation, surficial geology and soil water content at the Pocono mesic till barrens. J. Torrey Bot. Soc. (2000)

  • N. Fierer et al. Litter quality and the temperature sensitivity to decomposition. Ecology (2005)

  • Florida Fish and Wildlife Conservation Commission (FFWCC). Florida Vegetation and Land Cover Data Derived from Landsat ETM+ Imagery. (2003)

  • K.D. Glinka. Dokuchaiev's ideas in the development of pedology and cognate sciences. (1927)

  • S. Grunwald. Environmental soil-landscape modeling: geographic information technologies and pedometrics. (2005)

  • S. Grunwald et al. Digital soil mapping and modeling at continental scales: finding solutions for global issues. Soil Sci. Soc. Am. J. (2011)

  • L.B. Guo et al. Soil carbon stocks and land use change: a meta analysis. Glob. Change Biol. (2002)

  • Y. Guo et al. Analysis of factors controlling soil carbon in the conterminous United States. Soil Sci. Soc. Am. J. (2006)