Instance-based transfer learning for soil organic carbon estimation

Bursać, Petar; Kovačević, Miloš; Bajat, Branislav

doi:10.3389/fenvs.2022.1003918

ORIGINAL RESEARCH article

Front. Environ. Sci., 21 September 2022
Sec. Soil Processes
Volume 10 - 2022 | https://doi.org/10.3389/fenvs.2022.1003918

Petar Bursać

Miloš Kovačević

Branislav Bajat*

University of Belgrade, Faculty of Civil Engineering, Belgrade, Serbia

Soil organic carbon (SOC) is a vital component for sustainable agricultural production. This research investigates a transfer learning-based neural network model to improve classical machine learning estimation of SOC values from other geochemical and physical soil parameters. The results on datasets based on LUCAS data from 2015 showed that the instance-based transfer learning model captured the valuable information contained in different source domains (cropland and grassland) of soil samples when estimating the SOC values in arable cropland areas. The effects of using transfer learning are more pronounced when the source (grassland) and target (cropland) domains differ. The obtained results indicate that the transfer learning (TL) approach provides better or at least equal results compared to the classical machine learning procedure. The proposed TL methodology could be used to generate a pedotransfer function (PTF) for target domains with described samples and unknown related PTF outputs if described samples with known related PTF outputs are available from a source domain in a different geographic area or a similar land class.

1 Introduction

Soil organic carbon (SOC) is an essential part of the global carbon cycle (Bruhwiler et al., 2018) and one of the most significant soil parameters. It is the main element present in the soil organic matter (SOM), a significant nutrient source for crops, which is crucial to agricultural production (Obalum et al., 2017). SOC is a major source of terrestrial carbon and a pivotal element for soil quality and fertility, and it is important to terrestrial ecosystems because of its great potential to affect the climate, food security, and agricultural sustainability. The amount of organic carbon in soils is approximately double the amount of carbon present in the atmosphere (Schmidt et al., 2011). However, SOC has been found to decline in many regions, whereas atmospheric CO₂ steadily increases. There is a well-known link between SOC and climate (Horwath and Kuzyakov, 2018). Recent studies have shown that changes in land use and land management practice can affect SOC stocks more than climatic changes (Niu et al., 2021).

Measuring and assessing soil components and properties is generally a time-consuming and costly procedure. The absence of measured data can be compensated for by predictions or modeling (Kovačević et al., 2010). A commonly used approach is to estimate unavailable soil parameters indirectly, from the measured values of other parameters that are already available.

The spatial distribution of soil variables is usually realized by predictive soil mapping that presents a numerical or statistical model of the relationship among environmental variables and soil properties, which is then applied to a geographic database to create a predictive map (Scull et al., 2003). Most of those models are based on geostatistical or regression methods (McBratney et al., 2003). The increasing richness and accessibility of different data sources led to attempts to enrich the soil information through the use of ancillary data (McBratney et al., 2000). In the early age of digital soil mapping, the co-kriging geostatistical method was developed for predicting a target soil variable based on subsidiary variables that are cheaper or quicker to measure and well correlated with the target one (McBratney and Webster, 1983).

The other approach to the estimation of soil properties is based on a pedotransfer function (PTF) that presents a mathematical model used to convert direct soil measurements into known and unknown soil properties (Bouma, 1989). Since most soil properties are interdependent, PTF is widely used for modeling relationships in soil research, hydrology, environmental science, etc. Most PTFs are developed to predict soil properties for a geographical area with limited information available (McBratney et al., 2002).

In the last decades, machine learning (ML) techniques have taken a very important role in the spatio-temporal modeling of geochemical variables, especially in the assessment of their stocks and dynamic changes (Wadoux et al., 2020; Heuvelink et al., 2021). Most of those studies are primarily focused on mapping soil properties or classes from local to regional and even worldwide scales by using different ML algorithms (Taghizadeh-Mehrjardi et al., 2016; Hengl et al., 2017; Estévez et al., 2022). The final outputs are mainly digital soil maps (DSM) produced by different ML algorithms. As a consequence, the main goal of researchers was to model soil properties in space and time with satisfactory prediction accuracy, without obtaining new knowledge on the core soil structure and processes (Wadoux et al., 2020).

In recent years, ML has also been applied to develop PTFs in different environmental studies: Benke et al. (2020) developed a generalized PTF to predict soil electrical conductivity and SOC content for regional locations in the state of Victoria, Australia; Ramcharan et al. (2017) developed a PTF using the Random Forest algorithm to fill in data gaps for SOC stock volumetric assessments across the US; Gunarathna et al. (2019) used different ML algorithms to develop a PTF for available water content in tropical regions. All these studies mainly deal with the estimation of soil properties within the same geographical space, which implies similar or the same environmental conditions.

Traditional SOC measurements are time-consuming and laborious. Therefore, using ML to obtain SOC values based on other geochemical parameters could be the direction of future development. SOC concentration depends on various factors such as soil type, climate, topography, and soil management practices. SOC is greatly influenced by vegetation through the organic matter input and consequently, land use change is one of the most important factors which impacts SOC stock increase/decrease. Padarian et al. (2020) found that studies related to SOC cycles and dynamics and the estimation of SOC stocks in different ecosystems, with particular importance on grasslands and topsoil, were one of the main topics that preoccupied researchers who used ML techniques in soil science. Other recent geosciences studies indicate that the application of ML techniques was boosted with the increasing implementation of remote sensing techniques that enabled a significant extension of attribute space in the modeling process (Kovačević et al., 2009; Bouasria et al., 2022). Recently published studies indicate the efficient application of ML in combination with remote sensing and open access data in precision and smart agriculture management (Benos et al., 2021; Mahmood et al., 2022a,b).

Despite the increased number of predictors, the geoscience disciplines still lack in situ collected data, which are essential for model building. As a possible solution to this problem, the concept of transfer learning was suggested. Transfer learning (TL), as a sub-field of ML, adjusts a model developed for one learning task to be used as the starting point for building a model in another learning task (Tan et al., 2018). This technique is already used in environmental studies. Xiong et al. (2021) used TL on different satellite data to identify pre-earthquake ionospheric perturbations and thereby improve the performance of earthquake prediction. The application of TL has already been investigated in soil science. So far, researchers have focused on its application in soil spectroscopy. Liu et al. (2018) used neural network-based TL for soil spectroscopy and evaluated its performance on soil clay content estimation using hyperspectral data. Padarian et al. (2019) applied TL to improve a localized general soil spectral calibration model generated with a continental LUCAS dataset. TL aims to use the knowledge learned from previous tasks, or from large datasets, to solve new related tasks with limited data (Pan and Yang, 2010). Mallavan et al. (2010) introduced a procedure for identifying areas with similar soil forming factors by seeking the smallest taxonomic distance between the factors of the region of interest (lacking in soil data) and other reference areas (with soil data) to export soil-landscape rules between those regions. A similar concept was used to detect Areas of Applicability estimated by the dissimilarity index. This index is based on the minimum multivariate distance calculation to the training data in the multidimensional predictor space (Meyer and Pebesma, 2021; Ludwig et al., 2022). Malone et al. (2016) elaborated a similar approach to extrapolate data across a large mapping area with sparsely measured points, called the recipient site, from a nearby donor site (usually of a smaller spatial extent) with very detailed soil mapping.

The goal of this research is to investigate how an ML model, which estimates SOC values (output) in arable cropland by using geochemical and physical characteristics of the cropland soil samples (inputs), can be improved with a TL approach. Commonly, an ML model is built using the inputs and associated outputs in one (source) domain, and then applied in the other (target) domain where only the inputs are available. As opposed to this procedure, we propose to build the model not only on the inputs and outputs of the source domain but also to utilize the available inputs from the target domain. The instance-based TL approach assumes that the underlying probability distributions of the source and target domain samples, described with geochemical and physical inputs, differ. In this paper, two types of different but related source domains were examined: transfer from a global cropland domain to a local cropland domain and from a global grassland domain to a local cropland domain.

In the first part of section two, a description of the soil samples dataset is given. Basic theoretical foundations of the instance-based TL and Bhattacharyya distance are presented in the second part of section two. In section three, the model training process, the experimental setup, and the discussion of the results are given. Concluding remarks about the contribution and future research are presented in section four.

2 Materials and methods

2.1 LUCAS-based twelve countries soil organic carbon dataset

Following a decision of the European Parliament, the European Statistical Office (EUROSTAT), in close cooperation with the Directorate General responsible for Agriculture and the technical support of the Joint Research Centre (JRC), organizes regular, harmonized surveys across all EU Member States to gather information on land cover and land use. This survey is known as LUCAS - Land Use/Cover Area frame statistical Survey (Jones et al., 2020). In this paper, a data set from the year 2015 was used (the latest available release during the research). The 2015 LUCAS data set consists of 21,857 samples, with SOC measured following the ISO 10694:1995 protocol (Orgiazzi et al., 2018), and ranging from 0.10 to 560.20 g/kg, as shown in Figure 1 (in the original LUCAS data set, SOC is labeled as organic carbon (OC) content). The samples originate from 28 countries (EU member states) and are divided into eight land cover classes: grassland, shrubland, woodland, cropland, bareland, artificial land, wetlands, and water.

FIGURE 1. Spatial distribution of LUCAS samples and SOC values in [g/kg] for the year 2015.

The spatial variability of SOC depends on the climate and the share of land cover (i.e., vegetation type) across the EU. Organic carbon was the highest in the boreal zone, most of the Atlantic zone, and the temperate mountainous zone. It was intermediate in the sub-oceanic zone and lowest in the Mediterranean and sub-continental zones. Wetland, woodland, shrubland, and grassland were the main land cover (LC) classes in zones with the highest SOC values. On the contrary, cropland and bareland were the more common LC classes in zones with the lowest SOC values (Jones et al., 2020). The lowest SOC values under arable land could be due to reduced inputs of organic matter and frequent tillage.

The proposed TL approach was evaluated on the subset of the LUCAS data set which consists of samples that 1) belong to the soil classes with similar geochemical characteristics and dissimilar SOC - cropland (relatively low), and grassland (relatively high); 2) come from the countries with at least 100 samples in each land class (12 countries). The cropland land cover class includes fields of cereal, root crops, industrial crops, dry pulses and vegetables, fodder crops, fruit trees, olive groves, and vineyards, while the grassland land cover class includes fields of grass with and without sparse trees below 1,000 m altitude (Jones et al., 2020). The subset is named LUCAS-12 and its statistical summary is presented in Table 1 and Figure 2.

TABLE 1. LUCAS-12 dataset: summary statistics of SOC values per country and LC class.

FIGURE 2. Spatial distribution of mean SOC values [g/kg], and number of samples aggregated per country in LUCAS-12 data set.

2.1.1 Explanatory variables for soil organic carbon estimation

The analysis of chemical and physical properties represents the core of the LUCAS Soil survey. According to Jones et al. (2020), a composite sample of approximately 500 g was taken from five subsamples collected with a spade at each LUCAS point. The first subsample was collected at the geo-referenced point location; the other four subsamples were collected at a distance of 2 m following the cardinal directions (North, East, South, and West).

In this research, five chemical and physical properties, measured at identical locations, were used as explanatory variables for SOC estimation: Nitrogen (Total Nitrogen concentration in g/kg for $<$ 2 mm soil fraction - labeled on figures as n_tot), pH - H₂O (1:1 Soil-Water Suspension for $<$ 2 mm soil fraction - ph_h20), pH - CaCl₂ (pH, CaCl₂ Suspension for $<$ 2 mm soil fraction - ph_cacl2), Potassium (Extractable in mg/kg for $<$ 2 mm soil fraction - k_ext), and Electrical Conductivity (Saturation Extract in dS/m for $<$ 2 mm soil fraction - ex_satp). All explanatory variables represent interconnected elements or indicators related to plant nutrition. The spatial distribution of explanatory variables is shown in Figure 3.

FIGURE 3. Spatial distribution of explanatory variables in a LUCAS-12 data set.

The Pearson correlation coefficient, calculated between explanatory variables and SOC for the LUCAS-12 data set (Figure 4), showed that SOC values are highly correlated with Nitrogen values in both LC classes $(\geq 0.89)$. This finding is in accordance with Jones et al. (2020). Moreover, the SOC-to-Nitrogen ratio is relatively stable across different soil types. Overall, mineral soils generally have a SOC-to-Nitrogen ratio close to 12:1, while organic-rich soils have a SOC-to-Nitrogen ratio close to 30:1 (Jones et al., 2020). There is a significant correlation between Electrical Conductivity and SOC in the grassland class.
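As an illustration of the statistic reported in Figure 4, the following minimal sketch computes the Pearson correlation coefficient for two variables. The function name `pearson_r` and the numeric values are assumptions made for this example only; they are not LUCAS measurements.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Made-up values in g/kg, chosen only to mimic a roughly constant
# SOC-to-Nitrogen ratio; they are not taken from the LUCAS-12 data set.
n_tot = [1.0, 1.5, 2.0, 3.0]
soc = [12.0, 18.5, 23.0, 37.0]
r = pearson_r(n_tot, soc)
```

A nearly constant SOC-to-Nitrogen ratio, as in these toy values, produces a coefficient close to one.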

FIGURE 4. Correlation matrix as a heatmap between explanatory variables and soil organic carbon per land cover class.

2.2 Methodology

2.2.1 Instance-based non-inductive transfer learning

The proposed model for estimating SOC is designed to use the instance-based non-inductive transfer learning (Yang et al., 2020). We first define the basic concepts of transfer learning. A domain $D = D (X, P_{X})$ consists of two components: a feature space $X$ from which samples $x \in X$ are drawn, and a marginal probability distribution $P_{X}$ that produces each sample $x$ (in ML terminology, explanatory variables are denoted as features). All samples are or can be transformed to vectors of real numbers $x \in R^{n}$. Given a domain of interest, one can perform a task $T$ which consists of two components: a label space $Y$, and an unknown predictive function $f : X \to Y$ for which $y = f (x)$. In a regression task, $y \in Y$ is a real number, while in a classification task it takes one of several discrete values (classification labels). Estimating SOC values from geochemical and physical characteristics of soil samples is a regression task. In a regression task, the real $f$ remains unknown, but we can learn $y = g (x)$ from the available data ${(x_{i}, y_{i} = f (x_{i}))}_{i = 1 \dots N}$, hoping that $g$ will approximate $f$ well outside of the training set. Function $g$ is selected from a predefined family of functions and is completely determined by its parameter vector $θ$ (e.g., if $g (x) = w_{0} + \sum_{i = 1}^{n} w_{i} x_{i}$ then $θ = (w_{0}, w_{1}, \dots, w_{n})$). To learn $g$ from the available data, one needs to find $θ$ which minimizes the prediction error $\sum_{i = 1}^{N} l (x_{i}, y_{i}, θ)$, where the loss function $l (x, y, θ)$ estimates the error for each sample. In a regression task, a commonly used loss function is $l (x, y, θ) = {(g_{θ} (x) - y)}^{2}$. Mathematically, the learning process on a training set can be described as:

θ^{*} = \arg \min_{θ} \sum_{i = 1}^{N} l (x_{i}, y_{i}, θ) (1)

Due to the noisy nature of measurements, a training set often contains different values of y for the same x. Therefore, an unknown f can be interpreted as an expectation E(y|x) defined over the probability distribution P(y|x). Hence, a task $T$ is defined to be $T = T (Y, P_{Y | X})$ .

In the context of transfer learning, there are two domains of interest, source domain $D_{s}$ and target domain $D_{t}$. They are represented with source-labeled and destination-labeled datasets where the labelling process is the outcome of two tasks $T_{s}$ and $T_{t}$: ${(x_{s_{i}}, y_{s_{i}})}_{i = 1 \dots N}$, where $x_{s_{i}} \in X_{s}, y_{s_{i}} \in Y_{s}$, and ${(x_{t_{i}}, y_{t_{i}})}_{i = 1 \dots M}$, where $x_{t_{i}} \in X_{t}, y_{t_{i}} \in Y_{t}$. According to Pan and Yang (2010), transfer learning can be defined as follows:

Definition 1: Given $D_{s}, T_{s}, D_{t}$, and $T_{t}$, transfer learning aims to improve the learning of the predictive function $f_{t}$ from the target domain, using the knowledge in $D_{s}$ and $T_{s}$, where $D_{s} \neq D_{t}$ or $T_{s} \neq T_{t}$.

It follows from Definition 1 that a traditional machine learning setting arises when $D_{s} = D_{t}$ and $T_{s} = T_{t}$. However, the nature of the difference between the domains or between the tasks can be used to categorize different transfer learning settings. We now define the instance-based non-inductive transfer learning:

Definition 2: Let $Y_{s} = Y_{t}$ and $P_{Y | X}^{s} = P_{Y | X}^{t}$ (i.e., $T_{s} = T_{t}$). If $X_{s} = X_{t}$ and $P_{X}^{s} \neq P_{X}^{t}$ (i.e., $D_{s} \neq D_{t}$), the learning setting becomes instance-based and non-inductive.

An instance-based non-inductive setting assumes the same feature and label spaces as well as the same underlying process that maps inputs to outputs in both domains. However, the marginal probability distributions of instances (samples) are different across domains. In this paper, we assume the marginal probability distributions of the observed samples are different across various land cover types. Therefore, this setting can be applied when one tries to predict cropland SOC values using both geochemical + physical and SOC values from grassland samples (source domain), and only geochemical + physical values from cropland samples (target domain). Now, we explain how one can find the optimal parameters of the target prediction model g_t ≈ f_t.

Suppose that $D_{s}$ , $T_{s}$ , and $D_{t}$ are represented with ${(x_{s_{i}}, y_{s_{i}})}_{i = 1 \dots N}$ , and ${(x_{t_{i}})}_{i = 1 \dots M}$ . We would like to find the optimal parameters $θ_{t}^{*}$ of the target task prediction model under the assumption of the instance-based non-inductive setting. Using the empirical risk minimization framework (Vapnik, 1998), we minimize the following expectation:

θ_{t}^{*} = \arg \min_{θ_{t}} E_{(x, y) \sim P_{X, Y}^{t}} [l (x, y, θ_{t})] (2)

where l(x, y, θ_t) is a loss function defined for the target task. Using the definition of mathematical expectation, Eq. 2 becomes:

θ_{t}^{*} = \arg \min_{θ_{t}} \int_{x \in X_{t}} \int_{y \in Y_{t}} l (x, y, θ_{t}) P_{t} (x, y) d x d y (3)

From Definition 2 it follows that $P_{Y | X}^{s} = P_{Y | X}^{t}$, and after using the Bayes’ rule, we obtain $P_{t} (y | x) = \frac{P_{t} (x, y)}{P_{t} (x)} = P_{s} (y | x) = \frac{P_{s} (x, y)}{P_{s} (x)}$, and hence $P_{t} (x, y) = \frac{P_{t} (x)}{P_{s} (x)} P_{s} (x, y)$. Since $X_{s} = X_{t}$ and $Y_{s} = Y_{t}$ (Definition 2), Eq. 3 becomes:

\begin{align} θ_{t}^{*} & = \arg \min_{θ_{t}} \int_{x \in X_{s}} \int_{y \in Y_{s}} l (x, y, θ_{t}) \frac{P_{t} (x)}{P_{s} (x)} P_{s} (x, y) d x d y \\ & = \arg \min_{θ_{t}} E_{(x, y) \sim P_{X, Y}^{s}} \left[ \frac{P_{t} (x)}{P_{s} (x)} l (x, y, θ_{t}) \right] \end{align} (4)

Optimal parameters of the target model cannot be found directly from Eq. 4, since the expectation over the joint distribution of the source population cannot be computed exactly. The best we can do is to apply the empirical approximation on the training data by modifying Eq. 1:

θ_{t}^{*} = \arg \min_{θ_{t}} \sum_{i = 1}^{N} [\frac{P_{t} (x_{s_{i}})}{P_{s} (x_{s_{i}})} l (x_{s_{i}}, y_{s_{i}}, θ_{t})] (5)

Equation 5 suggests why this method is called “instance-based”. Each source domain instance is weighted in the loss function with the ratio $\frac{P_{t} (x)}{P_{s} (x)}$ , meaning that if an instance is more probable to occur in the target domain, then the optimization process pays more attention to it. If the probability ratio is 1 for all source instances, then the loss function takes its standard form described with Eq. 1 – the target prediction model is equal to the model created only on source domain data (classical machine learning setting without transfer of knowledge). The probability ratio can be estimated using the rejection sampling-based method for correcting sample selection bias (Zadrozny, 2004). This method introduces a new binary random variable δ ∈ {0, 1}, which selects whether a sample will be accepted by the source domain or not: P_s(x) = P(x|δ = 1), and P_t(x) = P(x|δ = 0). The ratio can be calculated as follows:

\begin{align} \frac{P_{t} (x)}{P_{s} (x)} & = \frac{P (x | δ = 0)}{P (x | δ = 1)} = \frac{P (x) P (δ = 0 | x) P (δ = 1)}{P (x) P (δ = 1 | x) P (δ = 0)} \\ & = \frac{(1 - P (δ = 1 | x)) P (δ = 1)}{P (δ = 1 | x) P (δ = 0)} \\ & = \frac{P (δ = 1)}{P (δ = 0)} \left( \frac{1}{P (δ = 1 | x)} - 1 \right) \end{align} (6)

Equation 6 suggests that the probability ratio is proportional to $\frac{1}{P (δ = 1 | x)}$ . Hence, one can treat the evaluation of the ratio as a binary classification problem in which a classifier is trained to predict the probability of a sample being from the source domain (P(δ = 1|x)), or the target domain (P(δ = 0|x)).
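To make Eq. 6 concrete, the sketch below evaluates its right-hand side for a single sample. It assumes that the priors P(δ = 1) and P(δ = 0) are estimated from the numbers of source and target samples, which is a choice made for this illustration and is not stated in the text; the function name `density_ratio` is likewise hypothetical.

```python
def density_ratio(p_source_given_x, n_source, n_target):
    """Evaluate the last line of Eq. 6 for one sample.

    p_source_given_x : classifier estimate of P(delta = 1 | x), in (0, 1]
    n_source, n_target : sample counts used here to estimate the priors
        P(delta = 1) and P(delta = 0) (an assumption of this sketch)
    """
    if not 0.0 < p_source_given_x <= 1.0:
        raise ValueError("P(delta = 1 | x) must lie in (0, 1]")
    # P(delta = 1) / P(delta = 0) = (N / (N + M)) / (M / (N + M)) = N / M
    prior_ratio = n_source / n_target
    return prior_ratio * (1.0 / p_source_given_x - 1.0)


# Equal domain sizes: a sample the classifier cannot assign to either
# domain gets ratio 1, a sample that looks like the target gets more.
undecided = density_ratio(0.5, n_source=100, n_target=100)
target_like = density_ratio(0.25, n_source=100, n_target=100)
```

Under these assumptions, a sample whose predicted source probability equals the source prior receives a ratio of one, and the ratio grows as the classifier finds the sample more typical of the target domain.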

2.2.2 Bhattacharyya distance

In this research, Bhattacharyya Distance (Bhattacharyya, 1946) is used to estimate the amount of overlap between the source and target domains (distributions $P_{X}^{s}$ and $P_{X}^{t}$). Let P and Q be two discrete probability distributions over the same domain $X$. The Bhattacharyya distance B(P, Q) measures the similarity between P and Q using the following equation:

B (P, Q) = - \ln \sum_{x \in X} \sqrt{P (x) Q (x)} (7)

Since $\sum_{x \in X} P (x) = 1$ and $\sum_{x \in X} Q (x) = 1$ , if P = Q then B(P, Q) = − ln 1 = 0. If P and Q are orthogonal (i.e. there exists no x for which both distributions assign non-zero probability) then B(P, Q) = − ln 0 = +∞. Distributions with lower overlap produce bigger values for B(P, Q). However, Bhattacharyya distance (BD) is not a metric since it does not obey the triangle inequality.

In our problem setting, ${(x_{s_{i}})}_{i = 1 \dots N}$ and ${(x_{t_{i}})}_{i = 1 \dots M}$ are n-dimensional continuous random variables (geochemical and physical soil properties). Hence, to apply Eq. 7 one needs to discretize the input space as described in Figure 5. The estimated BD between probability distributions depends on the choice of discretization steps: too few rectangles (hypercubes in n-dimensional space) would underestimate BD, because merging cells can only increase the measured overlap, while too many would overestimate it, because with a finite number of samples the two domains rarely share a cell.

FIGURE 5. Discretization of a two-dimensional random variable x = (x₁, x₂). The values of discretization steps d₁ and d₂ determine the size of each rectangle. Here, $B (P_{s}, P_{t}) = - \ln \sum_{i = 1}^{9} \sqrt{p_{i}^{s} p_{i}^{t}}$, where $p_{i}$ is the probability of x belonging to rectangle i (e.g., $p_{3}^{t} = 1 / 6$).
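The discretization described above can be sketched in a few lines. This is a minimal illustration of Eq. 7 on gridded samples, assuming equal-width cells anchored at the origin; the function names and the toy coordinates are hypothetical and do not reproduce the authors' implementation or the points drawn in Figure 5.

```python
import math
from collections import Counter


def cell_probabilities(samples, steps):
    """Empirical probability of each occupied grid cell.

    samples : list of equal-length tuples (one tuple per soil sample)
    steps   : discretization step d_j for every dimension
    """
    cells = Counter(
        tuple(math.floor(value / step) for value, step in zip(sample, steps))
        for sample in samples
    )
    return {cell: count / len(samples) for cell, count in cells.items()}


def bhattacharyya_distance(source, target, steps):
    """Eq. 7 evaluated on the discretized source and target samples."""
    p_s = cell_probabilities(source, steps)
    p_t = cell_probabilities(target, steps)
    # Only cells occupied in both domains contribute to the sum.
    coefficient = sum(
        math.sqrt(p * p_t[cell]) for cell, p in p_s.items() if cell in p_t
    )
    return math.inf if coefficient == 0.0 else -math.log(coefficient)


# Toy two-dimensional samples (made up for this example).
source = [(0.1, 0.1), (0.2, 0.3), (1.1, 0.1), (1.2, 0.4)]
target = [(0.3, 0.2), (0.4, 0.4)]
bd = bhattacharyya_distance(source, target, steps=(1.0, 1.0))
```

With unit steps, half of the source samples share the single cell occupied by the target, so the sum in Eq. 7 equals $\sqrt{0.5}$ and the distance is $0.5 \ln 2$. Changing `steps` changes the result, which is the dependence on discretization noted in the text.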

Apart from BD, there are other popular methods to calculate the statistical distance (or similarity between distributions) such as Mahalanobis distance (Mahalanobis, 1936), Kolmogorov-Smirnov test (Simard and L’Ecuyer, 2011), or Jensen-Shannon divergence (Lin, 1991). However, Mahalanobis distance calculates the distance between a point and a distribution, the Kolmogorov-Smirnov test works with one-dimensional random variables, and Jensen-Shannon divergence requires that, after discretizing the input space, the same hyper-cubes from both distributions cannot be both empty $(p_{i}^{s} = p_{i}^{t} = 0)$ .

2.3 Programming environment

In this research, two programming environments were used: data preprocessing and analyses were conducted using the R software environment (RCoreTeam, 2013); models were built using the Python PyTorch (Paszke et al., 2019) and scikit-learn (Pedregosa et al., 2011) libraries. The code and the datasets used for the experiments can be downloaded from the GitHub repository [SocTransferLearning].

3 Experiments and results

The evaluation of the proposed TL-based arable cropland SOC estimation model has been conducted in a leave-one-country-out procedure on a LUCAS-12 data set. For cropland samples in each country (12 target domains), we built two estimation models: classical ML, and TL. Both models were trained on soil samples obtained by merging data from the remaining 11 countries in two different experimental settings: 1) source domain contained soil samples only from cropland class areas, and 2) source domain contained samples only from grassland class areas.
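The leave-one-country-out procedure can be sketched as a generator over countries. The record layout (dictionaries with the keys `country` and `lc_class`), the country codes, and the SOC values below are assumptions made for this illustration; the actual LUCAS-12 files may be organized differently.

```python
def leave_one_country_out(samples, source_class, target_class="cropland"):
    """Yield (country, source, target) for every country in the data.

    samples : list of dicts with hypothetical keys "country" and "lc_class"
    source  : samples of `source_class` from all other countries
    target  : samples of `target_class` from the held-out country
    """
    countries = sorted({s["country"] for s in samples})
    for held_out in countries:
        source = [
            s for s in samples
            if s["country"] != held_out and s["lc_class"] == source_class
        ]
        target = [
            s for s in samples
            if s["country"] == held_out and s["lc_class"] == target_class
        ]
        yield held_out, source, target


# Made-up records for three countries; SOC values are illustrative only.
samples = [
    {"country": "AT", "lc_class": "cropland", "soc": 15.0},
    {"country": "AT", "lc_class": "grassland", "soc": 30.0},
    {"country": "ES", "lc_class": "cropland", "soc": 10.0},
    {"country": "ES", "lc_class": "grassland", "soc": 20.0},
    {"country": "PL", "lc_class": "cropland", "soc": 12.0},
    {"country": "PL", "lc_class": "grassland", "soc": 25.0},
]
folds = list(leave_one_country_out(samples, source_class="grassland"))
```

Passing `source_class="cropland"` gives the first experimental setting and `source_class="grassland"` the second; in both, the target domain is the cropland class of the held-out country.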

3.1 Training the proposed transfer learning model

The proposed TL model is trained in two phases. In the first phase, a neural network classifier (Aggarwal, 2018) is trained to distinguish between the source and the target domain samples. A two-layer feed-forward network uses a Rectified Linear Unit (ReLU) activation function in each of the five neurons in the hidden layer. The number of hidden neurons is equal to the number of inputs, which is a common choice for models with few inputs. The output neuron applies the sigmoid function. The network is trained to minimize the binary cross-entropy loss in a standard backpropagation procedure (Aggarwal, 2018). Once trained, the network assigns to each source sample the probability of belonging to the source class ($P (δ = 1 | x)$). The assigned probabilities are used in the next phase to modify the mean squared error loss of the regression model as defined in Eq. 5. Note that the ratio $\frac{1}{P (δ = 1 | x)}$, according to Eq. 6, estimates the probability ratio $\frac{P_{t} (x)}{P_{s} (x)}$.
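The first phase can be illustrated with a deliberately simplified stand-in. The sketch below replaces the two-layer network with a one-input logistic regression trained by plain gradient descent on the binary cross-entropy, so it shows only the principle (source samples labeled 1, target samples labeled 0, weights taken as 1/P(δ = 1|x)). The learning rate, the number of epochs, and the toy data are assumptions, not the values used in the experiments.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_domain_classifier(source_x, target_x, lr=0.1, epochs=2000):
    """Fit P(delta = 1 | x) = sigmoid(a + b * x) by gradient descent on the
    binary cross-entropy; source samples are labeled 1, target samples 0."""
    data = [(x, 1.0) for x in source_x] + [(x, 0.0) for x in target_x]
    a, b = 0.0, 0.0
    for _ in range(epochs):
        grad_a = grad_b = 0.0
        for x, label in data:
            err = sigmoid(a + b * x) - label  # d(loss) / d(logit)
            grad_a += err
            grad_b += err * x
        a -= lr * grad_a / len(data)
        b -= lr * grad_b / len(data)
    return a, b


def source_weights(source_x, a, b):
    """Weight 1 / P(delta = 1 | x) for every source sample."""
    return [1.0 / sigmoid(a + b * x) for x in source_x]


# Toy one-dimensional inputs: the target domain is shifted to the right.
source_x = [-1.0, -0.5, 0.0, 0.5, 1.0, 1.5]
target_x = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
a, b = fit_domain_classifier(source_x, target_x)
weights = source_weights(source_x, a, b)
```

In this toy setting the source samples lying closest to the target domain receive the largest weights, which is the behavior the second phase relies on.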

In the second phase, the regression model is trained only with the samples from the source domain, using both geochemical and physical variables as the inputs and related SOC values as the outputs. The model uses a two-layer, feed-forward neural network with five hidden neurons and one linear output. The network is trained in a standard backpropagation procedure using the modified, previously explained, loss function. Optimal hyperparameters (learning rate, momentum, and the number of training epochs) for the classification and regression networks are found in a standard 5-fold cross-validation procedure (Aggarwal, 2018).
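The effect of the modified loss in Eq. 5 can be shown with a small stand-in for the regression network. The sketch below fits a straight line by weighted least squares in closed form instead of training a neural network; the weights are hypothetical values of the kind phase one would supply, and the toy relationship is deliberately non-linear so that reweighting changes the fitted line.

```python
def weighted_squared_error(params, xs, ys, weights):
    """Eq. 5 objective for the line g(x) = w0 + w1 * x."""
    w0, w1 = params
    return sum(w * (w0 + w1 * x - y) ** 2 for w, x, y in zip(weights, xs, ys))


def fit_weighted_line(xs, ys, weights):
    """Minimize the weighted squared error in closed form."""
    total = sum(weights)
    mean_x = sum(w * x for w, x in zip(weights, xs)) / total
    mean_y = sum(w * y for w, y in zip(weights, ys)) / total
    sxx = sum(w * (x - mean_x) ** 2 for w, x in zip(weights, xs))
    sxy = sum(
        w * (x - mean_x) * (y - mean_y) for w, x, y in zip(weights, xs, ys)
    )
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Toy source data with y = x^2, so a line cannot fit every region equally.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.0, 1.0, 4.0, 9.0]

# Classical setting: every ratio is one, which recovers Eq. 1.
classic = fit_weighted_line(xs, ys, [1.0, 1.0, 1.0, 1.0])

# TL setting: hypothetical weights that favor the samples with large x,
# i.e. the region assumed here to be more typical of the target domain.
tl_weights = [1.0, 1.0, 4.0, 4.0]
transfer = fit_weighted_line(xs, ys, tl_weights)
```

With unit weights the fit is the ordinary least-squares line; with the hypothetical weights the line shifts toward the heavily weighted samples, at the cost of a poorer fit elsewhere.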

When performing the experiments, classical ML models are trained using only the second phase of TL training in which the ratio $\frac{P_{t} (x)}{P_{s} (x)}$ is set to one. The schematic display of the training process is shown in Figure 6.

FIGURE 6. Two training phases for the TL model: In phase 1-a, a classifier is trained to distinguish between the source and the destination domain inputs; in phase 1-b, the probabilities of source domain samples belonging to the source domain are calculated ($p_{i}$); in phase 2, a final regression model is trained only on the source domain labeled data, using the modified error function.

3.2 Model evaluation and discussion

The instance-based TL and classical models were compared using the normalized versions of Root Mean Squared Error (NRMSE) and Mean Absolute Error (NMAE), and Coefficient of Determination (R²):

\begin{align} NRMSE & = \frac{1}{\bar{y}} \sqrt{\frac{1}{n} \sum_{i = 1}^{n} {(y_{i} - {\hat{y}}_{i})}^{2}}, \\ NMAE & = \frac{1}{\bar{y} n} \sum_{i = 1}^{n} | y_{i} - {\hat{y}}_{i} |, \\ R^{2} & = 1 - \frac{\sum_{i = 1}^{n} {(y_{i} - {\hat{y}}_{i})}^{2}}{\sum_{i = 1}^{n} {(y_{i} - \bar{y})}^{2}} \end{align} (8)

Due to the quadratic term in the sum, Root Mean Squared Error is more sensitive to outliers (samples in which the difference between the real and the predicted value is large) than Mean Absolute Error. Both measures are normalized over the average value of the target variable (real SOC values in the target domain). Hence, different models trained for the same target domain can be relatively compared. The Coefficient of Determination shows how the trained model improves over the one that always predicts the average value of the target variable. If the trained model is perfect, then R² = 1; if the model always predicts the average value, like one would optimally do without learning, then R² = 0; for values of R² less than zero, the model is worse than one would achieve by always predicting the average value.
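The three measures in Eq. 8 can be computed directly from the observed and predicted values. The following sketch is a plain transcription of the formulas; the function name `regression_metrics` and the toy numbers are assumptions for illustration only.

```python
import math


def regression_metrics(y_true, y_pred):
    """NRMSE, NMAE and R^2 as defined in Eq. 8."""
    n = len(y_true)
    mean_y = sum(y_true) / n
    sq_err = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    abs_err = sum(abs(y - p) for y, p in zip(y_true, y_pred))
    total_var = sum((y - mean_y) ** 2 for y in y_true)
    return {
        "nrmse": math.sqrt(sq_err / n) / mean_y,
        "nmae": abs_err / (n * mean_y),
        "r2": 1.0 - sq_err / total_var,
    }


# A model that always predicts the mean of the observations: R^2 is zero.
mean_model = regression_metrics([1.0, 2.0, 3.0], [2.0, 2.0, 2.0])
# A perfect model: both errors are zero and R^2 is one.
perfect_model = regression_metrics([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
```

The two toy cases reproduce the reference points discussed above: a constant mean prediction gives R² = 0, and a perfect prediction gives R² = 1 with zero error.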

3.2.1 Experimental setting cropland-to-cropland

In this experimental setting, a source domain for each country consisted of the soil samples from the other 11 countries, covering only the cropland LC class. The target domain for each country consisted of the soil samples from its cropland LC class. In this manner, we tested the capability of TL to transfer knowledge from the global to the local geographical scale. The experimental results are shown in Table 2. In most of the considered countries, the TL model provides slightly better results in at least one performance measure. In the case of Austria, Germany, Greece, and Italy, there is no improvement in any of those measures, while in the United Kingdom and Poland all measures are slightly improved. The only measure that is slightly worse is R² for the Czech Republic, indicating that the proposed approach is at least as good as the classical one.

TABLE 2. Comparing classical (C), and transfer learning (T) approach in a Cropland-to-Cropland setting: normalized RMSE and MAE (lower is better), and R² (higher is better).

3.2.2 Experimental setting grassland-to-cropland

In this experimental setting, a source domain for each country consisted of the soil samples from the other 11 countries, covering only the grassland LC class. As in the previous experimental setting, the target domain for each country consisted of the soil samples from its cropland LC class. Hence, we tested the capability of TL to transfer knowledge both from the global to the local geographical scale and between different but related LC classes at the same time. The experimental results are shown in Table 3. An improvement in at least one of the measures was present in all the countries except for Sweden. In the case of Bulgaria, France, Germany, Italy, Poland, Romania, and Spain, all measures are improved. These improvements are considerably larger than in the cropland-to-cropland experimental setting. For Spain the improvement is higher than 10% for each measure.

TABLE 3. Comparison of the classical (C) and transfer learning (T) approaches in the Grassland-to-Cropland setting: normalized RMSE and MAE (lower is better) and R² (higher is better). The results indicate the benefits of the proposed TL approach.
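The way the source and target domains are assembled in the two experimental settings can be sketched as a leave-one-country-out split. The `Sample` record, its field names, and the `split_domains` helper are hypothetical names introduced for illustration; the toy records are not LUCAS data.

```python
from typing import NamedTuple


class Sample(NamedTuple):
    country: str
    land_cover: str   # "cropland" or "grassland"
    features: tuple   # explanatory soil variables (placeholder)
    soc: float        # soil organic carbon value (placeholder)


def split_domains(samples, target_country, source_land_cover):
    """Source: the chosen land cover class in all OTHER countries.
    Target: cropland samples of the target country."""
    source = [s for s in samples
              if s.country != target_country and s.land_cover == source_land_cover]
    target = [s for s in samples
              if s.country == target_country and s.land_cover == "cropland"]
    return source, target


# Toy records for illustration only.
samples = [
    Sample("Spain", "cropland", (6.5,), 12.0),
    Sample("Spain", "grassland", (6.1,), 25.0),
    Sample("France", "cropland", (7.0,), 15.0),
    Sample("France", "grassland", (6.3,), 30.0),
    Sample("Italy", "grassland", (6.8,), 28.0),
]

# Cropland-to-cropland and grassland-to-cropland settings for one target country.
crop_source, crop_target = split_domains(samples, "Spain", "cropland")
grass_source, grass_target = split_domains(samples, "Spain", "grassland")
```

Only the source land cover class changes between the two settings; the target domain is the same cropland set in both.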

3.2.3 Discussion

As can be seen from Tables 2 and 3, the average improvement of the proposed cropland SOC estimation model depends on the type of the source domain. While the improvement from a global-to-local geographical transfer of knowledge is negligible in the Cropland-to-Cropland case, a global-to-local transfer across land cover classes yields a significant improvement in the Grassland-to-Cropland case. In all cases, NRMSE is higher than NMAE, since NMAE is more robust to outliers in the estimation process. All error measures are lower when the regression models are trained on the labeled cropland samples, which is to be expected. However, when trained on the labeled grassland samples using the TL approach, the model comes close to the more natural model trained on the labeled cropland samples. Comparing the classical Cropland-to-Cropland model with the TL Grassland-to-Cropland model gives 0.30 vs. 0.30 for average NRMSE, 0.17 vs. 0.19 for average NMAE, and 0.77 vs. 0.79 for average R².

To explain the improved performance of the TL model compared to its classical counterpart for most target countries, we calculated the Bhattacharyya distance (BD) between the source and the target domain distributions in both experimental settings. Table 4 shows how the distance (overlap) between the distributions affects the achieved improvement. The benefits of TL over the classical approach for a particular country are more pronounced when the distance between the source and the target domain distributions is greater. The transfer of SOC-related knowledge from the grassland to the cropland LC class achieves much better results (the central part of Table 4) than the transfer from the global to the local cropland LC class (the left part of Table 4). This result is expected, since the instance-based TL model can benefit if the source and the target input distributions are different enough that there is something to transfer (see Definition 2 in Section 2.2.1). If the distributions are almost identical, as in the Cropland-to-Cropland setting, then the transferred knowledge is minimal.
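A histogram-based Bhattacharyya distance can be sketched as below. This is a simplified one-dimensional illustration under stated assumptions: the helper names `histogram` and `bhattacharyya_distance` are hypothetical, both domains are binned over the same value range with the same number of bins, and the paper's actual computation (Section 2.2.2) may operate on several variables jointly.

```python
import math


def histogram(values, lo, hi, n_bins):
    """Relative frequencies of `values` over n_bins equal-width bins on [lo, hi]."""
    counts = [0] * n_bins
    width = (hi - lo) / n_bins
    for v in values:
        idx = int((v - lo) / width)
        idx = max(0, min(idx, n_bins - 1))  # clamp out-of-range values to edge bins
        counts[idx] += 1
    total = len(values)
    return [c / total for c in counts]


def bhattacharyya_distance(p, q):
    """BD = -ln(sum_i sqrt(p_i * q_i)) for two discrete distributions on the same bins."""
    bc = sum(math.sqrt(pi * qi) for pi, qi in zip(p, q))
    return math.inf if bc == 0 else -math.log(bc)


# Toy distributions for illustration only.
identical = bhattacharyya_distance([0.5, 0.5, 0.0], [0.5, 0.5, 0.0])
partial = bhattacharyya_distance([0.5, 0.5, 0.0], [0.0, 0.5, 0.5])
disjoint = bhattacharyya_distance([1.0, 0.0], [0.0, 1.0])
```

Identical distributions give a distance of zero, partially overlapping ones a positive finite value, and distributions with no shared bins an infinite distance, which mirrors the overlap argument made in the text.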

TABLE 4. Comparison of the distance between the source and target input distributions for each country with the improvement of the TL model over the classical model. BD_s: the BD between a source domain s ∈ {crop, grass} and the cropland target domain distribution of each country; $M_{T - C}^{s}$: the percentage improvement of performance metric M when a TL model is built on a source domain s; Δ_BD: the increase in BD (BD_grass − BD_crop); Δ_M: the improvement in M of the TL model over the classical model when the transfer is conducted from the grassland source domain, compared to the cropland source domain $(M_{T - C}^{grass} - M_{T - C}^{crop})$.

The right part of Table 4 shows the relationship between the increase in distance for a particular country (Δ_BD = BD_grass − BD_crop) and the increase in the improvement of performance metric M of a TL model over the associated classical model $(Δ_{M} = M_{T - C}^{grass} - M_{T - C}^{crop})$. One can see that higher values of Δ_BD correspond to higher values of Δ_M. This conclusion is supported by the Pearson correlation coefficients between the two quantities: 0.67 for Δ_NRMSE (p-value 0.02), 0.65 for Δ_NMAE (p-value 0.02), and 0.65 for $Δ_{R^{2}}$ (p-value 0.02).
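The correlation check can be sketched with plain Python as follows. The function names are hypothetical, and the toy lists are placeholders rather than values from Table 4. The t statistic assumes the usual two-sided test of zero correlation with n − 2 degrees of freedom; a p-value can then be read from a t distribution (for instance, `scipy.stats.pearsonr` returns both the coefficient and the p-value).

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def t_statistic(r, n):
    """t statistic for testing r against zero, with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1.0 - r * r))


# Placeholder values for illustration only (NOT taken from Table 4).
delta_bd = [1.0, 2.0, 3.0, 4.0]
delta_m = [1.0, 3.0, 2.0, 4.0]
r = pearson_r(delta_bd, delta_m)

# The reported r = 0.67 over 12 countries corresponds to t of about 2.85 on 10 d.o.f.
t_reported = t_statistic(0.67, 12)
```

A t value of about 2.85 with 10 degrees of freedom lies between the two-sided 0.02 and 0.01 critical values, which is consistent with the reported p-value of roughly 0.02.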

Although the same value ranges and discretization steps were used for the corresponding variables when calculating the distances between domains, the distances in different rows of Table 4 cannot be directly compared, because the number of cropland soil samples differs between countries.

4 Conclusion

Considering the importance of SOC in the overall terrestrial ecosystem, its estimation is a topic that has occupied many researchers in the field of soil science. SOC estimation from geochemical and physical soil parameters in arable land is significant both because SOC is steadily reduced by tillage activities and climate change and because it is a vital element of soil quality and fertility. In this study, we did not consider classical ML models by themselves, which are the most frequent topic of recently published works in this area of research, but rather the possibility of upgrading these models using a transfer learning approach. The proposed TL methodology could be used to generate PTFs for target domains with described samples and unknown related PTF outputs if the described samples with known related PTF outputs from a different geographic or similar land class source domain are available. The proposed methodology assumes that the source and the target distributions of samples overlap. In the case of equal distributions, the TL and the classical ML approaches would be the same. If the distributions are totally different, then both the classical ML and the TL approaches would be inappropriate.
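The role of distribution overlap can be illustrated with a generic importance-weighting formulation of instance-based transfer. This is a sketch under assumptions and not necessarily the exact objective defined in Section 2.2.1: the symbols $p_S$, $p_T$, $w$, $\ell$, and $f_\theta$ are introduced here for illustration only.

```latex
% Source samples (x_i, y_i), i = 1..n_S, are reweighted by the density ratio
% between the target and source input distributions (assumed formulation).
\hat{\theta} = \arg\min_{\theta} \frac{1}{n_S} \sum_{i=1}^{n_S}
    w(x_i)\, \ell\big(f_{\theta}(x_i),\, y_i\big),
\qquad
w(x) = \frac{p_T(x)}{p_S(x)}

% Equal distributions: p_S = p_T  =>  w(x) = 1 for all x,
% so the objective reduces to the classical (unweighted) empirical risk.
% Disjoint supports: p_T(x) = 0 wherever p_S(x) > 0  =>  w(x_i) = 0 for every
% source sample, so nothing can be transferred.
```

Under this reading, equal distributions make every weight equal to one and recover the classical model, while non-overlapping distributions leave no source sample with a useful weight.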

The proposed instance-based TL method improved SOC estimation in cropland areas of different target countries by transferring SOC-related knowledge from two global source domains: European cropland and European grassland (both data sets derived from the LUCAS 2015 survey). In both cases, an improvement over the classical ML-based model was evident. However, the benefit of applying TL was more significant when transferring from a different but related land cover class (grassland to cropland), which is in accordance with the starting assumption that the source and the target domain data come from different, but overlapping probability distributions. The effects of TL per particular country were different and could be further analyzed. The analysis should include expert knowledge about specific pedological patterns, climatic factors, and commonly applied agrotechnical practices. Nevertheless, the application of instance-based TL almost always outperformed its classical counterpart, and it could be recommended whenever additional soil data are available.

Instead of transferring knowledge from the global to the local domain, future research will investigate the efficiency of the proposed TL methodology in the inverse direction. A continuation of the study will examine the extrapolation of information from small, densely measured areas to larger, sparsely sampled areas (see Malone et al., 2016). Future research will also include additional covariates, such as climatic data and remote sensing data from the Sentinel satellite missions.

Data availability statement

The code and the datasets used for the experiments in this study can be downloaded from the GitHub repository [SocTransferLearning].

Author contributions

All coauthors PB, MK, and BB contributed to the conception and design of the study. PB organized the database. All coauthors PB, MK, and BB equally contributed to all aspects of the research and manuscript preparation and approved it for publication.

Funding

This research was funded by the Science Fund of the Republic of Serbia - Program for Development of Projects in the Field of Artificial Intelligence, grant number 6527073 (project acronym CERES).

Acknowledgments

The LUCAS topsoil dataset used in this work was made available by the European Commission through the European Soil Data Centre managed by the Joint Research Centre (JRC).

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

References

Aggarwal, C. (2018). Neural networks and deep learning: A textbook. Springer.

Benke, K., Norng, S., Robinson, N., Chia, K., Rees, D., and Hopley, J. (2020). Development of pedotransfer functions by machine learning for prediction of soil electrical conductivity and organic carbon content. Geoderma 366, 114210. doi:10.1016/j.geoderma.2020.114210
