Elsevier

Ecological Modelling

Volume 328, 24 May 2016, Pages 108-118

How important are choice of model selection method and spatial autocorrelation of presence data for distribution modelling by MaxEnt?

https://doi.org/10.1016/j.ecolmodel.2016.02.021

Highlights

  • The default MaxEnt procedure tends to produce overly complex distribution models.

  • Forward variable selection in MaxEnt provides simpler, yet not inferior, models.

  • Adding spatially autocorrelated presence observations hardly improves models.

Abstract

In this paper we explore new options for model selection in the distribution modelling method MaxEnt, offered by the strict maximum likelihood explanation of the method. These options, which are collectively referred to as the ‘alternative MaxEnt procedure’ (aMp), imply manual forward stepwise selection of variables and an F-ratio test to compare nested models. We compared aMp models with models obtained by the default variable transformation and model selection procedures in the popular Maxent.jar software (‘default MaxEnt procedure’; dMp), which invokes ℓ1-regularisation. We used swamp-forest presence in SE Norway as the target for our modelling. Eleven terrain variables derived from a LiDAR-based digital elevation map, making up three groups of internally correlated variables, were used as predictors in the models. An independently collected presence/absence ‘test’ data set was used for evaluation of models. Although dMp models were much more complex than aMp models, with up to 240 variables derived from the original 11, compared to a maximum number of derived variables in aMp models of 14, no significant difference in performance between dMp and aMp models was found. This indicates that the regularisation procedure is not the main reason for the good performance of MaxEnt (as implemented in Maxent.jar) in comparative studies. Advantages of the more flexible aMp procedure, which produces simpler models, are discussed. Effects of spatial autocorrelation in the response variable on modelling results were addressed by obtaining parallel MaxEnt models for two response variables that differ in the number of presence observations: the centroid of the 121 swamp forests and the 7175 grid cells, each 5 m × 5 m, in which swamp forest was recorded as present. Our results are not conclusive with respect to the effect of spatial autocorrelation in the response variable on the predictive performance of distribution models by MaxEnt.

Introduction

Over the last 15 years distribution modelling (DM) has become established as a discipline in its own right within ecology and biogeography (Guisan and Zimmermann, 2000, Austin, 2007, Franklin, 2009, Peterson et al., 2011). A wide spectrum of statistical methods and software implementations has been developed for DM (Guisan and Zimmermann, 2000, Elith et al., 2006, Franklin, 2009), among which maximum entropy (MaxEnt) modelling by the free, user-friendly Maxent.jar software (Phillips et al., 2004, Phillips et al., 2006, Phillips, 2011) is currently among the most popular (Yackulic et al., 2013, Merow et al., 2013).

MaxEnt's popularity rests on consistent ranking among the best methods in comparative studies (Elith et al., 2006, Hernandez et al., 2006, Phillips et al., 2006, Wisz et al., 2008, Mateo et al., 2010, Aguirre-Gutiérrez et al., 2013). Nevertheless, concern has been expressed about MaxEnt's susceptibility to spatial autocorrelation in the response variable, i.e., a sampling procedure giving rise to unequal inclusion probabilities because of patchy distribution of presence observations (Veloz, 2009, Anderson and Gonzalez, 2011, Merckx et al., 2011). How spatial autocorrelation affects the performance of distribution models is, however, still not well known or understood (Dormann et al., 2007, Santika and Hutchinson, 2009, Crase et al., 2012). Furthermore, concern has been expressed about MaxEnt's tendency to overfit models to the data (Raes and ter Steege, 2007, Merckx et al., 2011, Halvorsen, 2013, Halvorsen et al., 2015); a model is commonly regarded as overfitted to the data when a simpler model with better predictive performance exists (Guisan and Thuiller, 2005, Merckx et al., 2011). However, the degree of overfitting is dependent on DM purpose (Jiménez-Valverde et al., 2008, Stokland et al., 2011, Halvorsen, 2012). In this context it is necessary to distinguish between spatial prediction modelling (SPM – optimising the fit between model predictions and the true distribution of the modelled target's performance) and ecological response modelling (ERM – finding and understanding general patterns in the overall ecological response of the modelled target to supplied explanatory variables).
Halvorsen (2012) defined three types of overfitting: Type I, that a more complex model has lower predictive performance on independent data than a simpler model; Type II, that a more complex model does not have significantly better predictive performance on independent data than a simpler model; and Type III, that a more complex model with higher predictive performance on independent data than a simpler model fails to fit realistic overall ecological response curves. Type III is relevant for ERM purposes only while Types I and II apply to models in general.

The complexity of the final model is determined by the choices of model selection procedure, regularisation method, and strictness of the criterion used to compare alternative models (Reineking and Schröder, 2006). MaxEnt, as implemented in Maxent.jar software, was described by Phillips et al. (2006) and Phillips and Dudík (2008) as a machine-learning method with model selection by the shrinkage method referred to as ℓ1-regularisation. While this procedure is often regarded as one of the main reasons for MaxEnt's good performance (e.g., Elith et al., 2006, Phillips and Dudík, 2008, Elith et al., 2011), there are also indications that MaxEnt models created by this default method of Maxent.jar (hereafter referred to as the default, or standard, MaxEnt practice, dMp), tend to be very complex in terms of number of parameters (Anderson and Gonzalez, 2011, Warren and Seifert, 2011, Halvorsen, 2013). Accordingly, such models may be overfit to the data (Raes and ter Steege, 2007, Merckx et al., 2011, Halvorsen et al., 2015) regardless of modelling purpose and type of overfitting addressed (Halvorsen, 2013).

Recent treatises, which demonstrate that MaxEnt can be understood as a Poisson point process (Renner and Warton, 2013) or derived from general principles of maximum likelihood estimation (Halvorsen, 2013, Halvorsen et al., 2015), have opened the way for an alternative MaxEnt modelling practice (aMp), in which ℓ1-regularisation is replaced by standard model selection tools such as simple forward subset selection of variables (e.g., Hastie et al., 2009). Sequential addition of variables may offer the user full control over model complexity (Halvorsen et al., 2015) and provide a key to avoidance of overfitting. Comprehensive comparisons between dMp and aMp approaches, and tuning of settings for the aMp, have, however, still not been performed.
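The idea of forward subset selection with an F-ratio test between nested models can be illustrated by the minimal sketch below. It is not the aMp implementation: ordinary least squares stands in for the MaxEnt fit, and the function name `forward_select`, the critical value `f_crit` and the toy data are assumptions introduced for illustration only.

```python
def forward_select(X, y, f_crit=4.0):
    """Forward stepwise selection with an F-ratio test between nested
    least-squares models (a stand-in for the MaxEnt fit, by assumption).

    X maps a variable name to a list of values; y is the response.
    f_crit is a hypothetical critical F value; in practice it would be
    derived from the chosen alpha and the degrees of freedom.
    Returns a list of (name, F-ratio) in order of inclusion.
    """
    n = len(y)

    def mean(v):
        return sum(v) / len(v)

    def dot(a, b):
        return sum(p * q for p, q in zip(a, b))

    # Centre response and candidates so the intercept is handled implicitly.
    y_mean = mean(y)
    resid = [v - y_mean for v in y]
    cands = {}
    for name, col in X.items():
        m = mean(col)
        cands[name] = [v - m for v in col]

    selected = []
    rss = dot(resid, resid)
    while cands:
        # Find the candidate giving the largest drop in residual sum of squares.
        best = None
        for name, col in cands.items():
            norm = dot(col, col)
            if norm < 1e-12:  # constant or collinear with selected variables
                continue
            gain = dot(col, resid) ** 2 / norm
            if best is None or gain > best[1]:
                best = (name, gain)
        if best is None:
            break
        name, gain = best
        df_resid = n - (len(selected) + 2)  # intercept + selected + candidate
        if df_resid <= 0:
            break
        new_rss = rss - gain
        f_ratio = gain / (new_rss / df_resid) if new_rss > 0 else float("inf")
        if f_ratio < f_crit:  # the larger nested model is not justified
            break
        # Accept the variable and remove its contribution from the residual.
        q = cands.pop(name)
        qq = dot(q, q)
        b = dot(q, resid) / qq
        resid = [r - b * v for r, v in zip(resid, q)]
        # Orthogonalise remaining candidates against the accepted variable.
        for k in list(cands):
            c = dot(cands[k], q) / qq
            cands[k] = [u - c * v for u, v in zip(cands[k], q)]
        selected.append((name, f_ratio))
        rss = new_rss
    return selected


# Toy data (made up): y depends on x1 only; "const" carries no information.
n = 20
x1 = [float(i) for i in range(n)]
x2 = [1.0 if i % 2 == 0 else -1.0 for i in range(n)]
const = [5.0] * n
noise = [0.1 * ((i * 7) % 5 - 2) for i in range(n)]
y = [2.0 * a + e for a, e in zip(x1, noise)]
steps = forward_select({"x1": x1, "x2": x2, "const": const}, y)
print(steps)
```

Raising `f_crit` makes the criterion stricter and yields simpler models, which is the kind of control over model complexity that sequential addition of variables offers.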

The main aims of this study are to test the hypotheses (1) that the default MaxEnt procedure (dMp) does not result in overfitted models, and (2) that dMp and aMp models do not differ in the degree of overfitting. These aims were accomplished by model evaluation using a dataset of true presence or absence (P/A) data for the modelled target, collected independently of the data used to train the model, as strongly recommended by, e.g., Austin (2007), Edvardsen et al. (2011), Peterson et al. (2011) and Halvorsen (2012). Access to independent test data ensured that evaluation results were not influenced by the sampling design, e.g., a bias towards easily accessible sites (Pearson et al., 2007, Wollan et al., 2008). Furthermore, two additional hypotheses were tested by use of the independent test data: (3) that the rankings of models based upon performance statistics calculated from training and independent test data sets do not differ; and (4) that spatial autocorrelation in the response variable does not influence the predictive performance of MaxEnt models.
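Evaluation against independent presence/absence data is commonly summarised by the area under the ROC curve (AUC), which equals the probability that a randomly chosen presence site receives a higher predicted value than a randomly chosen absence site. The sketch below shows this rank-based calculation and a simple comparison of model rankings, as in hypothesis (3). The function names, model labels and all numbers are hypothetical and serve only as illustration.

```python
def auc(pred_presence, pred_absence):
    """AUC as the share of presence/absence pairs in which the presence
    site has the higher prediction; ties count as one half."""
    if not pred_presence or not pred_absence:
        raise ValueError("need at least one presence and one absence")
    wins = 0.0
    for p in pred_presence:
        for a in pred_absence:
            if p > a:
                wins += 1.0
            elif p == a:
                wins += 0.5
    return wins / (len(pred_presence) * len(pred_absence))


def rank_models(auc_by_model):
    """Model names ordered from highest to lowest AUC."""
    return sorted(auc_by_model, key=auc_by_model.get, reverse=True)


# Made-up predictions at independent test sites.
test_presence = [0.9, 0.7, 0.4]
test_absence = [0.6, 0.3, 0.2, 0.1]
print(auc(test_presence, test_absence))

# Made-up AUC values for three hypothetical models.
train_auc = {"m1": 0.95, "m2": 0.90, "m3": 0.85}
test_auc = {"m1": 0.78, "m2": 0.82, "m3": 0.80}
same_ranking = rank_models(train_auc) == rank_models(test_auc)
print(same_ranking)
```

The pairwise loop is quadratic in the number of sites; for large test sets a rank-sum formulation would be preferable.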

We tested these hypotheses using data for boreal swamp forests in SE Norway. Swamp forests occur under favourable hydrological and topographical conditions, such as flat depressions in the boreal forest landscape with ample water supplies (Økland et al., 2001, Rydin and Jeglum, 2013). Nutrient-rich swamp forests have considerable conservation interest (Ohlson et al., 1997, Hörnberg et al., 1998).

Section snippets

Study area

Data for this study were collected in the Østmarka Nature Reserve (59°50′ N, 11°02′ E, 190–368 m a.s.l.), located in the southern boreal zone of SE Norway (Moen, 1999); see Økland et al. (2001) for a detailed description. The topography of the 8 km² study area is dominated by main ridges and valleys in the N-S direction, which are dissected by minor valleys resulting in a broken topography with structures on several scales. The study area was tessellated into 5 m × 5 m grid cells to address the

Correlation structure of environmental explanatory variables

The 11 EVs made up three groups of strongly correlated variables, with 1, 7 and 3 variables, respectively (Fig. 2): (1) Slope, which had no correlation coefficient |τ| > 0.2 with any other variable. (2) Topographic position, consisting of the three TPI indices, the three curvature indices and Flow accumulation. TPI3, which was strongly correlated (|τ| > 0.4) with all other variables of the group, was the ‘core variable’ of this group. (3) Terrain ruggedness, consisting of the three VRM indices.
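A grouping of this kind can be sketched as follows, assuming that τ denotes Kendall's rank correlation coefficient and that variables are placed in the same group whenever they are linked by a chain of pairwise |τ| values above a threshold. Both the linkage rule and the toy values are assumptions made for illustration; they are not taken from the study.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall's tau-b, computed from all pairs of observations."""
    conc = disc = tie_x = tie_y = 0
    for (x1, y1), (x2, y2) in combinations(list(zip(x, y)), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both variables: ignored
        elif dx == 0:
            tie_x += 1
        elif dy == 0:
            tie_y += 1
        elif dx * dy > 0:
            conc += 1
        else:
            disc += 1
    denom = sqrt((conc + disc + tie_x) * (conc + disc + tie_y))
    return (conc - disc) / denom if denom else 0.0


def correlation_groups(variables, threshold=0.4):
    """Group variable names linked by |tau| > threshold (single linkage)."""
    names = list(variables)
    group_of = {name: {name} for name in names}
    for a, b in combinations(names, 2):
        if abs(kendall_tau_b(variables[a], variables[b])) > threshold:
            merged = group_of[a] | group_of[b]
            for name in merged:
                group_of[name] = merged
    groups = []
    for name in names:
        if group_of[name] not in groups:
            groups.append(group_of[name])
    return [sorted(g) for g in groups]


# Made-up values for three terrain variables at five grid cells.
toy = {
    "TPI3": [1, 2, 3, 4, 5],
    "Curvature": [2, 4, 6, 8, 10],
    "Slope": [3, 1, 5, 2, 4],
}
groups = correlation_groups(toy, threshold=0.4)
print(groups)
```

With single linkage, a variable that correlates strongly with many others, as described for TPI3, tends to tie its group together.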

Dependence of test AUC on MaxEnt options and settings, and on choice of response variable

Test

Overfitting of MaxEnt models

We find consistent differences in the number of environmental variables (EVs) and derived variables (DVs) in MaxEnt models that differ only with respect to strictness of the criterion for comparison between nested models [lambda (λ) values in dMp models and alpha (α) values in aMp models]. Most notably, dMp models obtained by weak regularisation (low λ value) included numerous DVs while at the same time having the lowest test-AUC values. This substantiates results of previous studies which indicate that

Acknowledgements

We thank Vegar Bakkestuen, Lars Erikstad, Vegard Lien and Hans Ole Ørka for assistance with ArcGIS, R and ALS; Anders Bryn and Lars Østbye Hemsing for help with MaxEnt-related questions; and Ole Martin Bollandsås and Christian Bianchi Strømme for help with field work.

References (80)

  • B. Reineking et al.

    Constrain to perform: regularization of habitat models

    Ecol. Model.

    (2006)
  • T. Santika et al.

    The effect of species response form on species distribution model prediction and inference

    Ecol. Model.

    (2009)
  • U. Segerström et al.

    Disturbance history of a swamp forest refuge in northern Sweden

    Biol. Conserv.

    (1994)
  • J.N. Stokland et al.

    Species distribution modelling – effect of design and sample size of pseudo-absence observations

    Ecol. Model.

    (2011)
  • J. Aguirre-Gutiérrez et al.

    Fit-for-purpose: species distribution model performance depends on evaluation criteria - Dutch hoverflies as a case study

    PLoS One

    (2013)
  • Anonymous, 2008. ArcGIS, ed. 9.3. ESRI, Redlands,...
  • M.P. Austin

    Vegetation and environment: discontinuities and continuities

  • V. Braunisch et al.

    Selecting from correlated climate variables: a major source of uncertainty for predicting species distributions under climate change

    Ecography

    (2013)
  • P. Burrough

    Spatial aspects of ecological data

  • D.S. Chapman

    Weak climatic associations among British plant distributions

    Global Ecol. Biogeogr.

    (2010)
  • B. Crase et al.

    A new method for dealing with residual spatial autocorrelation in species distribution models

    Ecography

    (2012)
  • J.W. Dirksen

    Modelling presence of swamp forest and forest dwelling birds in a boreal forest reserve using airborne laser scanning

    (2013)
  • C. Dormann

    Modelling species’ distributions

  • C.F. Dormann et al.

    Collinearity: a review of methods to deal with it and a simulation study evaluating their performance

    Ecography

    (2013)
  • C.F. Dormann et al.

    Methods to account for spatial autocorrelation in the analysis of species distributional data: a review

    Ecography

    (2007)
  • A. Edvardsen et al.

    A fine-grained spatial prediction model for the red-listed vascular plant Scorzonera humilis

    Nordic J. Bot.

    (2011)
  • J. Elith et al.

    Novel methods improve prediction of species’ distributions from occurrence data

    Ecography

    (2006)
  • J. Elith et al.

    A statistical explanation of MaxEnt for ecologists

    Diversity Distributions

    (2011)
  • L. Erikstad et al.

    Impact of scale and quality of digital terrain models on predictability of seabed terrain types

    Marine Geodesy

    (2013)
  • J. Franklin

    Mapping species distributions: spatial inference and prediction

    (2009)
  • M. Gogol-Prokurat

    Predicting habitat suitability for rare plants at local spatial scales using a species distribution model

    Ecol. Appl.

    (2011)
  • A. Guisan et al.

    Predicting species distribution: offering more than simple habitat models

    Ecol. Lett.

    (2005)
  • R. Halvorsen

    A gradient analytic perspective on distribution modelling

    Sommerfeltia

    (2012)
  • R. Halvorsen

    A strict maximum likelihood explanation of MaxEnt, and some implications for distribution modelling

    Sommerfeltia

    (2013)
  • R. Halvorsen et al.

    Opportunities for improved distribution modelling practice via a strict maximum likelihood interpretation of MaxEnt

    Ecography

    (2015)
  • J.A. Hanley et al.

    The meaning and use of the area under a Receiver Operating Characteristic (ROC) curve

    Radiology

    (1982)
  • T.J. Hastie et al.

    The Elements of Statistical Learning

    (2009)
  • P.A. Hernandez et al.

    The effect of sample size and species characteristics on performance of different species distribution modelling methods

    Ecography

    (2006)
  • R.J. Hijmans et al.

    dismo: Species Distribution Modelling package Version 1.0-12

    (2015)
  • G. Hörnberg et al.

    Boreal swamp forests

    Bioscience

    (1998)
    1

    Present address: Vestmarkavegen 1672, NO-2233 Vestmarka, Norway.
