How important are choice of model selection method and spatial autocorrelation of presence data for distribution modelling by MaxEnt?
Introduction
Over the last 15 years distribution modelling (DM) has become established as a discipline in its own right within ecology and biogeography (Guisan and Zimmermann, 2000, Austin, 2007, Franklin, 2009, Peterson et al., 2011). A wide spectrum of statistical methods and software implementations has been developed for DM (Guisan and Zimmermann, 2000, Elith et al., 2006, Franklin, 2009), among which maximum entropy (MaxEnt) modelling by the free, user-friendly Maxent.jar software (Phillips et al., 2004, Phillips et al., 2006, Phillips, 2011) is currently among the most popular (Yackulic et al., 2013, Merow et al., 2013).
MaxEnt's popularity rests on its consistent ranking among the best methods in comparative studies (Elith et al., 2006, Hernandez et al., 2006, Phillips et al., 2006, Wisz et al., 2008, Mateo et al., 2010, Aguirre-Gutiérrez et al., 2013). Nevertheless, concern has been expressed about MaxEnt's susceptibility to spatial autocorrelation in the response variable, i.e., a sampling procedure giving rise to unequal inclusion probabilities because of patchy distribution of presence observations (Veloz, 2009, Anderson and Gonzalez, 2011, Merckx et al., 2011). How spatial autocorrelation affects the performance of distribution models is, however, still not well understood (Dormann et al., 2007, Santika and Hutchinson, 2009, Crase et al., 2012). Furthermore, concern has been expressed about MaxEnt's tendency to overfit models to the data (Raes and ter Steege, 2007, Merckx et al., 2011, Halvorsen, 2013, Halvorsen et al., 2015); a model is commonly regarded as overfitted to the data when a simpler model with better predictive performance exists (Guisan and Thuiller, 2005, Merckx et al., 2011). However, the degree of overfitting depends on the purpose of DM (Jiménez-Valverde et al., 2008, Stokland et al., 2011, Halvorsen, 2012). In this context it is necessary to distinguish between spatial prediction modelling (SPM – optimising the fit between model predictions and the true distribution of the modelled target's performance) and ecological response modelling (ERM – finding and understanding general patterns in the overall ecological response of the modelled target to supplied explanatory variables).
Halvorsen (2012) defined three types of overfitting: Type I, that a more complex model has lower predictive performance on independent data than a simpler model; Type II, that a more complex model does not have significantly better predictive performance on independent data than a simpler model; and Type III, that a more complex model with higher predictive performance on independent data than a simpler model fails to fit realistic overall ecological response curves. Type III is relevant for ERM purposes only while Types I and II apply to models in general.
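The three definitions can be read as a decision rule for a pair of nested models. The following Python sketch is only an illustration of that reading: the function name, its arguments, and the choice to report the most severe applicable type first are assumptions made here, and the judgement of whether response curves are realistic is supplied by the user as a boolean rather than computed.

```python
from typing import Optional


def overfitting_type(
    perf_simple: float,
    perf_complex: float,
    significantly_better: bool,
    realistic_response_curves: bool,
) -> Optional[str]:
    """Classify the more complex of two models (hypothetical helper).

    perf_simple / perf_complex: predictive performance on independent data
        (for example test AUC) of the simpler and the more complex model.
    significantly_better: outcome of a user-chosen significance test of
        whether the complex model outperforms the simpler one.
    realistic_response_curves: expert judgement, relevant for ERM only.
    """
    # Type I: the complex model predicts worse than the simpler model.
    if perf_complex < perf_simple:
        return "Type I"
    # Type II: the complex model is not significantly better.
    if not significantly_better:
        return "Type II"
    # Type III: better predictions, but unrealistic response curves.
    if not realistic_response_curves:
        return "Type III"
    return None  # no overfitting under any of the three definitions


if __name__ == "__main__":
    print(overfitting_type(0.80, 0.75, False, True))
    print(overfitting_type(0.80, 0.90, True, False))
```

For SPM purposes the last argument can simply be set to `True`, since Type III applies to ERM only.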
The complexity of the final model is determined by the choices of model selection procedure, regularisation method, and strictness of the criterion used to compare alternative models (Reineking and Schröder, 2006). MaxEnt, as implemented in the Maxent.jar software, was described by Phillips et al. (2006) and Phillips and Dudík (2008) as a machine-learning method with model selection by the shrinkage method referred to as ℓ1-regularisation. While this procedure is often regarded as one of the main reasons for MaxEnt's good performance (e.g., Elith et al., 2006, Phillips and Dudík, 2008, Elith et al., 2011), there are also indications that MaxEnt models created by this default method of Maxent.jar (hereafter referred to as the default, or standard, MaxEnt practice, dMp) tend to be very complex in terms of number of parameters (Anderson and Gonzalez, 2011, Warren and Seifert, 2011, Halvorsen, 2013). Accordingly, such models may be overfitted to the data (Raes and ter Steege, 2007, Merckx et al., 2011, Halvorsen et al., 2015) regardless of modelling purpose and type of overfitting addressed (Halvorsen, 2013).
Recent treatises, which demonstrate that MaxEnt can be understood as a Poisson point process (Renner and Warton, 2013) or derived from general principles of maximum likelihood estimation (Halvorsen, 2013, Halvorsen et al., 2015), have opened the way for an alternative MaxEnt modelling practice (aMp), by which ℓ1-regularisation is replaced by standard model selection tools such as simple forward subset selection of variables (e.g., Hastie et al., 2009). Sequential addition of variables may offer the user full control over model complexity (Halvorsen et al., 2015) and provide a key to avoidance of overfitting. Comprehensive comparisons between dMp and aMp approaches, and tuning of settings for the aMp, have, however, still not been performed.
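Forward subset selection, in its generic form, can be sketched in a few lines of Python. This is a minimal illustration and not the aMp implementation: the `score` callable, the `min_gain` stopping threshold, and the variable names and contribution values in the toy example are all assumptions introduced here. In practice the score would come from fitting a model to the selected variables, and the stopping rule would be a formal test between nested models.

```python
from typing import Callable, List, Sequence, Tuple


def forward_select(
    candidates: Sequence[str],
    score: Callable[[Tuple[str, ...]], float],
    min_gain: float,
) -> List[str]:
    """Greedy forward selection (generic sketch).

    At each step the candidate giving the largest increase in `score`
    is added; selection stops when the best increase is below `min_gain`.
    """
    selected: List[str] = []
    remaining = list(candidates)
    best = score(tuple(selected))
    while remaining:
        # Gain in score from adding each remaining variable on its own.
        gains = {v: score(tuple(selected + [v])) - best for v in remaining}
        top = max(remaining, key=lambda v: gains[v])
        if gains[top] < min_gain:
            break  # stricter criterion (larger min_gain) gives simpler models
        selected.append(top)
        remaining.remove(top)
        best = score(tuple(selected))
    return selected


# Toy score: made-up additive contributions, for illustration only.
TOY_CONTRIBUTION = {"slope": 0.30, "tpi": 0.20, "vrm": 0.01}


def toy_score(variables: Tuple[str, ...]) -> float:
    return sum(TOY_CONTRIBUTION[v] for v in variables)


if __name__ == "__main__":
    print(forward_select(["vrm", "slope", "tpi"], toy_score, min_gain=0.05))
```

Raising `min_gain` in this sketch plays the role of a stricter criterion for comparing nested models, which is how sequential addition gives direct control over model complexity.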
The main aims of this study are to test the hypotheses (1) that the default MaxEnt procedure (dMp) does not result in overfitted models, and (2) that dMp and aMp models do not differ in the degree of overfitting. These aims were accomplished by model evaluation using a dataset of true presence or absence (P/A) data for the modelled target, collected independently of the data used to train the model, as strongly recommended by, e.g., Austin (2007), Edvardsen et al. (2011), Peterson et al. (2011) and Halvorsen (2012). Access to independent test data meant that the sampling design, e.g., a bias towards easily accessible sites (Pearson et al., 2007, Wollan et al., 2008), did not influence the evaluation results. Furthermore, two additional hypotheses were tested by use of the independent test data: (3) that the ranking of models based upon performance statistics calculated from training and independent test data sets does not differ; and (4) that spatial autocorrelation in the response variable does not influence the predictive performance of MaxEnt models.
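One performance statistic that can be calculated from independent P/A data is the area under the ROC curve (AUC), reported in this study as test AUC. The Python sketch below uses the standard rank-based (Mann-Whitney) formulation, in which AUC is the proportion of presence/absence pairs where the presence site receives the higher prediction, with ties counted as one half. The function name and the example values are hypothetical.

```python
from typing import Sequence


def auc(predictions: Sequence[float], observed: Sequence[int]) -> float:
    """AUC from model predictions and observed presence (1) / absence (0)."""
    presences = [p for p, o in zip(predictions, observed) if o == 1]
    absences = [p for p, o in zip(predictions, observed) if o == 0]
    if not presences or not absences:
        raise ValueError("need at least one presence and one absence")
    # Compare every presence prediction with every absence prediction.
    wins = 0.0
    for p in presences:
        for a in absences:
            if p > a:
                wins += 1.0
            elif p == a:
                wins += 0.5
    return wins / (len(presences) * len(absences))


if __name__ == "__main__":
    # Made-up predictions for four test cells.
    print(auc([0.9, 0.8, 0.3, 0.2], [1, 0, 1, 0]))
```

The pairwise loop is quadratic in the number of test cells; for large test sets a rank-based implementation would be preferable, but the result is the same.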
We accomplished these aims by using data for boreal swamp forests in SE Norway. Swamp forests occur under favourable hydrological and topographical conditions, such as flat depressions in the boreal forest landscape with ample water supplies (Økland et al., 2001, Rydin and Jeglum, 2013). Nutrient-rich swamp forests have considerable conservation interest (Ohlson et al., 1997, Hörnberg et al., 1998).
Study area
Data for this study were collected in the Østmarka Nature Reserve (59°50′ N, 11°02′ E, 190–368 m.a.s.l.), located in the southern boreal zone of SE Norway (Moen, 1999); see Økland et al. (2001) for detailed description. The topography of the 8 km2 study area is dominated by main ridges and valleys in the N-S direction, which are dissected by minor valleys resulting in a broken topography with structures on several scales. The study area was tessellated into 5 m × 5 m grid cells to address the
Correlation structure of environmental explanatory variables
The 11 EVs made up three groups of strongly correlated variables, with 1, 7 and 3 variables, respectively (Fig. 2): (1) Slope, which had no correlation coefficient |τ| > 0.2 with any other variable. (2) Topographic position, consisting of the three TPI indices, the three curvature indices and Flow accumulation. TPI3, which was strongly correlated (|τ| > 0.4) with all other variables of the group, was the ‘core variable’ of this group. (3) Terrain ruggedness, consisting of the three VRM indices.
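As an illustration of how groups of correlated variables could be identified from pairwise Kendall's τ, the Python sketch below computes τ (the tau-a variant, without tie correction) and merges variables whose |τ| exceeds a threshold. This is one simple approach and not necessarily the procedure behind Fig. 2; the function names, the threshold and the toy values are assumptions.

```python
from itertools import combinations
from typing import Dict, List, Sequence


def kendall_tau_a(x: Sequence[float], y: Sequence[float]) -> float:
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    n = len(x)
    s = 0
    for i, j in combinations(range(n), 2):
        prod = (x[i] - x[j]) * (y[i] - y[j])
        s += (prod > 0) - (prod < 0)  # +1 concordant, -1 discordant, 0 tie
    return s / (n * (n - 1) / 2)


def correlated_groups(
    data: Dict[str, Sequence[float]], threshold: float
) -> List[List[str]]:
    """Group variables linked by |tau| > threshold (single linkage)."""
    names = list(data)
    group_of = {name: {name} for name in names}
    for a, b in combinations(names, 2):
        if abs(kendall_tau_a(data[a], data[b])) > threshold:
            merged = group_of[a] | group_of[b]
            for member in merged:
                group_of[member] = merged
    groups: List[List[str]] = []
    for name in names:
        members = sorted(group_of[name])
        if members not in groups:
            groups.append(members)
    return groups


if __name__ == "__main__":
    # Made-up values for five grid cells; names are placeholders.
    toy = {
        "slope": [3, 5, 1, 4, 2],
        "tpi1": [1, 2, 3, 5, 4],
        "tpi3": [1, 2, 3, 4, 5],
    }
    print(correlated_groups(toy, threshold=0.4))
```

With real raster data containing many tied values, a tie-corrected coefficient (tau-b) would be the more appropriate choice.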
Dependence of test AUC on MaxEnt options and settings, and on choice of response variable
Test
Overfitting of MaxEnt models
We find consistent differences in the number of environmental variables (EVs) and derived variables (DVs) in MaxEnt models that differ only with respect to strictness of the criterion for comparison between nested models [lambda (λ) and alpha (α) values, respectively, in dMp and aMp models]. Most notably, dMp models obtained by weak regularisation (low λ value) included numerous DVs while at the same time having the lowest test-AUC values. This substantiates results of previous studies which indicate that
Acknowledgements
We thank Vegar Bakkestuen, Lars Erikstad, Vegard Lien and Hans Ole Ørka for assistance with ArcGIS, R and ALS; Anders Bryn and Lars Østbye Hemsing with MaxEnt-related questions; and Ole Martin Bollandsås and Christian Bianchi Strømme for help with field work.
References (80)
- et al. Species-specific tuning increases robustness to sampling bias in models of species distributions: an implementation with Maxent. Ecol. Model. (2011)
- Species distribution models and ecological theory: a critical assessment and some possible new approaches. Ecol. Model. (2007)
- et al. Evaluation of statistical models used for predicting plant species distributions: role of artificial data and theory. Ecol. Model. (2006)
- et al. Predictive habitat distribution models in ecology. Ecol. Model. (2000)
- et al. MIAT: Modular R-wrappers for flexible implementation of MaxEnt distribution modelling. Ecol. Informatics (2015)
- et al. Null models reveal preferential sampling, spatial autocorrelation and overfitting in habitat suitability modelling. Ecol. Model. (2011)
- et al. Habitat qualities versus long-term continuity as determinants of biodiversity in boreal old-growth swamp forests. Biol. Conserv. (1997)
- et al. Continuum theory revisited: what shape are species responses along ecological gradients? Ecol. Model. (2002)
- et al. Evaluating the predictive performance of habitat models developed using logistic regression. Ecol. Model. (2000)
- et al. Maximum entropy modelling of species geographic distributions. Ecol. Model. (2006)
- Constrain to perform: regularization of habitat models. Ecol. Model.
- The effect of species response form on species distribution model prediction and inference. Ecol. Model.
- Disturbance history of a swamp forest refuge in northern Sweden. Biol. Conserv.
- Species distribution modelling – effect of design and sample size of pseudo-absence observations. Ecol. Model.
- Fit-for-purpose: species distribution model performance depends on evaluation criteria - Dutch hoverflies as a case study. PLoS One
- Vegetation and environment: discontinuities and continuities
- Selecting from correlated climate variables: a major source of uncertainty for predicting species distributions under climate change. Ecography
- Spatial aspects of ecological data
- Weak climatic associations among British plant distributions. Global Ecol. Biogeogr.
- A new method for dealing with residual spatial autocorrelation in species distribution models. Ecography
- Modelling presence of swamp forest and forest dwelling birds in a boreal forest reserve using airborne laser scanning
- Modelling species’ distributions
- Collinearity: a review of methods to deal with it and a simulation study evaluating their performance. Ecography
- Methods to account for spatial autocorrelation in the analysis of species distributional data: a review. Ecography
- A fine-grained spatial prediction model for the red-listed vascular plant Scorzonera humilis. Nordic J. Bot.
- Novel methods improve prediction of species’ distributions from occurrence data. Ecography
- A statistical explanation of MaxEnt for ecologists. Diversity Distributions
- Impact of scale and quality of digital terrain models on predictability of seabed terrain types. Marine Geodesy
- Mapping species distributions: spatial inference and prediction
- Predicting habitat suitability for rare plants at local spatial scales using a species distribution model. Ecol. Appl.
- Predicting species distribution: offering more than simple habitat models. Ecol. Lett.
- A gradient analytic perspective on distribution modelling. Sommerfeltia
- A strict maximum likelihood explanation of MaxEnt, and some implications for distribution modelling. Sommerfeltia
- Opportunities for improved distribution modelling practice via a strict maximum likelihood interpretation of MaxEnt. Ecography
- The meaning and use of the area under a Receiver Operating Characteristic (ROC) curve. Radiology
- The Elements of Statistical Learning
- The effect of sample size and species characteristics on performance of different species distribution modelling methods. Ecography
- dismo: Species Distribution Modelling package Version 1.0-12
- Boreal swamp forests. Bioscience
1 Present address: Vestmarkavegen 1672, NO-2233 Vestmarka, Norway.