The effects of model and data complexity on predictions from species distributions models
Introduction
Understanding why species are distributed as they are is a central problem in ecology. Current methods for studying species distributions often involve statistical or numerical models that relate the distributions of species to layers of environmental information. The use of correlative species distributions models (also known as bioclimatic envelope models, habitat suitability models, and ecological niche models; for definitions of these seemingly related terms see Araújo and Peterson, 2012) is currently the most widespread approach due to their versatility, ease of use, and modest data requirements (e.g., Guisan and Zimmermann, 2000, Elith and Leathwick, 2009). Yet, despite the widespread use of these models, the debate over which modelling approach is best is far from settled (Araújo and Rahbek, 2006), and predictions from alternative models can differ markedly in the context of spatial (e.g., Randin et al., 2006, Duncan et al., 2009, Heikkinen et al., 2012) and temporal transferability (e.g., Thuiller, 2004, Araújo et al., 2005a; Pearson et al., 2006; Zanini et al., 2009). Previous tests of the performance of species distributions models have led to the conclusion that more complex models were generally better than simpler ones (e.g., Segurado and Araújo, 2004, Elith et al., 2006). However, model performance is typically inflated when test data are not independent from the data used to train the models, such as when data are randomly split between training and test sets (Araújo et al., 2005b; but see Madon et al., 2013). In the few cases in which models have been tested for transferability using independent data (from another region or another time), no clear relationship between the perceived complexity of the models and their performance was found (Araújo et al., 2005b, Randin et al., 2006, Dobrowski et al., 2011, Heikkinen et al., 2012, Smith et al., 2013).
Models perceived as ‘simple’ usually have fitting procedures that are easier to grasp, and/or perform fewer or simpler operations on the data. In contrast, models perceived as ‘complex’ involve fitting procedures that are more difficult to comprehend, and usually perform a larger number of operations to produce the desired outcome. It is implicitly assumed that this loose definition of complexity is related to the capacity of different models to produce either ‘simple’ or ‘complex’ response curves (e.g., Elith et al., 2006, Merow et al., 2014). When selecting the best model for a given problem, parsimony should lead to selecting models that minimize overall prediction error by finding an optimal balance between the error arising from a model’s limited flexibility (also referred to as approximation error, or bias) and the error arising from sensitivity to the particular training sample (also referred to as estimation error, or variance). Models that are too simple fit the training data poorly (high bias), while overly complex models achieve low bias at the cost of high variance, as they capture random error or biases in the data.
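This trade-off can be made concrete with a minimal sketch (in Python, using arbitrary simulated data and polynomial degrees rather than the models or data of this study): polynomials of increasing degree are fitted to noisy observations of a smooth function, and training error is compared against error on a held-out test set.

```python
import numpy as np

# Illustrative bias-variance sketch: a smooth nonlinear "truth" plus noise.
rng = np.random.default_rng(42)
f = lambda x: np.sin(3 * x)
x_train = np.sort(rng.uniform(-1, 1, 30))
x_test = np.sort(rng.uniform(-1, 1, 200))
y_train = f(x_train) + rng.normal(0, 0.2, x_train.size)
y_test = f(x_test) + rng.normal(0, 0.2, x_test.size)

for degree in (1, 4, 12):
    coeffs = np.polyfit(x_train, y_train, degree)       # fit polynomial model
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree:2d}: train MSE = {train_mse:.3f}, test MSE = {test_mse:.3f}")

# Typically: degree 1 underfits (high bias on both sets), degree 12 overfits
# (low training error, inflated test error), and degree 4 balances the two.
```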
The concept of model complexity is central to the endeavour of finding optimal models for predictive purposes. Yet, measuring model complexity is not straightforward. Here, we attempt to formalize one of the key aspects of model complexity (algorithmic or computational complexity, see Section 1.1) and test whether the principle of parsimony can guide identification of the optimal model complexity for predicting species distributions in space and time. In addition, we quantify structural complexity in the response data (i.e., presences and absences of species) and examine how data complexity affects the predictive abilities of models. To overcome problems of data availability, we simulated virtual species across a realistic geographical domain with different sets of environmental predictors, and compared some of the results with empirical presence–absence records from the same geographical domain.
The computational complexity of an algorithm is defined by the amount of computational resources it requires to produce an output (Arora and Barak, 2009). This definition stems from the idea that an algorithm processes an input via a certain number of elementary operations, and these operations consume varying amounts of computing time. The computing time spent by the algorithm is thus an approximation to the complexity of the operations performed on the input. Complex algorithms generally perform more complex operations on their input than simple ones, thereby requiring more computation time to solve a particular task. Numerical analyses of computational complexity treat algorithms as black boxes, disregarding their internal structure, functional form or any other specificity. Such analyses are, therefore, suitable when the goal is to compare different methodologies on an equal footing.
Computational complexity is also referred to as time complexity or algorithmic complexity, and it is commonly expressed in O notation (read ‘big O’). This notation identifies the time complexity of an algorithm by the highest-order term of its growth rate as a function of input size, suppressing lower-order terms and constants. It is an asymptotic measure of complexity: as input size increases, so does the importance of the dominant term in characterizing computation time. For example, an algorithm may take 3x² + 2x time units to solve a problem of size x. As x approaches infinity, the higher-order term (x²) will come to dominate computation time, and the lower-order terms, as well as the multiplicative coefficients, will become irrelevant. This particular algorithm is thus said to have a time complexity of O(x²). If the computation time of an algorithm is independent of dataset size, it is said to be a constant-time algorithm, with a time complexity of O(1). Because this methodology estimates the asymptotic behaviour of a given algorithm, it also bypasses the issue of comparing algorithms written in different programming languages: the computational cost of a given algorithm implemented in two different languages is assumed to be proportional, up to a multiplicative constant that becomes irrelevant asymptotically. The chief assumption of the method is that the algorithms being compared are efficiently programmed, i.e., there are no spurious tasks within the algorithms consuming computation time. A full treatment of computational complexity is beyond the scope of this study (but see Arora and Barak, 2009, Papadimitriou, 2003).
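The worked example above can be verified directly; the short Python snippet below shows the ratio (3x² + 2x)/x² converging to the constant 3 as x grows, which is why only the quadratic term matters asymptotically:

```python
# An algorithm taking 3x^2 + 2x time units for a problem of size x:
# as x grows, the total divided by x^2 converges to the constant 3,
# so lower-order terms and coefficients vanish and the algorithm is O(x^2).
for x in (10, 1_000, 100_000):
    total = 3 * x**2 + 2 * x
    print(f"x = {x:>7}: time units = {total:>16}, ratio to x^2 = {total / x**2:.5f}")
```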
Consistent with the principle of parsimony, and assuming that algorithmic complexity is a good proxy for overall model complexity, the highest predictive capacity should be expected in models of intermediate algorithmic complexity. It is worth noting that the quantification of algorithmic complexity is independent of the modelling methodology. That is, the framework implemented herein with correlative species distributions models could equally be applied to mechanistic approaches for modelling species distributions (e.g., Fordham et al., 2013, García-Valdés et al., 2015).
Estimating the ecological niche of a species is an instance of the broad class of problems in which a set of points (in environmental space) must be classified into one of two opposing classes (presence/absence) according to some relationship between the dimensions of the space and the class to which each point belongs. The difficulty of estimating the ecological niche of a species can therefore be assessed by evaluating the geometrical structure of the boundary between classes in the training data. Aside from deficiencies and biases in data collection (Barry and Elith, 2006; Araújo et al., 2009), the internal structure of the data and its relationship with models’ predictive capacity has never been formally explored (but see Blonder et al., 2014). It has, however, been extensively addressed in other scientific fields, particularly within the machine learning community, where the concept of geometrical complexity has been developed. Given a dataset with a two-class categorical response and N predictors, geometrical complexity is defined as an approximation of the structural characteristics of the N-dimensional boundary separating the response classes (Basu and Ho, 2006). It is a general measure defined by a set of complementary metrics (see Section 2). When analyzed together, these metrics help differentiate datasets with geometrically simple class boundaries from those with complex and/or random class boundaries.
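As an illustration of such metrics, the sketch below computes one widely used member of this suite, the maximum Fisher’s discriminant ratio over all predictors, for a hypothetical presence–absence dataset; the actual set of metrics used in this study is described in Section 2.

```python
import numpy as np

def max_fisher_ratio(X, y):
    """Maximum Fisher's discriminant ratio across predictors.

    X: (n_sites, n_predictors) matrix of environmental values.
    y: binary response (1 = presence, 0 = absence).
    Higher values indicate a simpler, more separable class boundary.
    """
    pres, absent = X[y == 1], X[y == 0]
    numerator = (pres.mean(axis=0) - absent.mean(axis=0)) ** 2
    denominator = pres.var(axis=0) + absent.var(axis=0)
    return float(np.max(numerator / denominator))

# Hypothetical example: 200 sites described by two climatic predictors,
# with presences shifted away from absences along both dimensions.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(2, 1, (100, 2))])
y = np.repeat([0, 1], 100)
print(max_fisher_ratio(X, y))  # well-separated classes yield a high ratio
```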
We predict that data complexity is related to the predictive capacity of the models. Specifically, simpler datasets should reflect simpler occurrence–environment relationships and thus be easier to model, yielding comparatively better performance than models trained with more complex datasets.
Virtual species generation
We created three different types of virtual species. Their distributions were projected across mainland Spain by defining environmental suitability landscapes based on different sets of environmental covariates resampled within a 1 km × 1 km grid (Table 1 and Appendix A). Following Valladares et al. (2014), we assumed that the suitability of the environment for each species followed nonlinear functional forms, and that the overall environmental suitability was the product of the partial suitabilities for each environmental covariate.
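A minimal Python sketch of this general recipe is given below; the Gaussian response curves, optima, breadths and grid are illustrative assumptions, not the parameterization used in the study.

```python
import numpy as np

def partial_suitability(env, optimum, breadth):
    """Nonlinear (Gaussian) response of suitability to one environmental covariate."""
    return np.exp(-((env - optimum) ** 2) / (2 * breadth ** 2))

rng = np.random.default_rng(7)

# Two hypothetical environmental covariates on a 100 x 100 grid.
temperature = rng.normal(15, 5, (100, 100))
precipitation = rng.normal(600, 200, (100, 100))

# Overall suitability as the product of the partial suitabilities.
suitability = (partial_suitability(temperature, optimum=18, breadth=4)
               * partial_suitability(precipitation, optimum=800, breadth=250))

# One way to derive a presence-absence distribution: Bernoulli draws
# with per-cell occurrence probability equal to the suitability.
presence = rng.random(suitability.shape) < suitability
print(f"prevalence of the virtual species: {presence.mean():.3f}")
```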
Computational complexity
The asymptotic behaviour of all techniques was best approximated by a linear relationship between dataset size and computation time, but the slope of the linear regression differed by several orders of magnitude among the algorithms used to model species distributions (Table 2). The only exception was BIOCLIM, which proved insensitive to the size of the dataset and performed equally fast, on average, for any number of data points. Among the other methodologies, the Random Forest algorithm was the most computationally demanding.
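A minimal sketch of this timing protocol is shown below, using two scikit-learn classifiers as stand-ins for the algorithms compared in the study; the dataset sizes, mock predictors and model settings are illustrative assumptions.

```python
import time
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
sizes = [1_000, 2_000, 4_000, 8_000]

for name, model in [("GLM (logistic)", LogisticRegression(max_iter=1000)),
                    ("Random Forest", RandomForestClassifier(n_estimators=100))]:
    times = []
    for n in sizes:
        X = rng.normal(size=(n, 5))               # five mock environmental predictors
        y = (X[:, 0] + rng.normal(size=n)) > 0    # mock presence-absence response
        start = time.perf_counter()
        model.fit(X, y)
        times.append(time.perf_counter() - start)
    # Slope of the least-squares line of computation time against dataset size.
    slope = np.polyfit(sizes, times, 1)[0]
    print(f"{name}: slope = {slope:.2e} seconds per data point")
```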
Discussion
We asked how the computational complexity of species distributions models and the geometrical complexity of species distributions data affect the performance of species distributions modelling techniques with and without temporal transferability. The starting assumption was that models of intermediate complexity would tend to show increased performance, while simpler data would be easier to model. Consistent with our hypothesis, we found that data complexity was inversely related to model performance.
Conclusions
We evaluated different aspects of complexity related to (1) the computational cost of eight SDM algorithms, and (2) the geometrical characteristics of species distributions data. MARS, MaxEnt, BRT and GAM performed as well as Random Forest at much lower computational cost, while BIOCLIM performed worse than these five methods but better than GLM and SVM, which were the worst-performing methods with and without temporal transferability. Consistent with previous studies, the capacity of models to predict independent data was not clearly related to their computational complexity.
Acknowledgements
We thank Babak Naimi, Francisco Ferri-Yáñez, Manuel Mendoza, Alejandro Rozenfeld and an anonymous reviewer for discussion. This study was funded through the Integrated Program of IC&DT Call No 1/SAESCTN/ALENT-07-0224-FEDER-001755. D.G.-C. acknowledges additional support from the Spanish Ministry of Education (FPU fellowship).
References
Austin, M.P., 2002. Spatial prediction of species distribution: an interface between ecological theory and statistical modelling. Ecol. Model.
García-Valdés, R., et al., 2015. Effects of climate, species interactions, and dispersal on decadal colonization and extinction rates of Iberian tree species. Ecol. Model.
Guisan, A., Zimmermann, N.E., 2000. Predictive habitat distribution models in ecology. Ecol. Model.
Araújo, M.B., Peterson, A.T., 2012. Uses and misuses of bioclimatic envelope modeling. Ecology.
Araújo, M.B., Rahbek, C., 2006. How does climate change affect biodiversity? Science.
Araújo, M.B., et al., 2005. Reducing uncertainty in projections of extinction risk from climate change. Glob. Ecol. Biogeogr.
Araújo, M.B., et al., 2005. Validation of species–climate impact models under climate change. Glob. Change Biol.
Araújo, M.B., et al., 2009. Reopening the climate envelope reveals macroscale associations with climate in European birds. Proc. Natl. Acad. Sci.
Arora, S., Barak, B., 2009. Computational Complexity: A Modern Approach.
Barry, S., Elith, J., 2006. Error and uncertainty in habitat models. J. Appl. Ecol.
Blonder, B., et al., 2014. The n-dimensional hypervolume. Glob. Ecol. Biogeogr.
Booth, T.H., et al., 2014. Bioclim: the first species distribution modelling package, its early applications and relevance to most current MaxEnt studies. Divers. Distrib.
Breiman, L., 2001. Random forests. Mach. Learn.
Brook, B.W., et al., 2009. Integrating bioclimate with population models to improve forecasts of species extinctions under climate change. Biol. Lett.
Brotons, L., et al., 2004. Presence–absence versus presence-only modelling methods for predicting bird habitat suitability. Ecography.
Cutler, D.R., et al., 2007. Random forests for classification in ecology. Ecology.
Dobrowski, S.Z., et al., 2011. Modeling plant ranges over 75 years of climate change in California, USA: temporal transferability and species traits. Ecol. Monogr.
Duncan, R.P., et al., 2009. Do climate envelope models transfer? A manipulative test using dung beetle introductions. Proc. R. Soc. B: Biol. Sci.
Elith, J., Leathwick, J.R., 2009. Species distribution models: ecological explanation and prediction across space and time. Annu. Rev. Ecol. Evol. Syst.
Elith, J., et al., 2006. Novel methods improve prediction of species’ distributions from occurrence data. Ecography.
Elith, J., et al., 2008. A working guide to boosted regression trees. J. Anim. Ecol.
Elith, J., et al., 2011. A statistical explanation of MaxEnt for ecologists. Divers. Distrib.
Foody, G.M., 2011. Impacts of imperfect reference data on the apparent accuracy of species presence–absence models and their predictions. Glob. Ecol. Biogeogr.
Fordham, D.A., et al., 2013. Tools for integrating range change, extinction risk and climate change information into conservation management. Ecography.
Friedman, J.H., 1991. Multivariate adaptive regression splines. Ann. Stat.
Halekoh, U., Højsgaard, S., 2014. A Kenward–Roger approximation and parametric bootstrap methods for tests in linear mixed models – the R package pbkrtest. J. Stat. Softw.
Hastie, T., Tibshirani, R., 1990. Generalized Additive Models.