Update: The paper is now available (free) from Methods in Ecology and Evolution.
Update 2: In the tutorial on how to use this method, we refer to the file “fit.JSDM.r”. This is the R script on the journal website named, somewhat opaquely, “mee312180-sup-0004-Rcode.R”. So save that R script, and re-name it before running the tutorial.
Species might tend to occur together, or they might tend to occur apart. Factors driving these patterns can include environmental variables or species interactions. Species distribution models can predict the probability of occurrence of species, but they rarely account for the joint occurrence of multiple species.
I had been working on this idea of modelling co-occurrence with Kirsten Parris using her frog data, and also with Laura Pollock and Peter Vesk using Laura’s eucalypt occurrence data. We were aiming to model co-occurrence within species distribution models. Well, it has taken a few years to figure out (to be precise, it took a few years to find out that someone else had figured it out), but we now have species distribution models that account for the joint occurrence of species.
To help understand what we are trying to do, imagine a simple case with two species each having a 50% chance of being at a site. If the species occur independently of each other, then the probability they are both present would be 25% (50% × 50%), and the probability that both are absent would also be 25%. The remaining 50% is made up of one species being present and the other absent or vice versa.
However, if the species only occurred together, then the probability they are both present would be 50%, as would the probability they are both absent. One species would never occur without the other.
In the other extreme (of “exclusion”), the probability the first species is present and the other absent is 50%, while the probability the first is absent and the other present is also 50%; the species would never occur together and never both be absent from a site.
Intermediate levels of correlation lie between these extremes of association. Some degree of association can be explained by relationships with shared explanatory variables (e.g., two species of tree might tend to be found on rocky hilltops), but residual correlation might exist. Species might exclude each other competitively, or by other processes. Alternatively, other factors that have not been considered might make the species more positively associated with each other than predicted by their response to measured variables. If factors that influence co-occurrence are not included in the model, then residual correlation in the occurrence of species will arise.
On top of these residual correlations, we also need to model spatial variation in the occurrence of species. We don’t want to assume the probability of occurrence of each species is 50% everywhere! It seems to be getting complicated, but a solution exists.
A few years ago, I worked on simulating correlated events – in that case it was a model of spatial correlation in the occurrence of fire (McCarthy and Lindenmayer 1998, 2000). The approach worked by simulating correlated normal variates, and then converting these to Bernoulli variates (zeroes and ones). This was achieved by setting the Bernoulli variate to one (fire occurred at the site) if the normal variate exceeded zero and setting the Bernoulli variate to zero otherwise (there was no fire at the site). The means of the normal distribution reflected the required probability of the event occurring.
I realised that we could use this same framework to model the occurrence of species. The only difference was that we wanted to estimate the correlation structure, while previously I had assumed a particular correlation structure and then simulated it.
Here’s how the method works. But before considering multiple species, it is first necessary to understand how the occurrence of species can be modelled using a latent normal random variate. Rather than modelling the probability of occurrence of a species, and then determining the presence or absence as a draw from a Bernoulli distribution (1 for presence, 0 for absence), we can model the species as being present at a site when a draw from a normal distribution is greater than zero.
Assuming the normal distribution has a standard deviation of 1, the mean of the normal distribution solely controls the probability of occurrence. A small mean translates to a low probability of occurrence, and a large mean translates to a high probability of occurrence. This is simply a representation of probit regression using a latent variable (probit regression is very similar to logistic regression but with a different link function in the generalized linear model).
Considering co-occurrence of n species requires an n-dimensional normal distribution. I’ll illustrate the approach with the simplest case of two species, which then requires a bivariate normal distribution. The probability of occurrence of each species is again controlled by the mean of its underlying normal distribution. If this latent variate is greater than zero, then the species is present, and it is absent otherwise. Residual correlation in the occurrence of the two species is controlled by the correlation coefficient of the bivariate normal.
Thus, the location (mean) of the bivariate normal distribution controls the mean occurrence of each species, while the correlation in the distribution controls the patterns of residual co-occurrence. Regression equations with estimated coefficients are used to model the means for each species (hence influence the probability of occurrence), and the residual correlation is controlled by the correlation matrix, which is also estimated.
However, the difficulty in using this approach lay in estimating the correlation structure. Enter Bayesian methods. MCMC methods allow estimation of multivariate normal distributions. So, that was a logical way of approaching this particular problem. Multivariate normal distributions are defined by a set of means (one mean for each variate), and a variance-covariance matrix.
However, the sticking point was that when converting the normal distribution to a Bernoulli, I needed to use a multivariate distribution with standard deviations equal to one (the variance-covariance matrix needed to be a correlation matrix) – or at least so I thought. I couldn’t figure out how to constrain the matrix to be a correlation matrix and still estimate the parameters for a useful size of problem in a reasonable amount of time.
I muddled away on this problem for a few years, without much success. Eventually, however, I googled “multi-variate probit regression”, after realising that would be an appropriate name for my model. Lo and behold, there it was – Chib and Greenberg (1998) simply re-scaled the regression coefficients rather than re-scaling the variance-covariance matrix. The problem was solved; essentially we needed to re-scale the linear predictor of the probit regression (the mean of the normal distribution) rather than re-scaling the covariances. The approach even has a Wikipedia page (although that needs some work, including a reference to Chib and Greenberg 1998!).
You can see why this rescaling works by comparing the figure on the left to the previous one above. In the figure to the left, the standard deviation and the mean for the latent variable for the species have both been doubled, so the probability that the latent variable exceeds zero is the same in both cases.
Having found a solution, a group of us from QAECO banded together to write a paper. However, we first needed to see if anyone else in ecology had discovered this idea. Well, Bob O’Hara had, referencing Chib and Greenberg. So, I contacted Bob, and it turned out that Nick Golding had also been working on the same model, and had developed an R package. So we all combined forces, driven by Laura Pollock and Reid Tingley with Will Morris and his R programming grunt, to write a paper.
Our paper aims to make the approach of Chib and Greenberg (1998) accessible to ecologists, providing R and BUGS code to run the analyses. And I’m pleased to say that it has just been accepted in Methods in Ecology and Evolution (Pollock et al. in press).
In the meantime, Clarke et al. (in press) published a paper in Ecological Applications that used the method, and we had already found a variant based on logistic regression by Otso Ovaskainen (2010). Plus, Dave Harris is working on the same topic using a different approach. It seems everyone is converging on the same idea. We hope you like it; you can
read the submitted version of the paper here; it will be open access once it is in print read it here (no subscription required).
Chib, S. & Greenberg, E. (1998) Analysis of multivariate probit models. Biometrika, 85, 347-361.
Clark, J.S., Gelfand, A.E., Woodall, C.W. & Zhu, K. (in press) More than the sum of the parts: Forest climate response from Joint Species Distribution Models. Ecological Applications, http://dx.doi.org/10.1890/13-1015.1.
McCarthy, M.A. & Lindenmayer, D.B. (1998) Multi-aged mountain ash forest, wildlife conservation and timber harvesting. Forest Ecology and Management, 104, 43-56.
McCarthy, M.A. & Lindenmayer, D.B. (2000) Spatially-correlated extinction in a metapopulation of Leadbeater’s possum. Biodiversity and Conservation, 9, 47-63.
Ovaskainen, O., Hottola, J. & Siitonen, J. (2010) Modeling species co-occurrence by multivariate logistic regression generates new hypotheses on fungal interactions. Ecology, 91, 2514-2521.
Pollock, L.J., Tingley, R., Morris, W.K., Golding, N., O’Hara, R.B., Parris, K.M., Vesk, P.A., and McCarthy, M.A. (in press) Understanding co-occurrence by modelling species simultaneously with a Joint Species Distribution Model (JSDM). Methods in Ecology and Evolution. http://dx.doi.org/10.1111/2041-210X.12180