With the recent online publication of a new paper, here’s a blog post about how the research arose – a fun confluence of mathematical and cognitive collaboration across two sides of the world. And some of it was achieved while I was sleeping…
I’ve been interested in imperfect detection during ecological surveys for a while now. It began in earnest when I worked with Brendan Wintle on the topic during his PhD. He was considering the question of required survey effort, and was building occupancy-detection models around the same time that Darryl MacKenzie, Drew Tyre, and Howard Stauffer were working on this same model. Prior to that, my interest had been piqued by Kirsten Parris’ work on imperfect detection of frogs.
Since then, one branch of my research has considered the question of maximizing detections of species given a search budget. A key paper here is my work with Cindy Hauser on optimizing detections of species across landscapes. It is central to my new research, so if you want some background, see here. One of the key features of that paper is that the probability of failed detection at a site is modelled by the function:
q = exp(-bx),
with search effort x, and detection rate b.
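As a minimal sketch (with purely illustrative numbers, not values from the paper), the model is easy to compute:

```python
import math

def prob_failed_detection(x, b):
    """Probability of failing to detect a species that is present,
    given search effort x and detection rate b: q = exp(-b*x)."""
    return math.exp(-b * x)

# Illustrative numbers: at a detection rate of 0.5 per hour,
# 6 hours of searching leaves about a 5% chance of missing the species.
q = prob_failed_detection(6, 0.5)  # exp(-3) ≈ 0.0498
```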
Now, the model Cindy and I developed has two clear limitations. First, it ignores the cost of travelling to sites. Second, it assumes that the detection rate of a species when present at a site is known precisely, and doesn’t vary from visit to visit (although it can vary from site to site).
However, we know that these assumptions are not likely to be met. Travel costs can be substantial in at least some surveys – if driving to a site that is far away, we might spend as much time in the car as actually searching for a species in the field. And the detection rate of species can vary from visit to visit at a site depending on the activity level of the animal, the flowering intensity of the plant, etc.
The combination of travel costs and variation in detection raises an interesting trade-off. If we want to maximize the chance of detecting a species at least once when it is present at a site, then we would want to visit the site once when the detection rate is highest if the detection rate varies. But if the variation in the detection rate is unpredictable (i.e., the rate is stochastic), then we would want to visit the site as many times as possible to increase the chance of visiting the site at least once when the detection rate was high.
However, travel costs impose a competing constraint: each visit to a site incurs a travel cost. Travel costs eat into our budget of time for actually surveying the site, so we would want to visit a site as few times as possible, and spend more time at the site during each visit.
So, Alana Moore and I set out to investigate this issue. We first tackled the question: “If we had a budget of time to spend surveying one site, how should we split that time over multiple visits?” We assumed that detection rates varied according to a log-normal distribution (with a particular mean and standard deviation), and then found the number of visits that optimized the expected probability of detecting the species at least once at the site.
The solution to this optimization was not particularly simple. We needed to find the number of visits (n) that minimized this lovely integral:

E[q] = [ ∫ exp(−b(x/n − c)) f(b) db ]^n,

where f(b) is the log-normal probability density of the detection rate, and the integral runs over all positive detection rates.
While it is not possible to find the optimal value of n analytically, one can find an analytical approximation. This is achieved by approximating the integral using Laplace’s method, and then optimizing that approximation. (As an aside, I like the idea of 18th century mathematics coming to play with 21st century ecology.) We found that the coefficient of variation in the detection rate (σ/μ), the search budget scaled by the mean detection rate (x/μ), and the scaled travel cost (c/μ) drove the optimal number of visits, with the coefficient of variation (σ/μ) and the ratio of the budget to the travel cost (x/c) being most influential. You can get the details of the model here (free – open access). We even tested how well our optimization worked using data from a field experiment – it seemed to work quite well.
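For readers who prefer code to integrals, here’s a rough numerical sketch of that optimization – Monte Carlo in place of the Laplace approximation, with entirely hypothetical parameter values, and assuming detection rates are drawn independently for each visit:

```python
import math, random

def expected_q_lognormal(x, c, n, mu, sigma, draws=50_000):
    """Monte Carlo estimate of the expected probability of failed detection
    over n visits, with per-visit detection rates drawn independently from
    a log-normal distribution with mean mu and standard deviation sigma."""
    t = x / n - c                          # search time per visit, after travel
    if t <= 0:
        return 1.0                         # travel eats the whole budget: no searching
    s2 = math.log(1 + (sigma / mu) ** 2)   # log-scale variance
    m = math.log(mu) - s2 / 2              # log-scale mean
    rng = random.Random(1)
    # Independence across visits gives E[q] = (E[exp(-b*t)])**n
    mean_q = sum(math.exp(-math.exp(rng.gauss(m, math.sqrt(s2))) * t)
                 for _ in range(draws)) / draws
    return mean_q ** n

# Hypothetical budget, travel cost, and detection-rate moments
x, c, mu, sigma = 8.0, 0.5, 0.6, 0.6
best_n = min(range(1, 16), key=lambda n: expected_q_lognormal(x, c, n, mu, sigma))
```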
Now, I really liked that paper. In fact, the idea that we can find a simple solution that approximately optimizes a rather complicated integral is appealing. However, it doesn’t get us on the path to optimizing effort over space and time; it is restricted to finding the optimal solution when surveying one site with a given budget. For the multiple-site case, we need to determine the optimal budget for each site, and determine for which sites that optimal budget is zero – the sites we shouldn’t bother visiting. The solution above results in equations that seem much too complicated for that task if we want to find analytical solutions.
However, a much simpler equation arises when the detection rate follows a gamma distribution, another distribution that is commonly used to model variables that are constrained to be positive, and which is somewhat substitutable for a log-normal.
At this point, Alana Moore and I started working on the following solution in unison. I was in Melbourne, and as I worked on it, Alana was sleeping in Toulouse. I emailed Alana about my progress before bed, and then Alana tackled the problem as I slept. It was incredibly efficient, and we cracked the solution in a matter of days. Here’s how it works…
Now, if detection rates follow a gamma distribution, that rather complicated integral that we see above is replaced by an integral that can be simplified to a hyperbolic function.
E[q] = (1 + θ(x/n − c))^(−nk).
Here, θ and k are the scale and shape parameters of the gamma distribution, which depend on the mean and standard deviation of the detection rate (θ = σ²/μ and k = μ²/σ², so that kθ = μ), while the search budget (x), travel cost (c) and number of visits (n) are the same as defined previously.
With that expression, we can derive the number of visits that minimizes the probability of failed detection, given a particular search budget x. I won’t give that function here – it isn’t really needed, but you can see it in the paper.
But if we substitute that optimal number of visits back into the hyperbolic function, we can derive the smallest possible probability of failed detection. That is equal to:
E[q]_min = exp(-hμx),
where h is a scaling factor. This is interesting in two ways. Firstly, the probability of failed detection, when optimized for the number of surveys, essentially has the same functional form as the model in Hauser and McCarthy (2009).
Secondly, the functional form of the scaling factor h is also interesting. It is simply a declining function of the product of c and the variance:mean ratio of the detection rate (θ).
Therefore, we can see that the combination of travel costs and variation in detection rate acts to reduce the average rate of detection (μ) and that this influence is simply a function of the product of two terms (c and θ).
This outcome was particularly exciting – look again at the equation for the minimized probability of failed detection:

E[q]_min = exp(-hμx).
Alana noticed this functional form first. Because it matches the model in Hauser and McCarthy (2009), it seemed that we might simply plug the modified function into the optimization machinery of that paper, and optimize searches over both space and time.
However, we had a couple of snags to overcome first. The first one – a relatively minor issue – was that some solutions for the optimal number of searches were less than one; these are untenable, so we needed to add a constraint that the number of searches be either zero or at least one.
The next issue was more important. The optimization of Hauser and McCarthy (2009) relies on a simple ranking scheme to choose which sites were surveyed. This ranking is possible, in part, because any search effort to a site is spent immediately on searching, and not travelling; the optimal solution is either to not search the site or to spend some positive amount of effort on searching. However, in the case of travel costs, the initial part of search effort allocated to a site will be spent on travelling to a site, so we shouldn’t bother searching a site unless the search effort is greater than the travel cost.
This consideration introduces a discontinuity into the solution – we either don’t search a site (x*=0), or we spend effort such that we allocate more than the travel cost on the site (x*>c). That discontinuity means that we cannot use the ranking approach of Hauser and McCarthy (2009) to determine which sites to survey.
Determining which sites to survey, in the absence of a simplifying algorithm, is a somewhat daunting task. Consider the case where we need to decide which of 300 sites to survey. In that case, there are 2^300 different combinations of sites that can be surveyed.

Now 2^300 is a very large number. In fact, it is approximately 2×10^90. To give you an idea of how large 2×10^90 is, we can compare it to the number of protons in the observable universe, which is approximately 10^80 (apparently).

That is, searching through the combinations of possible search strategies is approximately 20 billion times more involved than counting every proton in the observable universe. That is more difficult than I would have hoped.
Surely there had to be a better way. It seemed unlikely that this problem hadn’t been tackled before. Indeed, it turns out that the original solution that Cindy and I obtained is mathematically identical to the problem of searching for submarines, something that was solved by Bernard Koopman during World War II, and declassified in the 1950s. Surely the issue of fixed costs had been added already.
So, we started searching the Operations Research literature. The key was finding this paper by Vaibhav Srivastava and Francesco Bullo. I’ll admit that I didn’t understand it on first reading. In fact, I thought I didn’t understand it at all. But after a restless night in which I mulled over the problem in semi-sleep, I developed an algorithm that is essentially the same as in the Srivastava and Bullo paper. Brains sometimes work in strange ways (or at least mine does sometimes) – I must have gleaned enough from the Srivastava and Bullo paper on first reading to essentially replicate the idea, but I still don’t quite know how (the cases differ a little, because we have a functional form that is not quite the same shape as imagined by Srivastava and Bullo).
The algorithm works by realising that the optimal solution occurs such that the marginal benefit of searching is the same across all sites that are searched. If we chose a particular marginal benefit, then we can select sites on the basis of the search efficiency; search efficiency can be measured by the reduction in the probability of failed detection divided by the total search effort for that site. Using a knapsack optimization approach, we rank sites according to this criterion, and then find the set of sites that fills the search budget (fills the knapsack). This ranking of sites will not find the optimal solution perfectly (because the items in the knapsack are discrete), but it works well.
Then, it is simply a matter of searching over the range of possible marginal benefits to find the optimal solution – we find the marginal benefit that leads to the largest reduction in missed detections when summed across all sites. This reduces the mind-bogglingly-large m-dimensional optimization problem (m being the number of sites that might be surveyed) to a one-dimensional problem – much more tractable.
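Here’s a toy sketch of that one-dimensional search. The site values are entirely hypothetical, and I’ve used a simplified exponential detection curve with the travel cost as a fixed entry fee, rather than the exact functional form from the paper:

```python
import math

# Hypothetical sites: (presence probability p, detection rate b, travel cost c)
sites = [(0.8, 0.5, 1.0), (0.6, 0.9, 0.5), (0.4, 0.3, 2.0), (0.9, 0.2, 0.3)]
budget = 10.0

def plan_for_marginal_benefit(lam):
    """Given a target marginal benefit lam, work out each site's effort and
    search efficiency, then greedily fill the budget (knapsack-style) with
    the most efficient sites. Returns total expected detections."""
    candidates = []
    for p, b, c in sites:
        if p * b <= lam:
            continue  # this site never reaches the target marginal benefit
        # Effort where the marginal benefit p*b*exp(-b*(x - c)) equals lam
        x = c + math.log(p * b / lam) / b
        gain = p * (1 - math.exp(-b * (x - c)))  # expected detections at site
        candidates.append((gain / x, x, gain))   # efficiency = gain per effort
    candidates.sort(reverse=True)                # most efficient sites first
    spent = total = 0.0
    for _, x, gain in candidates:
        if spent + x <= budget:
            spent += x
            total += gain
    return total

# One-dimensional search over the marginal benefit replaces the 2^m search
best = max(plan_for_marginal_benefit(0.01 * i) for i in range(1, 100))
```

The greedy fill will not be perfectly optimal (the sites are discrete items in the knapsack), but, as with the algorithm in the paper, it gets very close at a tiny fraction of the cost of enumerating every combination of sites.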
In our new paper, we compared the solution from this algorithm to the true optimal case for a wide range of different parameter values, and show that the algorithm works well – it finds solutions for which the outcomes are very close to optimal.
So, that means we can determine the optimal allocation of search effort in space and time that finds the most occurrences of species in a landscape where sites vary in:
the probabilities of presence (or abundance) of the species;
the mean and standard deviation of the rate of detection; and
the cost of travel to the sites.
We are currently working on a few more tweaks where we relax another assumption or two of the Hauser and McCarthy model. But that is for further research…