Election fever hits again

In the 2013 election, I took some interest in the election result in Indi, a seat located in the north-east of Victoria. My interest was spurred by the chance that Sophie Mirabella, who was flagged to be the next Science Minister if the Liberal-National coalition won government, might be unseated by Cathy McGowan, an independent candidate. Also, I have some relatives in that part of the world, so I was interested to know who would be their local representative in what turned out to be a very close election.

I enjoyed trying to predict the outcome of the election in Indi, as counting continued over a matter of days. You can see an account of my efforts here. (As an aside, this is the most-read post on my blog – I have an alternative career option should I give up ecology!).

In predicting the winner of the election, there are two main unknowns to be determined – how the preferences are flowing to the two leading candidates, and whether the swing in votes is sufficient to unseat the sitting member.

Australia uses a preferential voting system. Voters select their preferred candidate in the seat for the House of Representatives, then their second preference, third preference, etc, until the voter has indicated their preferences for all candidates in the seat.

The initial counting of votes tallies these first preferences for each candidate. Then, the ballot papers of the candidate with the fewest votes are distributed to the other candidates based on the second preferences on those ballot papers. So if we had five candidates initially, the possible winners are narrowed down to four, and the ballot papers of the fifth candidate are then allocated to the remaining four candidates based on the second preferences.

Then, the ballot papers of the candidate now with the fewest votes are distributed among the other three. This process continues until only two candidates remain, at which point we have the two-candidate-preferred vote. At that point, the candidate with the most votes wins.
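
The distribution process above can be sketched in a few lines of Python (the post's own code is R; this is a toy illustration with made-up ballots, ignoring the AEC's formality rules and tie-breaking details):

```python
from collections import Counter

def two_candidate_preferred(ballots):
    """Repeatedly eliminate the candidate with the fewest current votes,
    transferring each ballot to its next surviving preference, until
    only two candidates remain."""
    remaining = {c for ballot in ballots for c in ballot}
    while len(remaining) > 2:
        # Each ballot counts for its highest-ranked surviving candidate
        tallies = Counter(next(c for c in ballot if c in remaining)
                          for ballot in ballots)
        for c in remaining:          # candidates with zero votes still count
            tallies.setdefault(c, 0)
        loser = min(remaining, key=lambda c: tallies[c])
        remaining.discard(loser)
    return dict(Counter(next(c for c in ballot if c in remaining)
                        for ballot in ballots))

# Toy example: five ballots ranking three candidates
ballots = [
    ["A", "B", "C"],
    ["A", "C", "B"],
    ["B", "C", "A"],
    ["C", "B", "A"],
    ["C", "B", "A"],
]
print(two_candidate_preferred(ballots))  # B is eliminated first
```

Here B has the fewest first preferences, so B's one ballot flows to C, giving a two-candidate-preferred count of A:2, C:3.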

In trying to predict the winner of an election, a key part is predicting how the preferences will flow to the two leading candidates. The Australian Electoral Commission provides updates on first preference counts initially, and then two-candidate-preferred counts as they are completed. Because the two-candidate-preferred counts lag behind the first preference counts, it would be useful to predict preference flows. If preferences have been counted for a sample of booths, it is possible to model the flow of preferences – here is one way to do that.

Let’s look at the first preference counts and two-candidate-preferred counts for a few booths in the seat of Indi from the 2013 election:

Candidate                      Alexandra  Baddaginnie  Baranduda  Barnawartha  Beechworth
R. Dudley (Rise Up Aust)              13            6          6            8          12
C. McGowan (Independent)             216           39        381          248         785
R. Leeworthy (Family First)           30            5         19           11          11
S. Mirabella (Liberal)               715           52        366          230         420
H. Aschenbrenner (Sex Party)          23            3         11            5          12
W. Hayes (Bullet Train)                6            0          5            7           1
R. Walsh (ALP)                       251            7         76           52         145
J. O'Connor (The Greens)              63            2         23            8         105
P. Rourke (Katter)                     2            2          3            2           8
R. Murphy (PUP)                       54            4         31           14          16
J. Podesta (Independent)               6            0          8            5          13
2CP: McGowan (Independent)           567           54        519          339       1,069
2CP: Mirabella (Liberal)             812           66        410          251         459
Total preference flows               448           29        182          112         323
Fraction to McGowan                0.783        0.517      0.758        0.813       0.879

We can see that in the Alexandra booth, Cathy McGowan only won 216 first preference votes, compared to Sophie Mirabella’s 715. But the 448 votes distributed from the remaining candidates flowed strongly towards McGowan – on more than 78% of those ballot papers, McGowan was preferenced ahead of Mirabella, so she collected those preferences.
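
The "Total preference flows" and "Fraction to McGowan" rows in the table are simple arithmetic on the published counts. A quick check of the Alexandra numbers in Python:

```python
# First preference and two-candidate-preferred counts, Alexandra booth
mcgowan_first, mirabella_first = 216, 715
mcgowan_2cp, mirabella_2cp = 567, 812

# Preferences that flowed to each leading candidate from eliminated candidates
flow_to_mcgowan = mcgowan_2cp - mcgowan_first        # 567 - 216 = 351
flow_to_mirabella = mirabella_2cp - mirabella_first  # 812 - 715 = 97
total_flow = flow_to_mcgowan + flow_to_mirabella     # 448 distributed ballots

fraction_to_mcgowan = flow_to_mcgowan / total_flow
print(round(fraction_to_mcgowan, 3))  # 0.783
```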

The flow of preferences was even stronger in Beechworth, where McGowan won almost 88% of the distributed preferences, but she got less than 52% of the preferences in Baddaginnie. You might notice a big difference between Beechworth and Baddaginnie in the first preferences. For example, the Greens won almost 7% of first preferences in Beechworth but less than 2% of first preferences in Baddaginnie.

We can model this flow of preferences as a function of the first preferences to predict the two-candidate-preferred vote from first preferences. Here, we are essentially aiming to predict the fraction of votes that flow from the first preferences of the other candidates to the two leading candidates.

We can build this model using linear regression, but we would like to constrain the model coefficients to lie between zero and one, because each coefficient estimates the proportion of a candidate's voters who preferenced one of the two leading candidates ahead of the other.
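
A minimal sketch of such a constrained fit, in Python rather than the R used later in the post: scipy's lsq_linear solves least squares with box constraints on the coefficients. The data here are simulated for illustration, not the Indi counts.

```python
import numpy as np
from scipy.optimize import lsq_linear

rng = np.random.default_rng(0)

# Fake data: first preferences for 4 minor candidates at 20 booths,
# with "true" flow fractions to one of the leading candidates
X = rng.integers(0, 200, size=(20, 4)).astype(float)
true_flow = np.array([0.8, 0.1, 0.6, 0.95])
y = X @ true_flow + rng.normal(0, 5, size=20)  # observed flows, with noise

# Least squares with every coefficient constrained to [0, 1]
fit = lsq_linear(X, y, bounds=(0.0, 1.0))
print(np.round(fit.x, 2))
```

The bounds keep each estimated flow fraction interpretable as a proportion, which an unconstrained regression would not guarantee.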

If we take the data from all of Indi’s 103 booths (and also the postal, early, provisional, and absentee votes), then our model results look like this:

Observed preference flows to Cathy McGowan versus fitted preference flows for each of the booths (and the non-ordinary votes) based on the 2013 federal election results.

Let’s look at the model coefficients:

0.623 : DUDLEY, Robert (Rise Up Australia Party), 985 votes
0.662 : LEEWORTHY, Rick (Family First Party), 1330 votes
0.383 : ASCHENBRENNER, Helma (Sex Party), 1402 votes
0.000 : HAYES, William (Bullet Train For Australia), 489 votes
0.992 : WALSH, Robyn (Australian Labor Party), 10375 votes
0.700 : O’CONNOR, Jenny (The Greens), 3041 votes
0.008 : ROURKE, Phil (Katter’s Australian Party), 615 votes
0.680 : MURPHY, Robert Denis (Palmer United Party), 2417 votes
1.000 : PODESTA, Jennifer (Independent), 841 votes

These coefficients estimate that McGowan was preferenced behind Mirabella by almost all voters who put Hayes first (the estimated coefficient is zero), but she was placed ahead of Mirabella by almost everyone who put Walsh first (the estimated coefficient is 0.992).

The graph shows that this pattern of preference flows as a function of first preferences is very consistent, at least in Indi. In some other electorates, it is not so consistent. Here are the model results for the seat of Batman, which will be hotly contested in 2016 between the Greens and the ALP:


Observed preference flows to Alexandra Bhathal (a Greens candidate) versus fitted preference flows for each of the booths in the seat of Batman (and the non-ordinary votes) based on the 2013 federal election results.

The model for Batman doesn’t work quite as well, largely because Bhathal received a greater flow of preferences from the non-ordinary votes (orange symbols in the figure) than from the ordinary votes. These non-ordinary votes are the postal votes (Bhathal received almost 1700 of the flowing preferences), absent votes (Bhathal received over 1000 of the flowing preferences), early votes (Bhathal received just under 1000 of the flowing preferences), and provisional votes (there were very few of these).

Interestingly, a similar pattern occurred in Wills, which is another inner Melbourne seat with a Greens candidate – it seems the Greens garnered strong preference flows from the non-ordinary votes in 2013. Whether that will be borne out in 2016 remains to be seen, but strong preference flows will be needed by the Greens if they are to prevail in Batman.

If you’d like to look at preference flows for yourself for different seats in the 2013 election, then you are welcome to use the R code I wrote – it scrapes the data from the AEC website, runs the model, and prints out the result.

The code is best run using the source command in R so that you are prompted to select the seat from the list of lower house seats (or you can just specify the seat number directly within the code). And please excuse my R coding – I know it is clumsy in places; I am still learning, and have yet to figure out R’s data structures well enough to vectorize operations (among other things I don’t understand).

Also, I haven’t checked that this works on all seats – there might be some anomalies that I haven’t accounted for.


library(XML)  # provides readHTMLTable and getHTMLLinks, used throughout

seatsite="http://results.aec.gov.au/17496/Website/HouseDivisionMenu-17496-NAT.htm" # 2013 seats listed here
# seatsite="http://results.aec.gov.au/15508/Website/HouseDivisionMenu-15508-NAT.htm" # 2010 seats

seat.table = readHTMLTable(seatsite, header=F, which=1, stringsAsFactors=F, as.data.frame=TRUE)  # extract the seats

# arranged in 3 columns, with various white spaces, so trim
x <- gsub("\t", "", seat.table[6,]$V1, fixed = TRUE)
x <- gsub("\r", "", x, fixed = TRUE)
x <- gsub("\n\n\n\n\n", "\n", x, fixed = TRUE)
V1 <- strsplit(x, "\n")

x <- gsub("\t", "", seat.table[6,]$V2, fixed = TRUE)
x <- gsub("\r", "", x, fixed = TRUE)
x <- gsub("\n\n\n\n\n", "\n", x, fixed = TRUE)
V2 <- strsplit(x, "\n")

x <- gsub("\t", "", seat.table[6,]$V3, fixed = TRUE)
x <- gsub("\r", "", x, fixed = TRUE)
x <- gsub("\n\n\n\n\n", "\n", x, fixed = TRUE)
V3 <- strsplit(x, "\n")

seat.names <- c(V1[[1]], V2[[1]], V3[[1]])  # combine the three columns into one

seatlinks <- getHTMLLinks(seatsite, relative=FALSE)  # get the links to the sites with seat specific data
prefix <- "http://results.aec.gov.au/17496/Website/"  
# prefix <- "http://results.aec.gov.au/15508/Website/" #2010 results

seatlinks <- paste(prefix, seatlinks[12:161], sep="")  # paste on the rest of the website
seatlinks <- gsub("FirstPrefs", "PollingPlaces", seatlinks)  # we will need polling booth data, so change names a little

nseats <- length(seat.names)

print(seat.names) # print out the seat names, and prompt user to select one

prompt <- paste("Enter number of seat (between 1 and ", nseats, "): ", sep="")
chosen <- as.numeric(readline(prompt))

cat("Selected seat is ", seat.names[chosen], "\n")

places = seatlinks[chosen]  # this is the link for the chosen seat

places.table = readHTMLTable(places, header=F, which=1, stringsAsFactors=F, as.data.frame=TRUE, skip.rows=c(1,2,3,4,5,6))

places.names <- places.table$V1  # gets the list of booths

placeslinks <- getHTMLLinks(places, relative=FALSE) # get links to booth data
placeslinks <- placeslinks[grep("HousePollingPlaceFirstPrefs", placeslinks)]  # trim off redundant info

nplaces <- length(placeslinks)
places.names <- places.names[1:nplaces]

placeslinks <- paste(prefix, placeslinks, sep="")  # paste on the prefix to booth links

skippedrows <- 1:8  # Need 9 for 2010, header=F; 8 for 2013, header=T

# get info for first booth
firstpref.table = readHTMLTable(placeslinks[1], header=T, which=1, stringsAsFactors=F, skip.rows=skippedrows)

# find number of candidates (the rows above the "FORMAL" total row)
ncandidates <- pmatch("FORMAL", firstpref.table$V1) - 1

# get candidate names and parties
candidate.names <- firstpref.table$V1[1:ncandidates]
candidate.parties <- firstpref.table$V2[1:ncandidates]

# get two candidate preferred names
twopp.names <- firstpref.table$V1[(nrow(firstpref.table)-2):(nrow(firstpref.table)-1)]

# get arrays ready for data scraping
firstpref <- array(-999, dim=c(nplaces+4, ncandidates))
twopp <- array(-999, dim=c(nplaces+4, 2))
places.total <- array(dim=nplaces+4)

for(i in 1:nplaces) {  # for each booth
  firstpref.table = readHTMLTable(placeslinks[i], header=T, which=1, stringsAsFactors=F, skip.rows=skippedrows)
  for(j in 1:ncandidates) {  # get first preference count for each candidate
    firstpref[i, j] <- as.numeric(gsub(",","", firstpref.table[j, 3]))  # gsub removes commas from string
  }
  places.total[i] <- sum(firstpref[i, 1:ncandidates])  # get total first prefs for each booth

  twopp[i, 1] <- as.numeric(gsub(",","", firstpref.table[(nrow(firstpref.table)-2), 3]))  # get 2CP data
  twopp[i, 2] <- as.numeric(gsub(",","", firstpref.table[(nrow(firstpref.table)-1), 3]))
}

# Now get non-ordinary votes
othervotesite <- gsub("HouseDivisionPollingPlaces", "HouseDivisionFirstPrefsByVoteType", places)
othervotes.table = readHTMLTable(othervotesite, header=T, which=1, stringsAsFactors=F, as.data.frame=TRUE, skip.rows=c(1,2,3,4,5,6))
absent <- as.numeric(gsub(",","", othervotes.table$V5[1:ncandidates]))
provisional <- as.numeric(gsub(",","", othervotes.table$V7[1:ncandidates]))
early <- as.numeric(gsub(",","", othervotes.table$V9[1:ncandidates]))
postal <- as.numeric(gsub(",","", othervotes.table$V11[1:ncandidates]))

firstpref[nplaces+1, ] <- absent
firstpref[nplaces+2, ] <- provisional 
firstpref[nplaces+3, ] <- early 
firstpref[nplaces+4, ] <- postal 

twopp[nplaces+1, 1] <- as.numeric(gsub(",","", othervotes.table[(nrow(othervotes.table)-2), 5]))  # absent
twopp[nplaces+2, 1] <- as.numeric(gsub(",","", othervotes.table[(nrow(othervotes.table)-2), 7]))  # provisional 
twopp[nplaces+3, 1] <- as.numeric(gsub(",","", othervotes.table[(nrow(othervotes.table)-2), 9]))  # early 
twopp[nplaces+4, 1] <- as.numeric(gsub(",","", othervotes.table[(nrow(othervotes.table)-2), 11])) # postal 

twopp[nplaces+1, 2] <- as.numeric(gsub(",","", othervotes.table[(nrow(othervotes.table)-1), 5]))
twopp[nplaces+2, 2] <- as.numeric(gsub(",","", othervotes.table[(nrow(othervotes.table)-1), 7]))
twopp[nplaces+3, 2] <- as.numeric(gsub(",","", othervotes.table[(nrow(othervotes.table)-1), 9])) 
twopp[nplaces+4, 2] <- as.numeric(gsub(",","", othervotes.table[(nrow(othervotes.table)-1), 11])) 

totalfirstprefs <- array(-999, dim=ncandidates)
for(j in 1:ncandidates)
  totalfirstprefs[j] <- sum(firstpref[, j]) # count total first prefs for each candidate across all booths

twopp.id <- pmatch(twopp.names, candidate.names)  # get id's of 2CP people
twopp.parties <- candidate.parties[twopp.id] # and their parties

nflowed <- twopp[,1] - firstpref[, twopp.id[1]] # number of preferences flowing to 2pp candidate number 1

otherfirst <- firstpref[, -twopp.id]  # get first pref votes for candidates other than 2pp candidates (we know where they go)
othernames <- candidate.names[-twopp.id]
otherparties <- candidate.parties[-twopp.id]
othertotalfirstprefs <- totalfirstprefs[-twopp.id]

# set up model specification for flow of preferences 
starter <- structure(rep(0.5,(ncandidates-2)), names=letters[1:(ncandidates-2)])
lowers <- structure(rep(0,(ncandidates-2)), names=letters[1:(ncandidates-2)])
uppers <- structure(rep(1,(ncandidates-2)), names=letters[1:(ncandidates-2)])

formula <- "nflowed ~ a*otherfirst[, 1]"

for(i in 2:(ncandidates-2))
  formula <- paste(formula, " + ", letters[i], "*otherfirst[, ", i, "]", sep="")

model <- nls(as.formula(formula), algorithm="port", start=starter, lower=lowers, upper=uppers)  # coerce the string to a formula
modelsum <- summary(model)

cat("Estimated flow to: ", twopp.names[1], "(", twopp.parties[1], ")\n")
for(i in 1:(ncandidates-2))
  cat(modelsum$parameters[i,1], " of ", othertotalfirstprefs[i], "votes from ", othernames[i], "(", otherparties[i], ")\n")

flows <- modelsum$parameters[,1] * othertotalfirstprefs
totflow <- sum(flows)
totflow2 <- sum(othertotalfirstprefs) - totflow

cat("\nEstimated votes to: ", twopp.names[1], "(", twopp.parties[1], "): ", totalfirstprefs[twopp.id[1]]+totflow, "\n")
cat("\nEstimated votes to: ", twopp.names[2], "(", twopp.parties[2], "): ", totalfirstprefs[twopp.id[2]]+totflow2, "\n")

RSS <- sum(residuals(model)^2)  # residual sum of squares
TSS <- sum((nflowed - mean(nflowed ))^2)  # total sum of squares
RSq <- 1 - (RSS/TSS)  # R-squared measure

fitted <- fitted(model)
maxi <- max(nflowed, max(fitted))

plot(fitted[1:nplaces], nflowed[1:nplaces], col="blue", xlab="Fitted preference flow", ylab="Observed preference flow", xlim=c(0,maxi), ylim=c(0,maxi))
points(fitted[(nplaces+1):(nplaces+4)], nflowed[(nplaces+1):(nplaces+4)], col="orange", pch=8)
abline(a=0, b=1)
rsqtxt <- paste("Flow to ", twopp.names[1], "\n", seat.names[chosen], "\nR-squared = ", round(RSq, digits = 4), "\nBlue are ordinary votes for each booth\nOrange are non-ordinary votes", sep="")
text(x=0, y=0.9*(max(nflowed)), labels=rsqtxt, pos=4)

Well, I hope you found that interesting. We’ll see what happens in the 2016 election… I might do something about swings in a second post if I have time.

Posted in Communication, Probability and Bayesian analysis | 1 Comment

15 Forms of Invasiveness

You might think they should be easy to identify – invasive species are seemingly everywhere we look, even carried inadvertently by people to the polar regions. And as one of the biggest threats to biodiversity, identifying which species are likely to become invasive is important.

Australia has some of the strictest quarantine measures in the world, based partly on only permitting entry to species judged to pose a low risk of invasiveness. Yet the number of naturalized species in Australia continues to rise, and those species are becoming more phylogenetically diverse as the diversity of trading partners increases.

Why can’t we do a better job of identifying which species will become invasive? Well, a new paper led by Jane Catford identifies one problem.

Attempts to identify invasive species have usually compared their traits to those of non-invasive species. Yet invasive species are classified using all sorts of criteria. Firstly, they might be exotic species that reach high abundance when present in an area. Or they might be species that have wide geographic ranges (because the environment in which they occur is common), or have wide environmental tolerances. Or they might be species that spread rapidly. Or they might be species with various combinations of these criteria.


Are orange hawkweeds regarded as being invasive because they spread quickly, potentially reach high local abundance, have broad environmental tolerances, or because they might spread across large geographic areas? Or is it some combination of these? Attempts to identify the traits that make species “invasive” should work better by examining these biological dimensions separately.

If we just think of these four criteria, then a species might have any one of 16 possible combinations. And only one of those combinations – meeting none of the criteria – would represent a species that is categorically “non-invasive”. So we might have (at least) 15 forms of invasiveness (2^4 − 1 = 15). And if you add a fifth criterion (e.g., species having adverse impact), then we would end up with 31 possible forms of invasiveness (2^5 − 1 = 31).
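
The count is just the number of non-empty subsets of the criteria, which is easy to verify in Python:

```python
from itertools import combinations

criteria = ["high local abundance", "wide geographic range",
            "broad environmental tolerance", "rapid spread"]

# Every non-empty subset of the criteria is a distinct "form of invasiveness"
forms = [c for r in range(1, len(criteria) + 1)
         for c in combinations(criteria, r)]
print(len(forms))  # 15
```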


The 15 possible forms of invasive species (and the one form of non-invasive species). The size of each pie slice shows the proportion of studies that classified invasive species by those criteria (from Catford et al., 2016).

Our paper examines the literature (112 papers) to determine how many of these forms of invasiveness have been used to classify invasive species. What do you know? All 15 forms of invasiveness appear in the literature.

This wide range of classification criteria will obscure efforts to identify a consistent set of traits of “invasive species”, because the traits that imbue species with broad environmental tolerances, for example, will differ from those that permit high rates of spread.

Examining traits linked with the four demographic dimensions of invasiveness (rather than “invasiveness” per se) should help highlight species at risk of becoming dominant, spreading quickly or occupying large ranges. That seems like a more fruitful way to identify invasive species because the meaning of “invasiveness” is defined more clearly.

The new paper:

Catford, J. A., Baumgartner, J. B., Vesk, P. A., White, M., Buckley, Y. M. and McCarthy, M. A. (2016). Disentangling the four demographic dimensions of species invasiveness. Journal of Ecology. doi:10.1111/1365-2745.12627 [Online] [Submitted version of manuscript]

Posted in CEED, Communication, New research

CSI: Ecology. Efficiency of eDNA sampling

Detecting a species from the DNA it left behind seems so much like CSI: Ecology. DNA deposited in the environment (eDNA for the cool kids), which can then be collected and identified, is increasingly advocated for ecological studies (Ficetola et al. 2008; Bohmann et al. 2014).


Adam Smart: doing a bit of sampling.

Detection rates from eDNA sampling are often much higher than those of more traditional survey methods, which is a potentially large advantage. For example, Adam Smart, a recently-completed MSc student in QAECO with Reid Tingley as his lead supervisor, showed that invasive newts in Melbourne were detected much more frequently from a single water sample than with a bottle trap (Smart et al. 2015).


While eDNA sampling can have higher detection rates than more traditional methods, it can also be more expensive. How do we determine whether eDNA sampling is cost-effective? This is the question Adam set out to answer with a second paper from his MSc, which has just been posted online (Smart et al. in press).

Adam compiled various cost data for bottle trapping and eDNA surveys. These costs included materials, travel to sites, time spent at sites, generating primers for the DNA analyses, and conducting the lab-based DNA analyses. He optimised the detection efficiency of each of the two survey methods, using our method to optimise allocation of effort among visits and effort per visit (Moore et al. 2014, but also see the latest in Moore and McCarthy in press, about which I am quite excited). He then compared the performance of the two methods as a function of different total search budgets.


Probability of detecting newts at a site in Melbourne as a function of total budget, for one of the scenarios examined in Smart et al. (in press). Results are shown for high-cost and low-cost eDNA scenarios, and for a standard bottle-trapping scenario.

It turns out that for the cases we examined, eDNA sampling and bottle trapping had similar detection efficiencies once costs were accounted for. The choice of the best method depended on the cost structures, but regardless, the efficiencies were quite similar.
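
The general shape of curves like the one above can be sketched with a simple model: if each survey costs c and independently detects the species with probability p, a budget B buys n = B/c surveys, and the chance of at least one detection is 1 − (1 − p)^n. The costs and detection probabilities below are invented for illustration; they are not the values from Smart et al.

```python
def detection_probability(budget, cost_per_survey, p_detect):
    """P(at least one detection) from the surveys a budget affords,
    assuming independent surveys with equal detection probability."""
    n_surveys = budget // cost_per_survey
    return 1 - (1 - p_detect) ** n_surveys

# Illustrative comparison: cheap low-sensitivity traps versus
# expensive high-sensitivity eDNA samples under the same budget
for budget in (100, 200, 400):
    trap = detection_probability(budget, cost_per_survey=10, p_detect=0.1)
    edna = detection_probability(budget, cost_per_survey=40, p_detect=0.5)
    print(budget, round(trap, 3), round(edna, 3))
```

Under these made-up numbers the two methods trade places as the budget grows, which is exactly why the comparison has to account for cost structure rather than per-survey sensitivity alone.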


While the two methods were similar, the cost efficiency of eDNA sampling should improve. Firstly, eDNA costs will decline over time, while the costs of traditional sampling methods are unlikely to decrease if person hours represent the main expense. Secondly, eDNA promises to detect many species simultaneously, especially for species that might be otherwise hard to detect (Bohmann et al. 2014). Thirdly, eDNA sampling avoids some of the ethical concerns arising from the effects of trapping animals.

Of course, eDNA sampling has some drawbacks. The actual source of the DNA cannot be guaranteed. Could it have arrived on the feet of ducks? Might it have arisen from contamination? Procedures exist to reduce contamination, but just as DNA evidence can be challenged in law courts, eDNA will not be 100% reliable in ecology. Further, physical specimens might be required, in which case eDNA will not be sufficient.

However, it seems that eDNA is only set to become more prevalent. We are continuing to work on improving and evaluating eDNA sampling via an ARC Linkage Grant – CSI: Ecology here we come!

Read Adam’s paper in Methods in Ecology and Evolution, or access the submitted version here.


Bohmann, K., Evans, A., Gilbert, M.T.P., Carvalho, G.R., Creer, S., Knapp, M., Yu, D.W., and de Bruyn, M. (2014). Environmental DNA for wildlife biology and biodiversity monitoring. Trends in Ecology & Evolution, 29, 358-367. [Online]

Ficetola, G.F., Miaud, C., Pompanon, F. & Taberlet, P. (2008). Species detection using environmental DNA from water samples. Biology Letters, 4, 423–425. [Online]

Moore, A.L., McCarthy, M.A., Parris, K.M., Moore, J.L. (2014). The optimal number of surveys when detectability varies. PLoS One 9(12): e115345. doi: 10.1371/journal.pone.0115345 [Blog] [Online – open access]

Moore, A.M., and McCarthy, M.A. (in press). Optimising ecological survey effort over space and time. Methods in Ecology and Evolution. [Online] [Submitted version of manuscript] [Blog]

Smart, A.S., Tingley, R., Weeks, A.R., van Rooyen, A.R., and McCarthy, M.A. (2015). Environmental DNA sampling is more sensitive than a traditional survey technique for detecting an aquatic invader. Ecological Applications 25:1944-1952. [Online]

Smart, A.S., Weeks, A.R., van Rooyen, A.R., Moore, A.L., McCarthy, M.A., and Tingley, R. (in press). Assessing the cost-efficiency of environmental DNA sampling. Methods in Ecology and Evolution. [Online] [Submitted version of manuscript]

Posted in CEED, Detectability, Ecological models, New research, Probability and Bayesian analysis

The Ideas Boom goes bust


Prime Minister Malcolm Turnbull delivering his speech that received a standing ovation from (most of) the audience. Even my hair was getting right into it.

He had just about everyone on their feet. The mood at the Prime Minister’s Prizes for Science was upbeat. We’d been treated to a stellar array of talent – science that could clearly lead to economic windfalls, science to help manage the country and its natural environment better, and outstanding teachers to help foster a more scientifically-literate community.

Malcolm Turnbull, the newly appointed Prime Minister, received a standing ovation for his speech that emphasized science as underpinning both the current and future prosperity of Australia. Standing ovations are a rare accolade from a scientific audience; rarer still when directed toward a politician, especially one from a party that had disparaged and cut funding to science so very recently.


The mood was buoyant, as we celebrated science and great scientists such as QAECO’s own Dr Jane Elith.

That was six months ago. And not everyone had stood to applaud. Some wanted more tangible proof that science really was going to be placed at the heart of Australia’s prosperity. But even as I stood, I wondered two things.

Could the Prime Minister sway his cabinet to follow his call?

And for all the talk about science and innovation, what would it mean for funding? Perhaps more importantly, what would it mean for planning and better integration of research and higher education?

Six months would be telling. In particular, what would the Turnbull government’s first budget reveal? I was prepared to wait.

For too long now, funding of universities and major research institutes such as CSIRO has been distributed without a coherent plan for research and researchers. We can’t simply turn funding on and off like a tap and see changes in research performance. The human capital employed by that funding doesn’t change that quickly. It takes months, even years, of lost productivity to shift one’s research environment. Training and establishing a research career can take a decade or more.

We can’t have a research fellowship scheme that sees 200 fellowships one year, but only 50 in another. Or a fellowship scheme that is an expendable hostage in a game of political brinkmanship. Perhaps more importantly than the number of fellowships, we need capacity within our research institutions to accommodate that talent in the longer term. There is little point investing in fellowships if those recipients have limited opportunities once their fellowships expire.

If we want people to continue on research careers, they need to be able to see pathways. The instability and lack of planning means pathways are obscure for many young researchers. At the moment, they are living on hope, on skepticism, or on disappointment.

And we’re living on a slogan. The Ideas Boom. Let me tell you – there won’t be an Ideas Boom in Australia without a properly funded plan. Up until now, the Ideas Boom was simply a $28M advertising campaign. It still is.

The Ideas Boom will happen elsewhere, and some of Australia’s most innovative scientists will move there. Why? Because the budget does very little to lay the foundation for any coherent plan for science and innovation in Australia. In fact, the budget entrenches the Abbott government’s 20% funding cut to universities without providing further capacity to increase income.

Government expenditure on research is going south. As a proportion of each country’s GDP, Australia spends less than half of the research expenditure of Iceland, Finland and Denmark. If this were an Olympic medal tally, the public would demand an inquiry – we’re being beaten by New Zealand! If we want an Ideas Boom, we’ll need more and smarter investment.


Government spending on R&D for major advanced economies. Sourced from the ABC FactCheck.

University income has been increasingly derived from sources other than the Federal government for several years now. Below are the values reported by The University of Melbourne – the majority of its funding will soon be from private sources, it seems, given further cuts to university funding. Other Australian universities are likely on a similar trend.


The University of Melbourne has increasingly sourced its income from private sources, particularly fees. With further funding cuts entrenched, that trend will continue.

The budget plugs some holes in research funding. Notably, various specialized pieces of infrastructure will be funded, supporting a small fraction of Australia’s research capacity. And extra funding for GeoSciences Australia will help us find more resources to mine. Antarctic science is set to benefit from an expanded Antarctic program, including a new boat just over the horizon. But the announced support is haphazard against a background of cuts.

This is not an Ideas Boom. It’s business as usual – reduced funding across the board, and sprinkles of funding in pockets.

The standing ovation is over. Everyone has resumed their seats, with arms folded. The Ideas Boom is busted. Well, to be honest it can’t really be busted because it was never built.

Posted in Uncategorized | 5 Comments

An election looms, and the temperature continues to rise…

In one of my subjects, I used data on the relationship between the global temperature and CO2 concentrations to teach how variability can mask relationships, making inference uncertain. The example was based on an Australian political debate in 2009 about whether the temperature of the Earth had stopped increasing despite an increase in atmospheric CO2.

With an Australian election looming in 2016, it seems timely to return to that debate, and evaluate the positions of the different protagonists now that we have some more data. It is nice to hold our politicians to account.

Let’s first look at the data. I will use the latest HadCRUT4 data for global surface temperature, and the average annual CO2 concentration data measured at Mauna Loa (both downloaded on 20 April 2016). We have the CO2 data from 1959 to 2015, so I will use that period. You can see that the CO2 concentration has continued to increase since 2008, largely unabated it would seem.


Atmospheric CO2 concentration as measured at Mauna Loa by NOAA. Data from 2009 onwards, available after the Senators’ debate, are shown in blue.

Temperature increases have been noticeably more variable, although that variation is consistent with annual variations of the past.


The HadCRUT4 measures of the global annual surface temperature anomaly, as measured by the Hadley Centre and the Climatic Research Unit.

Now, back to the debate.

Senator Wong believed that the Earth’s temperature would continue to increase in line with increases in CO2 concentrations. We can characterize that by a regression model that has a linear relationship with atmospheric CO2 over the entire time period. Of course, Senator Wong’s position was not just influenced by the data, but also by the overwhelming majority of climate scientists who agreed with her position given their understanding of the Earth’s climate system. Nevertheless, we will ignore that extra information here.

We can estimate the best fitting line for such a relationship, and predict the uncertainty around it – this uncertainty represents the range of predicted variation given that the recorded temperature at the Earth’s surface only approximates the heat content of the world. In the graphs I show here, the uncertainty bounds are 95% prediction intervals – we expect 95% of the observations to fall within the bound represented by the dashed lines.

Subsequently, we can determine how well that relationship fits the data that have been collected from 2009 onwards. With the CO2 concentration increasing, the Earth’s temperature has continued to increase. However, that increase has been somewhat below the average trend, with the notable exception of the data in 2015, which has exhibited a large spike.


Relationship between the global temperature anomaly and CO2 concentration measured at Mauna Loa. The solid line is a linear regression, and the dashed lines are the 95% prediction interval. The regression was estimated on data up until 2008 (black dots), and the predictions are compared to data for 2009-2015 (blue dots).

However, we can say that Senator Wong’s position is largely consistent with the data collected from 2009 onwards; her predictions fall within the uncertainty bounds. Also, it is worth noting that the large spike in temperature in 2015 might well have been expected, given the magnitude of the increase in the CO2 concentration.

Senator Fielding believed that the Earth’s temperature had stopped increasing after 1998. We can characterize that by a regression model that has a linear relationship with atmospheric CO2 up until 1998, and then a flat line after that period. Again, we can estimate the best fitting model for such a relationship, characterize the uncertainty around the predictions, and compare those predictions to the data collected from 2009 onwards.



Relationship between the global temperature anomaly and CO2 concentrations measured at Mauna Loa. The solid line is a piecewise linear regression that fits a sloped line to data up until 1998, and then a flat line from 1999 to 2008. Data for 1998 were removed to reduce the influence of cherry picking. The regression was estimated for data up until 2008 (black dots), and the predictions are compared to data for 2009-2015 (blue dots).
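A piecewise model of this kind is straightforward to fit. One way to encode “sloped until 1998, flat thereafter” is to regress temperature on the CO2 concentration capped at its 1998 value. This is a simplified sketch on synthetic data (the real analysis also removed the 1998 data point, which this sketch keeps for brevity):

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-ins for the real series; values are illustrative only.
years = np.arange(1959, 2009)
co2 = np.linspace(316.0, 385.0, years.size)                 # ppm
temp = -3.0 + 0.01 * co2 + rng.normal(0.0, 0.1, years.size)

# "Sloped until 1998, flat thereafter": regress temperature on the CO2
# concentration capped at its 1998 value, so the fit stops rising after 1998.
c98 = co2[years == 1998][0]
X = np.column_stack([np.ones(years.size), np.minimum(co2, c98)])
beta, *_ = np.linalg.lstsq(X, temp, rcond=None)
fitted = X @ beta

# Every fitted value after 1998 is identical: the flat segment
print(np.ptp(fitted[years > 1998]))
```

Capping the predictor keeps the two segments continuous at 1998, which matches the shape of the piecewise regression in the figure.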

Senator Fielding’s position was also largely in line with the data up until 2014. While the model characterizing Senator Wong’s position was a touch on the high side of the data, the model characterizing Senator Fielding’s position notably under-predicted the temperature.

And Senator Fielding’s model severely under-predicts the temperature of 2015 – the temperature spike of 2015 is well outside the expected bounds, even when considering the variation in the data.

So, here we are 7 years and several governments after this political debate. The data support Senator Wong’s position, and are now contradicting Senator Fielding’s position; the temperature continues to increase.

Note, the average temperature anomaly for 2016 so far is 0.979°C (based on only 2 months of HadCRUT4 data, so this is an uncertain estimate for the entirety of 2016). If that anomaly holds, we can conclude that Senator Fielding was sorely wrong, and even the model characterizing Senator Wong’s position might under-predict the temperature increase.

Seeing how atmospheric CO2 concentrations continue to increase, with concomitant increases in the heat content of the Earth, I can only conclude that the global community really should be acting much more rapidly. And I would hope that Australia would take the lead on this. I wonder to what extent this will feature in the election campaign.

Posted in Communication, Probability and Bayesian analysis

While I was sleeping: optimising ecological surveys over space and time

With the recent online publication of a new paper, here’s a blog post about how the research arose – a fun confluence of mathematical and cognitive collaboration across two sides of the world. And some of it was achieved while I was sleeping…

Moore, A.L., and McCarthy, M.A. (in press). Optimising ecological survey effort over space and time. Methods in Ecology and Evolution. DOI: 10.1111/2041-210X.12564. An open access (author-submitted) version is available here.

I’ve been interested in imperfect detection during ecological surveys for a while now. It began in earnest when I worked with Brendan Wintle on the topic during his PhD. He was considering the question of required survey effort, and was building occupancy-detection models around the same time that Darryl MacKenzie, Drew Tyre, and Howard Stauffer were also working on this same model. Prior to that, my interest had been piqued by Kirsten Parris’ work on imperfect detection of frogs.


When searching a site like this for the cascade treefrog (Litoria pearsoniana), how many times should you visit? Activity varies from night to night, so multiple visits would be useful, but we don’t want to incur multiple travel costs unnecessarily.

Since then, one branch of my research has considered the question of maximizing detections of species given a search budget. A key paper here is my work with Cindy Hauser on optimizing detections of species across landscapes. It is central to my new research, so if you want some background, see here. One of the key features of that paper is that the probability of failed detection at a site is modelled by the function:

q = exp(-bx),

with search effort x, and detection rate b.
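In code, this failed-detection model is a one-liner; the example values for b and x below are purely illustrative.

```python
import math

def prob_failed_detection(b, x):
    """Probability of failing to detect a species that is present,
    given detection rate b and search effort x: q = exp(-b * x)."""
    return math.exp(-b * x)

# Illustrative values: detection rate 0.5 per hour, 3 hours of searching
q = prob_failed_detection(0.5, 3.0)
print(f"P(miss) = {q:.3f}, P(detect at least once) = {1.0 - q:.3f}")
```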

Now, the model Cindy and I developed has two clear limitations. First, it ignores the cost of travelling to sites. Second, it assumes that the detection rate of the species when present at a site is known precisely, and doesn’t vary from visit to visit (although it can vary from site to site).

However, we know that these assumptions are not likely to be met. Travel costs can be substantial in at least some surveys – if driving to a site that is far away, we might spend as much time in the car as actually searching for a species in the field. And the detection rate of species can vary from visit to visit at a site depending on the activity level of the animal, the flowering intensity of the plant, etc.

The combination of travel costs and variation in detection raises an interesting trade-off. If we want to maximize the chance of detecting a species at least once when it is present at a site, then we would want to visit the site once when the detection rate is highest if the detection rate varies. But if the variation in the detection rate is unpredictable (i.e., the rate is stochastic), then we would want to visit the site as many times as possible to increase the chance of visiting the site at least once when the detection rate was high.

However, travel costs impose a competing constraint; multiple visits to a site incur a travel cost for each visit. Travel costs eat into our budget of time for actually surveying the site, so we will want to visit a site as few times as possible, and spend more time at the site during each visit.

So, Alana Moore and I set out to investigate this issue. We first tackled the question: “If we had a budget of time to spend surveying one site, how should we split that time over multiple visits?” We assumed that detection rates varied according to a log-normal distribution (with a particular mean and standard deviation), and then found the number of visits that optimized the expected probability of detecting the species at least once at the site.

The solution to this optimization was not particularly simple. We needed to find the number of visits (n) that minimized this lovely integral:


This integral is the expected probability of failed detection when detection rate varies according to a log-normal distribution with a given mean (μ) and standard deviation (σ). The model assumes a search budget of x, a per-visit cost of travel of c, and n surveys. We then find the number of surveys (the value of n) that minimizes this function.
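The equation image itself has not survived here. Working from the description in the caption, and assuming independent log-normal detection rates across the n visits (each visit getting search time x/n − c after travel), a plausible reconstruction is:

```latex
E[q] = \left[ \int_0^{\infty} e^{-b\,(x/n - c)}
       \, \frac{1}{b \, \sigma_L \sqrt{2\pi}}
       \exp\!\left( -\frac{(\ln b - \mu_L)^2}{2 \sigma_L^2} \right) db \right]^{n}
```

where μ_L and σ_L denote the log-scale parameters of the log-normal distribution corresponding to mean μ and standard deviation σ (this notation is mine, not the paper’s).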

While it is not possible to find the optimal value of n analytically, one can find an analytical approximation. This is achieved by approximating the integral using Laplace’s method, and then optimizing that approximation. (As an aside, I like the idea of 18th century mathematics coming to play with 21st century ecology.) We found that the coefficient of variation in the detection rate (σ/μ), the search budget scaled by the mean detection rate (x/μ), and the scaled travel cost (c/μ) drove the optimal number of visits, with the coefficient of variation (σ/μ) and the ratio of the budget to the travel cost (x/c) being most influential. You can get the details of the model here (free – open access). We even tested how well our optimization worked using data from a field experiment – it seemed to work quite well.
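Although the analytical approximation is the paper’s contribution, the optimal number of visits can also be checked numerically. The sketch below (all parameter values hypothetical) estimates the integral by Monte Carlo simulation and scans over candidate values of n:

```python
import numpy as np

def expected_q_lognormal(n, x, c, mean, sd, draws=200_000, seed=1):
    """Monte Carlo estimate of the expected probability of failing to
    detect the species over n visits, with per-visit search time x/n - c
    and i.i.d. log-normal detection rates (mean and sd on the natural scale)."""
    t = x / n - c
    if t <= 0:
        return 1.0  # travel costs exhaust the budget: nothing left to search
    s2 = np.log(1.0 + (sd / mean) ** 2)   # log-scale variance
    mu_l = np.log(mean) - s2 / 2.0        # log-scale mean
    rng = np.random.default_rng(seed)
    b = rng.lognormal(mu_l, np.sqrt(s2), size=(draws, n))
    return np.exp(-b * t).prod(axis=1).mean()

# Hypothetical values: budget of 10 h, 1 h travel per visit, rate 0.5/h, CV = 1
x, c, mean, sd = 10.0, 1.0, 0.5, 0.5
qs = {n: expected_q_lognormal(n, x, c, mean, sd) for n in range(1, 10)}
best_n = min(qs, key=qs.get)
print(best_n, round(qs[best_n], 3))
```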

Now, I really liked that paper. In fact, the idea that we can find a simple solution that approximately optimizes a rather complicated integral is appealing. However, it doesn’t get us on the path to optimizing effort over space and time; it is restricted to finding the optimal solution when surveying one site with a given budget. For the multiple site case, we need to determine the optimal budget for each site, and determine for which sites that optimal budget is zero – the sites we shouldn’t bother visiting. The solution above results in equations that seem much too complicated for that task if we want to find analytical solutions.

However, a much simpler equation arises when the detection rate follows a gamma distribution, another distribution that is commonly used to model variables that are constrained to be positive, and which is somewhat substitutable for a log-normal.

At this point, Alana Moore and I started working on the following solution in unison. I was in Melbourne, and as I worked on it, Alana was sleeping in Toulouse. I emailed Alana about my progress before bed, and then Alana tackled the problem as I slept. It was incredibly efficient, and we cracked the solution in a matter of days. Here’s how it works…

Now, if detection rates follow a gamma distribution, that rather complicated integral that we see above is replaced by an integral that can be simplified to a hyperbolic function.

E[q] = (1 + θ(x/n − c))^(−nk).

Here, θ and k are parameters of the gamma distribution that depend on the mean and standard deviation of the detection rate (θ = σ²/μ, the variance:mean ratio; and k = μ²/σ²), while the search budget (x), travel cost (c) and number of visits (n) are the same as defined previously.
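This closed form makes a numerical search over the number of visits trivial. A minimal sketch, with hypothetical parameter values:

```python
def expected_q_gamma(n, x, c, mean, sd):
    """Closed-form expected probability of failed detection over n visits
    when detection rates follow a gamma distribution:
    E[q] = (1 + theta*(x/n - c))**(-n*k),
    with theta = sd**2/mean (variance:mean ratio) and k = mean**2/sd**2."""
    t = x / n - c
    if t <= 0:
        return 1.0  # travel costs exhaust the budget
    theta = sd**2 / mean
    k = mean**2 / sd**2
    return (1.0 + theta * t) ** (-n * k)

# Hypothetical values: budget 10 h, 1 h travel per visit, mean rate 0.5/h, CV = 1
x, c, mean, sd = 10.0, 1.0, 0.5, 0.5
qs = {n: expected_q_gamma(n, x, c, mean, sd) for n in range(1, 10)}
best_n = min(qs, key=qs.get)
print(best_n)  # optimal number of visits for these values
```

Note how the optimum balances the two pressures described above: more visits hedge against visit-to-visit variation in detection, but each extra visit burns another travel cost c.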

With that expression, we can derive the number of visits that minimizes the probability of failed detection, given a particular search budget x. I won’t give that function here – it isn’t really needed, but you can see it in the paper.

But if we substitute that optimal number of visits back into the hyperbolic function, we can derive the smallest possible probability of failed detection. That is equal to:

E[q]min = exp(-hx),

where h is a scaling factor. This is interesting in two ways. Firstly, the probability of failed detection, when optimized for the number of surveys, essentially has the same functional form as the model in Hauser and McCarthy (2009).

Secondly, the functional form of the scaling factor h is also interesting. It is simply a declining function of the product of c and the variance:mean ratio of the detection rate (θ).


The relationship between the scaling factor in the exponential model (h) and the product of the travel cost (c) and the variance:mean ratio of the detection rate (θ). Both c and θ act to reduce the effective detection rate (i.e., h decreases).

Therefore, we can see that the combination of travel costs and variation in detection rate acts to reduce the average rate of detection (μ) and that this influence is simply a function of the product of two terms (c and θ).

This outcome was particularly exciting – look at the equation again for the expected probability of failed detection:

E[q]min = exp(-hx).

Alana noticed this functional form first – it is exciting because this is the same functional form as the model in Hauser and McCarthy (2009). Given that, it seemed that we might simply plug the modified function into the optimization machinery of Hauser and McCarthy (2009), and we could optimize searches over both space and time.

At this point, I felt like this:


However, we had a couple of snags to overcome first. The first one – a relatively minor issue – was that some solutions for the optimal number of searches led to results that were less than one; these results are untenable, so we needed to add a constraint that the smallest number of searches was zero or one.

The next issue was more important. The optimization of Hauser and McCarthy (2009) relies on a simple ranking scheme to choose which sites are surveyed. This ranking is possible, in part, because any search effort at a site is spent immediately on searching, not travelling; the optimal solution is either to not search the site or to spend some positive amount of effort searching it. However, in the presence of travel costs, the initial part of the search effort allocated to a site will be spent on travelling to it, so we shouldn’t bother searching a site unless its allocated effort is greater than the travel cost.

This consideration introduces a discontinuity into the solution – we either don’t search a site (x*=0), or we spend effort such that we allocate more than the travel cost on the site (x*>c). That discontinuity means that we cannot use the ranking approach of Hauser and McCarthy (2009) to determine which sites to survey.

Determining which sites to survey, in the absence of a simplifying algorithm, is a somewhat daunting task. Consider the case where we need to decide which of 300 sites to survey. In that case, there are 2^300 different combinations of sites that can be surveyed.

Now 2 to the power of 300 is a very large number. In fact, it is approximately 2 × 10^90. To give you an idea of how large 2 × 10^90 is, we can compare it to the number of protons in the observable universe, which is approximately 10^80 (apparently).

That is, searching through the combination of possible search strategies is approximately 20 billion times more involved than counting every proton in the observable universe. That is more difficult than I would have hoped.
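These magnitudes are easy to verify directly, since Python integers have arbitrary precision:

```python
# Check the back-of-envelope figures exactly.
n_strategies = 2 ** 300          # subsets of 300 sites
protons = 10 ** 80               # rough number of protons in the universe

print(f"{n_strategies:.3e}")     # about 2.037e+90
print(n_strategies // protons)   # -> 20370359763, i.e. about 20 billion
```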

At this point, I was feeling more like this:


Surely there had to be a better way. It seemed unlikely that this problem hadn’t been tackled before. Indeed, it turns out that the original solution that Cindy and I obtained is mathematically identical to the problem of searching for submarines, something that was solved by Bernard Koopman during World War II, and declassified in the 1950s. Surely the issue of fixed costs had been added already.

So, we started searching the Operations Research literature. The key was finding this paper by Vaibhav Srivastava and Francesco Bullo. I’ll admit that I didn’t understand it on first reading. In fact, I thought I didn’t understand it at all. But after a restless night in which I mulled over the problem in semi-sleep, I developed an algorithm that is essentially the same as in the Srivastava and Bullo paper. Brains sometimes work in strange ways (or at least mine does sometimes) – I must have gleaned enough from the Srivastava and Bullo paper on first reading to essentially replicate the idea, but I still don’t quite know how (the cases are a little different, because we have a functional form that is not quite the same shape as imagined by Srivastava and Bullo).

The algorithm works by realising that, at the optimal solution, the marginal benefit of searching is the same across all sites that are searched. If we choose a particular marginal benefit, then we can select sites on the basis of search efficiency, which can be measured as the reduction in the probability of failed detection divided by the total search effort for that site. Using a knapsack optimization approach, we rank sites according to this criterion, and then find the set of sites that fills the search budget (fills the knapsack). This ranking of sites will not find the optimal solution perfectly (because the items in the knapsack are discrete), but it works well.

Then, it is simply a matter of searching over the range of possible marginal benefits to find the optimal solution – we find the marginal benefit that leads to the largest reduction in missed detections when summed across all sites. This reduces the mind-bogglingly-large m-dimensional optimization problem (m being the number of sites that might be surveyed) to a one-dimensional problem – much more tractable.
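To make the idea concrete, here is a simplified sketch of that one-dimensional search (not the exact algorithm from the paper). It assumes the failed-detection model q_i(x) = exp(−h_i(x − c_i)) for x > c_i, so the benefit of spending effort x at site i is p_i(1 − exp(−h_i(x − c_i))); all parameter values are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 30                            # number of candidate sites (hypothetical)
p = rng.uniform(0.2, 0.9, m)      # probability the species is present
h = rng.uniform(0.3, 1.5, m)      # effective detection rate at each site
c = rng.uniform(0.2, 1.0, m)      # travel cost of visiting each site
B = 20.0                          # total search budget

def plan_for_lambda(lam):
    """Select sites assuming the marginal benefit of search equals lam.

    Setting d/dx [p*(1 - exp(-h*(x - c)))] = lam gives the effort below."""
    x = c + np.log(p * h / lam) / h
    valid = p * h > lam                     # sites worth any effort at this lam
    benefit = np.where(valid, p * (1.0 - np.exp(-h * (x - c))), 0.0)
    eff = np.zeros(m)
    eff[valid] = benefit[valid] / x[valid]  # efficiency: benefit per unit effort
    total, spent, chosen = 0.0, 0.0, []
    for i in np.argsort(-eff):              # knapsack-style fill, best first
        if valid[i] and spent + x[i] <= B:
            chosen.append(int(i))
            spent += x[i]
            total += benefit[i]
    return total, chosen

# The m-dimensional problem reduces to a 1-D search over lam
best_total, best_sites = max(
    (plan_for_lambda(lam) for lam in np.geomspace(1e-3, 1.0, 200)),
    key=lambda t: t[0])
print(f"expected detections {best_total:.2f} from {len(best_sites)} sites")
```

As in the paper’s algorithm, the outer loop only ever tunes a single number (the marginal benefit), so the combinatorial explosion over site subsets never arises.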

In our new paper, we compared the solution from this algorithm to the true optimal case for a wide range of different parameter values, and show that the algorithm works well – it finds solutions for which the outcomes are very close to optimal.

So, that means we can determine the optimal allocation of search effort in space and time that finds the most occurrences of species in a landscape where sites vary in:

the probabilities of presence (or abundance) of the species;

the mean and standard deviation of the rate of detection; and

the cost of travel to the sites.

We are currently working on a few more tweaks where we relax another assumption or two of the Hauser and McCarthy model. But that is for further research…

If you want to read the paper that describes our new approach, see here for the journal website (requires subscription), or you can get a copy of the author-submitted version for free here.

Posted in CEED, Detectability, Ecological models, New research, Probability and Bayesian analysis

Postdoctoral Opportunities with the NESP Clean Air and Urban Landscapes Hub (CAUL)

The University of Melbourne node of the NESP Clean Air and Urban Landscapes Hub is advertising four Research Fellow (Level A) positions. They are seeking candidates with expertise in amphibian ecology, urban ecology, urban greening and environmental psychology to contribute to research within the hub. The amphibian ecology position forms part of a collaboration with the NESP Threatened Species Recovery Hub, and will be jointly based in the School of Ecosystem and Forest Sciences and the School of BioSciences. The other three positions will be based in the School of Ecosystem and Forest Sciences. The positions are fixed term until December 2017, and applications close on January 6th 2016. More information and full position descriptions can be found at:





Posted in Jobs

Ecology and Environmental Sciences star in ERA15

The results of the latest assessment of research excellence in Australia have been released. Now, every university will spin the results to suit their own purpose*. While we can leave universities to report their results so they appear to shine in the best possible light, it would be interesting to see how different research fields performed.

Which are Australia’s strongest research fields? The broad research fields with the most universities rated above world standard are “Medical and Health Sciences” and “Environmental Science”.


The number of Australian universities “above” or “well above” world standard as rated by the Excellence in Research for Australia process in 2015 for each of the 22 broad research fields.

Following them are “Chemical Sciences”, “Biological Sciences”, and “Engineering”, with “Mathematical Sciences”, “Agricultural and Veterinary Sciences”, and “History and Archaeology” not far behind.

Within the two strong fields in which I am involved (Environmental Sciences and Biological Sciences), the strongest areas are “Environmental Science and Management” and “Ecology”. Somewhat perversely, “Ecological Applications” is separated from “Ecology” – many of the publications assigned to one during the ERA process could just as easily have been assigned to the other. However, it is clear that Environmental Sciences and Ecology are two of Australia’s strongest research fields.


The number of Australian universities “above” or “well above” world standard as assessed by the Excellence in Research for Australia process in 2015 for the disciplines within the Environmental Sciences and Biological Sciences fields.

This strength of Environmental Sciences and Ecology in Australia is also reflected in Australia’s representation in the list of the most highly-cited authors in the Thomson-Reuters list, something I’ve noted previously.


The proportion of the world’s most highly-cited scientists within each of Thomson-Reuters’ 21 research categories who have their primary affiliation in Australia. The field of Environment/Ecology tops the list for Australia – approximately 1 in 12 of the world’s most highly-cited ecologists/environmental scientists are Australian.

Some other interesting data exist in ERA15, such as total research funding (see below). With that much research funding, you’d hope medical research in Australia would perform well!

But in terms of bang for buck, it is hard to go past some other fields, such as mathematics, environmental sciences and history.


Research funding to different fields for the three years 2011-2013 as reported for ERA15. Ecology makes up about 15% of the research income in Biological Sciences – a touch under $50 million annually.

So, while the Environmental Sciences and Ecology are not the most heavily funded, they are two of Australia’s strongest research fields. Not only that, this research, conducted across many of Australia’s universities, has a large impact, helping manage Australia’s and the world’s environment more efficiently.

So people, let’s recognize the excellence of environmental and ecological research that occurs across Australia!

* Footnote: ANU even devised their own method for ranking institutions, and you won’t be surprised to know that (judged by their own criteria) ANU won. That outcome was parroted by Campus Review, which failed to note that at least one other university out-performed ANU on at least one of their criteria (the proportion of broad research fields rated above world standard). Universities love playing the ranking game, but I’m surprised a news outlet would publish claims without checking them.


Posted in Communication

Lecturer in Ecological Modelling

Come and work with us! We’re looking for an outstanding academic to join QAECO within the School of BioSciences. We are particularly interested in applicants with expertise in modelling the distributions of species or biodiversity, or more generally in spatial modelling, working with Dr Jane Elith and others within QAECO.

The closing date for applications is 25 August. Information about the position and how to apply is available at:

Please consider applying. Also, please spread the word by drawing this opportunity to the attention of potential applicants.

Posted in Communication, Jobs

Alpine Grazing Update

My blog has been a little quiet lately. But I’ve written a few things elsewhere. The latest is an update on the issue of cattle grazing in the Alpine National Park. You can read that over at The Conversation – an article I wrote with Libby Rumpff and Georgia Garrard.

Posted in Cattle grazing in the Alpine National Park