Election fever hits again

In the 2013 election, I took some interest in the election result in Indi, a seat located in the north-east of Victoria. My interest was spurred by the chance that Sophie Mirabella, who was flagged to be the next Science Minister if the Liberal-National coalition won government, might be usurped by Cathy McGowan, an independent candidate. Also, I have some relatives in that part of the world, so I was interested to know who would be their local representative in what turned out to be a very close election.

I enjoyed trying to predict the outcome of the election in Indi, as counting continued over a matter of days. You can see an account of my efforts here. (As an aside, this is the most-read post on my blog – I have an alternative career option should I give up ecology!).

In predicting the winner of the election, there are two main unknowns that need to be determined – how the preferences are flowing to the two leading candidates, and  whether the swing in votes is sufficient to usurp the sitting member.

Australia uses a preferential voting system. Voters select their preferred candidate in the seat for the House of Representatives, then their second preference, third preference, etc, until the voter has indicated their preferences for all candidates in the seat.

The initial counting of votes tallies these first preferences for each candidate. Then, the ballot papers of the candidate with the fewest votes are distributed to the other candidates based on the second preferences on those ballot papers. So if we had five candidates initially, the possible winners are narrowed down to four, and the ballot papers of the fifth candidate are then allocated to the remaining four candidates based on the second preferences.

Then, the ballot papers of the candidate with the fewest votes are distributed among the other three. This process continues until we have only two candidates remaining, at which point we have the two-candidate-preferred vote. After this point, the candidate with the most votes wins.

In trying to predict the winner of an election, a key part is predicting how the preferences will flow to the two leading candidates. The Australian Electoral Commission provides updates on first preference counts initially, and then two-candidate-preferred counts as they are completed. Because the two-candidate-preferred counts lag behind the first preference counts, it would be useful  to predict preference flows. If preferences have been counted for a sample of booths, it is possible to model the flow of preferences – here is one way to do that.

Let’s look at the first preference counts and two-candidate-preferred counts for a few booths in the seat of Indi from the 2013 election:

Booth
Candidate Alex-
andra
Badda-ginnie Barra-
duda
Barna-wartha Beech-
worth
R. Dudley (Rise Up Aust) 13 6 6 8 12
C. McGowan (Independent) 216 39 381 248 785
R. Leeworthy (Family First) 30 5 19 11 11
S. Mirabella (Liberal) 715 52 366 230 420
H Aschenbrenner (Sex Party) 23 3 11 5 12
W. Hayes (Bullet Train) 6 0 5 7 1
R. Walsh (ALP) 251 7 76 52 145
J. O’Connor (The Greens) 63 2 23 8 105
P. Rourke (Katter) 2 2 3 2 8
R. Murphy (PUP) 54 4 31 14 16
J. Podesta 6 0 8 5 13
2CP McGowan (Independent) 567 54 519 339 1,069
2CP Mirabella (Liberal) 812 66 410 251 459
Total preference flows 448 29 182 112 323
Fraction to McGowan 0.783482 0.517241 0.758242 0.8125 0.879257

We can see that in the Alexandra booth, Cathy McGowan only won 216 first preference votes, compared to Sophie Mirabella’s 715. But the 448 votes of remaining candidates flowed distinctly towards McGowan – on more than 78% of those ballot papers, McGowan was preferenced ahead of Mirabella, so she collected those preferences.

The flow of preferences was even stronger in Beechworth, where McGowan won almost 88% of the distributed preferences, but she got less than 52% of the preferences in Baddaginnie. You might notice a big difference between Beechworth and Baddaginnie in the first preferences. For example, the Greens won almost 7% of first preferences in Beechworth but less than 2% of first preferences in Baddaginnie.

We can model this flow of preferences as a function of the first preferences to predict the two-candidate-preferred vote from first preferences. Here, we are essentially aiming to predict the fraction of votes that flow from the first preferences of the other candidates to the two leading candidates.

We can build this model using linear regression, but we would like to constrain the model coefficients such that they are between zero and one; the coefficients estimate the proportion of voters whose preferenced one of the 2CP candidates ahead of the other.

If we take the data from all of Indi’s 103 booths (and also the postal, early, provisional, and absentee votes), then our model results look like this:

Observed preference flows to Cathy McGowan versus fitted preference flows based on the 2013 federal election results.

Observed preference flows to Cathy McGowan versus fitted preference flows for each of the booths (and the non-ordinary votes) based on the 2013 federal election results.

Let’s look at the model coefficients:

0.623 : DUDLEY, Robert ( Rise Up Australia Party ), 985 votes
0.662 : LEEWORTHY, Rick ( Family First Party ), 1330 votes
0.383 : ASCHENBRENNER, Helma ( Sex Party ), 1402 votes
0.000 : HAYES, William ( Bullet Train For Australia ), 489 votes
0.992 :  WALSH, Robyn ( Australian Labor Party ), 10375 votes
0.700 : O’CONNOR, Jenny ( The Greens ), 3041
0.008 : ROURKE, Phil ( Katter’s Australian Party ), 615 votes
0.680 : MURPHY, Robert Denis ( Palmer United Party ), 2417 votes
1.000 : PODESTA, Jennifer ( Independent ), 841 votes

These coefficients estimate that McGowan was preferenced behind Mirabella by almost all voters who put Hayes first (the estimated coefficient is zero), but she was placed ahead of Mirabella by almost everyone who put Walsh first (the estimated coefficient is 0.992).

The graph shows that this pattern of preference flows as a function of first preferences is very consistent, at least in Indi. In some other electorates, it is not so consistent. Here are the model results for the seat of Batman, which will be hotly contested in 2016 between the Greens and the ALP:

BatmanPrefFlow

Observed preference flows to Alexandra Bhathal (a Greens candidate) versus fitted preference flows for each of the booths in the seat of Batman (and the non-ordinary votes) based on the 2013 federal election results.

The model for Batman doesn’t work quite as well, largely because Bhathal received a greater flow of preferences from the non-ordinary votes (orange symbols in the figure) than from the ordinary votes. These non-ordinary votes are the postal votes (Bhathal received almost 1700 of the flowing preferences), absent votes (Bhathal received over 1000 of the flowing preferences), early votes (Bhathal received just under 1000 of the flowing preferences), and provisional votes (there were very few of these).

Interestingly, a similar pattern occurred in Wills, which is another inner Melbourne seat with a Greens candidate – it seems the Greens garnered strong preference flows from the non-ordinary votes in 2013. Whether that will be borne out in 2016 remains to be seen, but strong preference flows will be needed by the Greens if they are to prevail in Batman.

If you’d like to look at preference flows for yourself for different seats in the 2013 election, then you are welcome to use my R code that I wrote – it scrapes the data from the AEC website, runs the model and prints out the result.

The code is best run using the source command in R so that you are prompted to select the seat from the list of lower house seats (or you can just specify the seat number directly from within the code). And please excuse my R coding – I know it is clumsy in places, I am learning, and am yet to figure out R’s data structures properly to do vectorized operations (among other things I don’t understand).

Also, I haven’t checked that this works on all seats – there might be some anomalies that I haven’t accounted for.

suppressWarnings(library(XML))
suppressWarnings(library(nnls))

seatsite="http://results.aec.gov.au/17496/Website/HouseDivisionMenu-17496-NAT.htm" # seats listed here
# seatsite="http://results.aec.gov.au/15508/Website/HouseDivisionMenu-15508-NAT.htm" # 2010 seat

seat.table = readHTMLTable(seatsite, header=F, which=1, stringsAsFactors=F, as.data.frame=TRUE)  # extract the seats

# arranged in 3 columns, with various white spaces, so trim
x <- gsub("\t", "", seat.table[6,]$V1, fixed = TRUE)
x <- gsub("\r", "", x, fixed = TRUE)
x <- gsub("\n\n\n\n\n", "\n", x, fixed = TRUE)
V1 <- strsplit(x, "\n")

x <- gsub("\t", "", seat.table[6,]$V2, fixed = TRUE)
x <- gsub("\r", "", x, fixed = TRUE)
x <- gsub("\n\n\n\n\n", "\n", x, fixed = TRUE)
V2 <- strsplit(x, "\n")

x <- gsub("\t", "", seat.table[6,]$V3, fixed = TRUE)
x <- gsub("\r", "", x, fixed = TRUE)
x <- gsub("\n\n\n\n\n", "\n", x, fixed = TRUE)
V3 <- strsplit(x, "\n")

seat.names <- c(V1[[1]], V2[[1]], V3[[1]])  # combine the three columns into one

seatlinks <- getHTMLLinks(seatsite, relative=FALSE)  # get the links to the sites with seat specific data
prefix <- "http://results.aec.gov.au/17496/Website/"  
# prefix <- "http://results.aec.gov.au/15508/Website/" #2010 results

seatlinks <- paste(prefix, seatlinks[12:161], sep="")  # paste on the rest of the website
seatlinks <- gsub("FirstPrefs", "PollingPlaces", seatlinks)  # we will need polling booth data, so change names a little

nseats <- length(seat.names)

print(seat.names) # print out the seat names, and prompt user to select one

prompt <- paste("Enter number of seat (between 1 and ", nseats, "): ", sep="")
chosen <- as.numeric(readline(prompt))

cat("Selected seat is ", seat.names[chosen], "\n")

places = seatlinks[chosen]  # this is the link for teh chosen seat

places.table = readHTMLTable(places, header=F, which=1, stringsAsFactors=F, as.data.frame=TRUE, skip.rows=c(1,2,3,4,5,6))

places.names <- places.table$V1  # gets the list of booths

placeslinks <- getHTMLLinks(places, relative=FALSE) # get links to booth data
placeslinks <- placeslinks[grep("HousePollingPlaceFirstPrefs", placeslinks)]  # trim off redundant info

nplaces <- length(placeslinks)
places.names <- places.names[1:nplaces]

placeslinks <- paste(prefix, placeslinks, sep="")  # paste on the prefix to booth links

skippedrows <- 1:8  # Need 9 for 2010, header=F; 8 for 2013, header=T

# get info for first booth
firstpref.table = readHTMLTable(placeslinks[1], header=T, which=1, stringsAsFactors=F, skip.rows=skippedrows)

# find number of candidates
ncandidates <- pmatch("......", firstpref.table$V1)-1
if(is.na(ncandidates))
{
  ncandidates <- pmatch("FORMAL", firstpref.table$V1)-1
}

# get candidate names and parties
candidate.names <- firstpref.table$V1[1:ncandidates]
candidate.parties <- firstpref.table$V2[1:ncandidates]

# get two candidate preferred names
twopp.names <- firstpref.table$V1[(nrow(firstpref.table)-2):(nrow(firstpref.table)-1)]

# get arrays ready for data scraping
firstpref <- array(-999, dim=c(nplaces+4, ncandidates))
twopp <- array(-999, dim=c(nplaces+4, 2))
places.total <- array(dim=nplaces+4)

for(i in 1:nplaces)  # for each booth
{
  firstpref.table = readHTMLTable(placeslinks[i], header=T, which=1, stringsAsFactors=F, skip.rows=skippedrows)
  for(j in 1:ncandidates)  # get first preference count for each candidate
  {
    firstpref[i, j] <- as.numeric(gsub(",","", firstpref.table[j, 3]))  # gsub removes commas from string
  }
  places.total[i] <- sum(firstpref[i, 1:ncandidates])  # get total first prefs for each booth

  twopp[i, 1] <- as.numeric(gsub(",","", firstpref.table[(nrow(firstpref.table)-2), 3]))  # get 2CP data
  twopp[i, 2] <- as.numeric(gsub(",","", firstpref.table[(nrow(firstpref.table)-1), 3]))
}

# Now get non-ordinary votes
othervotesite <- gsub("HouseDivisionPollingPlaces", "HouseDivisionFirstPrefsByVoteType", places)
othervotes.table = readHTMLTable(othervotesite, header=T, which=1, stringsAsFactors=F, as.data.frame=TRUE, skip.rows=c(1,2,3,4,5,6))
absent <- as.numeric(gsub(",","", othervotes.table$V5[1:ncandidates]))
provisional <- as.numeric(gsub(",","", othervotes.table$V7[1:ncandidates]))
early <- as.numeric(gsub(",","", othervotes.table$V9[1:ncandidates]))
postal <- as.numeric(gsub(",","", othervotes.table$V11[1:ncandidates]))

firstpref[nplaces+1, ] <- absent
firstpref[nplaces+2, ] <- provisional 
firstpref[nplaces+3, ] <- early 
firstpref[nplaces+4, ] <- postal 

twopp[nplaces+1, 1] <- as.numeric(gsub(",","", othervotes.table[(nrow(othervotes.table)-2), 5]))  # absent
twopp[nplaces+2, 1] <- as.numeric(gsub(",","", othervotes.table[(nrow(othervotes.table)-2), 7]))  # provisional 
twopp[nplaces+3, 1] <- as.numeric(gsub(",","", othervotes.table[(nrow(othervotes.table)-2), 9]))  # early 
twopp[nplaces+4, 1] <- as.numeric(gsub(",","", othervotes.table[(nrow(othervotes.table)-2), 11])) # postal 

twopp[nplaces+1, 2] <- as.numeric(gsub(",","", othervotes.table[(nrow(othervotes.table)-1), 5]))
twopp[nplaces+2, 2] <- as.numeric(gsub(",","", othervotes.table[(nrow(othervotes.table)-1), 7]))
twopp[nplaces+3, 2] <- as.numeric(gsub(",","", othervotes.table[(nrow(othervotes.table)-1), 9])) 
twopp[nplaces+4, 2] <- as.numeric(gsub(",","", othervotes.table[(nrow(othervotes.table)-1), 11])) 

totalfirstprefs <- array(-999, dim=ncandidates)
for(j in 1:ncandidates)
{
  totalfirstprefs[j] <- sum(firstpref[, j]) # count total first prefs for each candidate across all booths
}

twopp.id <- pmatch(twopp.names, candidate.names)  # get id's of 2CP people
twopp.parties <- candidate.parties[twopp.id] # and their parties

nflowed <- twopp[,1] - firstpref[, twopp.id[1]] # number of preferences flowing to 2pp candidate number 1

otherfirst <- firstpref[, -twopp.id]  # get first pref votes for candidates other than 2pp candidates (we know where they go)
othernames <- candidate.names[-twopp.id]
otherparties <- candidate.parties[-twopp.id]
othertotalfirstprefs <- totalfirstprefs[-twopp.id]

# set up model specification for flow of preferences 
starter <- structure(rep(0.5,(ncandidates-2)), names=letters[1:(ncandidates-2)])
lowers <- structure(rep(0,(ncandidates-2)), names=letters[1:(ncandidates-2)])
uppers <- structure(rep(1,(ncandidates-2)), names=letters[1:(ncandidates-2)])

formula <- "nflowed ~ a*otherfirst[, 1]"

for(i in 2:(ncandidates-2))
{
  formula <- paste(formula, " + ", letters[i], "*otherfirst[, ", i, "]", sep="")
}

model <- nls(formula, algorithm="port", start=starter, lower=lowers, upper=uppers)
modelsum <- summary(model)

cat("Estimated flow to: ", twopp.names[1], "(", twopp.parties[1], ")\n")
for(i in 1:(ncandidates-2))
{
  cat(modelsum$parameters[i,1], " of ", othertotalfirstprefs[i], "votes from ", othernames[i], "(", otherparties[i], ")\n")
}

flows <- modelsum$parameters[,1] * othertotalfirstprefs
totflow <- sum(flows)
totflow2 <- sum(othertotalfirstprefs) - totflow

cat("\nEstimated votes to: ", twopp.names[1], "(", twopp.parties[1], "): ", totalfirstprefs[twopp.id[1]]+totflow, "\n")
cat("\nEstimated votes to: ", twopp.names[2], "(", twopp.parties[2], "): ", totalfirstprefs[twopp.id[2]]+totflow2, "\n")

RSS <- sum(residuals(model)^2)  # residual sum of squares
TSS <- sum((nflowed - mean(nflowed ))^2)  # total sum of squares
RSq <- 1 - (RSS/TSS)  # R-squared measure

fitted <- fitted(model)
maxi <- max(nflowed, max(fitted))

plot(fitted[1:nplaces], nflowed[1:nplaces], col="blue", xlab="Fitted preference flow", ylab="Observed preference flow", xlim=c(0,maxi), ylim=c(0,maxi))
points(fitted[(nplaces+1):(nplaces+4)], nflowed[(nplaces+1):(nplaces+4)], col="orange", pch=8)
abline(a=0, b=1)
rsqtxt <- paste("Flow to ", twopp.names[1], "\n", seat.names[chosen], "\nR-squared = ", round(RSq, digits = 4), "\nBlue are ordinary votes for each booth\nOrange are non-ordinary votes", sep="")
text(x=0, y=0.9*(max(nflowed)), labels=rsqtxt, pos=4)

Well, I hope you found that interesting. We’ll see what happens in the 2016 election… I might do something about swings in a second post if I have time.

About Michael McCarthy

I conduct research on environmental decision making and quantitative ecology. My teaching is mainly at post-grad level at The University of Melbourne.
This entry was posted in Communication, Probability and Bayesian analysis and tagged , , , , , , , , , , , , , , . Bookmark the permalink.