In the 2013 election, I took some interest in the election result in Indi, a seat located in the north-east of Victoria. My interest was spurred by the chance that Sophie Mirabella, who was flagged to be the next Science Minister if the Liberal-National coalition won government, might be usurped by Cathy McGowan, an independent candidate. Also, I have some relatives in that part of the world, so I was interested to know who would be their local representative in what turned out to be a very close election.

I enjoyed trying to predict the outcome of the election in Indi, as counting continued over a matter of days. You can see an account of my efforts here. (As an aside, this is the most-read post on my blog – I have an alternative career option should I give up ecology!).

In predicting the winner of the election, there are two main unknowns that need to be determined – how the preferences are flowing to the two leading candidates, and whether the swing in votes is sufficient to usurp the sitting member.

Australia uses a preferential voting system. Voters select their preferred candidate in the seat for the House of Representatives, then their second preference, third preference, etc, until the voter has indicated their preferences for all candidates in the seat.

The initial counting of votes tallies these first preferences for each candidate. Then, the ballot papers of the candidate with the fewest votes are distributed to the other candidates based on the second preferences on those ballot papers. So if we had five candidates initially, the possible winners are narrowed down to four, and the ballot papers of the fifth candidate are then allocated to the remaining four candidates based on the second preferences.

Then, the ballot papers of the candidate with the fewest votes are distributed among the other three. This process continues until we have only two candidates remaining, at which point we have the two-candidate-preferred vote. After this point, the candidate with the most votes wins.

In trying to predict the winner of an election, a key part is predicting how the preferences will flow to the two leading candidates. The Australian Electoral Commission provides updates on first preference counts initially, and then two-candidate-preferred counts as they are completed. Because the two-candidate-preferred counts lag behind the first preference counts, it would be useful to predict preference flows. If preferences have been counted for a sample of booths, it is possible to model the flow of preferences – here is one way to do that.

Let’s look at the first preference counts and two-candidate-preferred counts for a few booths in the seat of Indi from the 2013 election:

Booth |
|||||

Candidate | Alex- andra |
Badda-ginnie | Barra- duda |
Barna-wartha | Beech- worth |

R. Dudley (Rise Up Aust) | 13 | 6 | 6 | 8 | 12 |

C. McGowan (Independent) | 216 | 39 | 381 | 248 | 785 |

R. Leeworthy (Family First) | 30 | 5 | 19 | 11 | 11 |

S. Mirabella (Liberal) | 715 | 52 | 366 | 230 | 420 |

H Aschenbrenner (Sex Party) | 23 | 3 | 11 | 5 | 12 |

W. Hayes (Bullet Train) | 6 | 0 | 5 | 7 | 1 |

R. Walsh (ALP) | 251 | 7 | 76 | 52 | 145 |

J. O’Connor (The Greens) | 63 | 2 | 23 | 8 | 105 |

P. Rourke (Katter) | 2 | 2 | 3 | 2 | 8 |

R. Murphy (PUP) | 54 | 4 | 31 | 14 | 16 |

J. Podesta | 6 | 0 | 8 | 5 | 13 |

2CP McGowan (Independent) | 567 | 54 | 519 | 339 | 1,069 |

2CP Mirabella (Liberal) | 812 | 66 | 410 | 251 | 459 |

Total preference flows | 448 | 29 | 182 | 112 | 323 |

Fraction to McGowan | 0.783482 | 0.517241 | 0.758242 | 0.8125 | 0.879257 |

We can see that in the Alexandra booth, Cathy McGowan only won 216 first preference votes, compared to Sophie Mirabella’s 715. But the 448 votes of remaining candidates flowed distinctly towards McGowan – on more than 78% of those ballot papers, McGowan was preferenced ahead of Mirabella, so she collected those preferences.

The flow of preferences was even stronger in Beechworth, where McGowan won almost 88% of the distributed preferences, but she got less than 52% of the preferences in Baddaginnie. You might notice a big difference between Beechworth and Baddaginnie in the first preferences. For example, the Greens won almost 7% of first preferences in Beechworth but less than 2% of first preferences in Baddaginnie.

We can model this flow of preferences as a function of the first preferences to predict the two-candidate-preferred vote from first preferences. Here, we are essentially aiming to predict the fraction of votes that flow from the first preferences of the other candidates to the two leading candidates.

We can build this model using linear regression, but we would like to constrain the model coefficients such that they are between zero and one; the coefficients estimate the proportion of voters whose preferenced one of the 2CP candidates ahead of the other.

If we take the data from all of Indi’s 103 booths (and also the postal, early, provisional, and absentee votes), then our model results look like this:

Let’s look at the model coefficients:

0.623 : DUDLEY, Robert ( Rise Up Australia Party ), 985 votes

0.662 : LEEWORTHY, Rick ( Family First Party ), 1330 votes

0.383 : ASCHENBRENNER, Helma ( Sex Party ), 1402 votes

0.000 : HAYES, William ( Bullet Train For Australia ), 489 votes

0.992 : WALSH, Robyn ( Australian Labor Party ), 10375 votes

0.700 : O’CONNOR, Jenny ( The Greens ), 3041

0.008 : ROURKE, Phil ( Katter’s Australian Party ), 615 votes

0.680 : MURPHY, Robert Denis ( Palmer United Party ), 2417 votes

1.000 : PODESTA, Jennifer ( Independent ), 841 votes

These coefficients estimate that McGowan was preferenced behind Mirabella by almost all voters who put Hayes first (the estimated coefficient is zero), but she was placed ahead of Mirabella by almost everyone who put Walsh first (the estimated coefficient is 0.992).

The graph shows that this pattern of preference flows as a function of first preferences is very consistent, at least in Indi. In some other electorates, it is not so consistent. Here are the model results for the seat of Batman, which will be hotly contested in 2016 between the Greens and the ALP:

The model for Batman doesn’t work quite as well, largely because Bhathal received a greater flow of preferences from the non-ordinary votes (orange symbols in the figure) than from the ordinary votes. These non-ordinary votes are the postal votes (Bhathal received almost 1700 of the flowing preferences), absent votes (Bhathal received over 1000 of the flowing preferences), early votes (Bhathal received just under 1000 of the flowing preferences), and provisional votes (there were very few of these).

Interestingly, a similar pattern occurred in Wills, which is another inner Melbourne seat with a Greens candidate – it seems the Greens garnered strong preference flows from the non-ordinary votes in 2013. Whether that will be borne out in 2016 remains to be seen, but strong preference flows will be needed by the Greens if they are to prevail in Batman.

If you’d like to look at preference flows for yourself for different seats in the 2013 election, then you are welcome to use my R code that I wrote – it scrapes the data from the AEC website, runs the model and prints out the result.

The code is best run using the source command in R so that you are prompted to select the seat from the list of lower house seats (or you can just specify the seat number directly from within the code). And please excuse my R coding – I know it is clumsy in places, I am learning, and am yet to figure out R’s data structures properly to do vectorized operations (among other things I don’t understand).

Also, I haven’t checked that this works on all seats – there might be some anomalies that I haven’t accounted for.

suppressWarnings(library(XML)) suppressWarnings(library(nnls)) seatsite="http://results.aec.gov.au/17496/Website/HouseDivisionMenu-17496-NAT.htm" # seats listed here # seatsite="http://results.aec.gov.au/15508/Website/HouseDivisionMenu-15508-NAT.htm" # 2010 seat seat.table = readHTMLTable(seatsite, header=F, which=1, stringsAsFactors=F, as.data.frame=TRUE) # extract the seats # arranged in 3 columns, with various white spaces, so trim x <- gsub("\t", "", seat.table[6,]$V1, fixed = TRUE) x <- gsub("\r", "", x, fixed = TRUE) x <- gsub("\n\n\n\n\n", "\n", x, fixed = TRUE) V1 <- strsplit(x, "\n") x <- gsub("\t", "", seat.table[6,]$V2, fixed = TRUE) x <- gsub("\r", "", x, fixed = TRUE) x <- gsub("\n\n\n\n\n", "\n", x, fixed = TRUE) V2 <- strsplit(x, "\n") x <- gsub("\t", "", seat.table[6,]$V3, fixed = TRUE) x <- gsub("\r", "", x, fixed = TRUE) x <- gsub("\n\n\n\n\n", "\n", x, fixed = TRUE) V3 <- strsplit(x, "\n") seat.names <- c(V1[[1]], V2[[1]], V3[[1]]) # combine the three columns into one seatlinks <- getHTMLLinks(seatsite, relative=FALSE) # get the links to the sites with seat specific data prefix <- "http://results.aec.gov.au/17496/Website/" # prefix <- "http://results.aec.gov.au/15508/Website/" #2010 results seatlinks <- paste(prefix, seatlinks[12:161], sep="") # paste on the rest of the website seatlinks <- gsub("FirstPrefs", "PollingPlaces", seatlinks) # we will need polling booth data, so change names a little nseats <- length(seat.names) print(seat.names) # print out the seat names, and prompt user to select one prompt <- paste("Enter number of seat (between 1 and ", nseats, "): ", sep="") chosen <- as.numeric(readline(prompt)) cat("Selected seat is ", seat.names[chosen], "\n") places = seatlinks[chosen] # this is the link for teh chosen seat places.table = readHTMLTable(places, header=F, which=1, stringsAsFactors=F, as.data.frame=TRUE, skip.rows=c(1,2,3,4,5,6)) places.names <- places.table$V1 # gets the list of booths placeslinks <- getHTMLLinks(places, relative=FALSE) # get links to booth data placeslinks <- placeslinks[grep("HousePollingPlaceFirstPrefs", placeslinks)] # trim off redundant info nplaces <- length(placeslinks) places.names <- places.names[1:nplaces] placeslinks <- paste(prefix, placeslinks, sep="") # paste on the prefix to booth links skippedrows <- 1:8 # Need 9 for 2010, header=F; 8 for 2013, header=T # get info for first booth firstpref.table = readHTMLTable(placeslinks[1], header=T, which=1, stringsAsFactors=F, skip.rows=skippedrows) # find number of candidates ncandidates <- pmatch("......", firstpref.table$V1)-1 if(is.na(ncandidates)) { ncandidates <- pmatch("FORMAL", firstpref.table$V1)-1 } # get candidate names and parties candidate.names <- firstpref.table$V1[1:ncandidates] candidate.parties <- firstpref.table$V2[1:ncandidates] # get two candidate preferred names twopp.names <- firstpref.table$V1[(nrow(firstpref.table)-2):(nrow(firstpref.table)-1)] # get arrays ready for data scraping firstpref <- array(-999, dim=c(nplaces+4, ncandidates)) twopp <- array(-999, dim=c(nplaces+4, 2)) places.total <- array(dim=nplaces+4) for(i in 1:nplaces) # for each booth { firstpref.table = readHTMLTable(placeslinks[i], header=T, which=1, stringsAsFactors=F, skip.rows=skippedrows) for(j in 1:ncandidates) # get first preference count for each candidate { firstpref[i, j] <- as.numeric(gsub(",","", firstpref.table[j, 3])) # gsub removes commas from string } places.total[i] <- sum(firstpref[i, 1:ncandidates]) # get total first prefs for each booth twopp[i, 1] <- as.numeric(gsub(",","", firstpref.table[(nrow(firstpref.table)-2), 3])) # get 2CP data twopp[i, 2] <- as.numeric(gsub(",","", firstpref.table[(nrow(firstpref.table)-1), 3])) } # Now get non-ordinary votes othervotesite <- gsub("HouseDivisionPollingPlaces", "HouseDivisionFirstPrefsByVoteType", places) othervotes.table = readHTMLTable(othervotesite, header=T, which=1, stringsAsFactors=F, as.data.frame=TRUE, skip.rows=c(1,2,3,4,5,6)) absent <- as.numeric(gsub(",","", othervotes.table$V5[1:ncandidates])) provisional <- as.numeric(gsub(",","", othervotes.table$V7[1:ncandidates])) early <- as.numeric(gsub(",","", othervotes.table$V9[1:ncandidates])) postal <- as.numeric(gsub(",","", othervotes.table$V11[1:ncandidates])) firstpref[nplaces+1, ] <- absent firstpref[nplaces+2, ] <- provisional firstpref[nplaces+3, ] <- early firstpref[nplaces+4, ] <- postal twopp[nplaces+1, 1] <- as.numeric(gsub(",","", othervotes.table[(nrow(othervotes.table)-2), 5])) # absent twopp[nplaces+2, 1] <- as.numeric(gsub(",","", othervotes.table[(nrow(othervotes.table)-2), 7])) # provisional twopp[nplaces+3, 1] <- as.numeric(gsub(",","", othervotes.table[(nrow(othervotes.table)-2), 9])) # early twopp[nplaces+4, 1] <- as.numeric(gsub(",","", othervotes.table[(nrow(othervotes.table)-2), 11])) # postal twopp[nplaces+1, 2] <- as.numeric(gsub(",","", othervotes.table[(nrow(othervotes.table)-1), 5])) twopp[nplaces+2, 2] <- as.numeric(gsub(",","", othervotes.table[(nrow(othervotes.table)-1), 7])) twopp[nplaces+3, 2] <- as.numeric(gsub(",","", othervotes.table[(nrow(othervotes.table)-1), 9])) twopp[nplaces+4, 2] <- as.numeric(gsub(",","", othervotes.table[(nrow(othervotes.table)-1), 11])) totalfirstprefs <- array(-999, dim=ncandidates) for(j in 1:ncandidates) { totalfirstprefs[j] <- sum(firstpref[, j]) # count total first prefs for each candidate across all booths } twopp.id <- pmatch(twopp.names, candidate.names) # get id's of 2CP people twopp.parties <- candidate.parties[twopp.id] # and their parties nflowed <- twopp[,1] - firstpref[, twopp.id[1]] # number of preferences flowing to 2pp candidate number 1 otherfirst <- firstpref[, -twopp.id] # get first pref votes for candidates other than 2pp candidates (we know where they go) othernames <- candidate.names[-twopp.id] otherparties <- candidate.parties[-twopp.id] othertotalfirstprefs <- totalfirstprefs[-twopp.id] # set up model specification for flow of preferences starter <- structure(rep(0.5,(ncandidates-2)), names=letters[1:(ncandidates-2)]) lowers <- structure(rep(0,(ncandidates-2)), names=letters[1:(ncandidates-2)]) uppers <- structure(rep(1,(ncandidates-2)), names=letters[1:(ncandidates-2)]) formula <- "nflowed ~ a*otherfirst[, 1]" for(i in 2:(ncandidates-2)) { formula <- paste(formula, " + ", letters[i], "*otherfirst[, ", i, "]", sep="") } model <- nls(formula, algorithm="port", start=starter, lower=lowers, upper=uppers) modelsum <- summary(model) cat("Estimated flow to: ", twopp.names[1], "(", twopp.parties[1], ")\n") for(i in 1:(ncandidates-2)) { cat(modelsum$parameters[i,1], " of ", othertotalfirstprefs[i], "votes from ", othernames[i], "(", otherparties[i], ")\n") } flows <- modelsum$parameters[,1] * othertotalfirstprefs totflow <- sum(flows) totflow2 <- sum(othertotalfirstprefs) - totflow cat("\nEstimated votes to: ", twopp.names[1], "(", twopp.parties[1], "): ", totalfirstprefs[twopp.id[1]]+totflow, "\n") cat("\nEstimated votes to: ", twopp.names[2], "(", twopp.parties[2], "): ", totalfirstprefs[twopp.id[2]]+totflow2, "\n") RSS <- sum(residuals(model)^2) # residual sum of squares TSS <- sum((nflowed - mean(nflowed ))^2) # total sum of squares RSq <- 1 - (RSS/TSS) # R-squared measure fitted <- fitted(model) maxi <- max(nflowed, max(fitted)) plot(fitted[1:nplaces], nflowed[1:nplaces], col="blue", xlab="Fitted preference flow", ylab="Observed preference flow", xlim=c(0,maxi), ylim=c(0,maxi)) points(fitted[(nplaces+1):(nplaces+4)], nflowed[(nplaces+1):(nplaces+4)], col="orange", pch=8) abline(a=0, b=1) rsqtxt <- paste("Flow to ", twopp.names[1], "\n", seat.names[chosen], "\nR-squared = ", round(RSq, digits = 4), "\nBlue are ordinary votes for each booth\nOrange are non-ordinary votes", sep="") text(x=0, y=0.9*(max(nflowed)), labels=rsqtxt, pos=4)

Well, I hope you found that interesting. We’ll see what happens in the 2016 election… I might do something about swings in a second post if I have time.