In the 2013 election, I took some interest in the election result in Indi, a seat located in the northeast of Victoria. My interest was spurred by the chance that Sophie Mirabella, who was flagged to be the next Science Minister if the LiberalNational coalition won government, might be usurped by Cathy McGowan, an independent candidate. Also, I have some relatives in that part of the world, so I was interested to know who would be their local representative in what turned out to be a very close election.
I enjoyed trying to predict the outcome of the election in Indi, as counting continued over a matter of days. You can see an account of my efforts here. (As an aside, this is the mostread post on my blog – I have an alternative career option should I give up ecology!).
In predicting the winner of the election, there are two main unknowns that need to be determined – how the preferences are flowing to the two leading candidates, and whether the swing in votes is sufficient to usurp the sitting member.
Australia uses a preferential voting system. Voters select their preferred candidate in the seat for the House of Representatives, then their second preference, third preference, etc, until the voter has indicated their preferences for all candidates in the seat.
The initial counting of votes tallies these first preferences for each candidate. Then, the ballot papers of the candidate with the fewest votes are distributed to the other candidates based on the second preferences on those ballot papers. So if we had five candidates initially, the possible winners are narrowed down to four, and the ballot papers of the fifth candidate are then allocated to the remaining four candidates based on the second preferences.
Then, the ballot papers of the candidate with the fewest votes are distributed among the other three. This process continues until we have only two candidates remaining, at which point we have the twocandidatepreferred vote. After this point, the candidate with the most votes wins.
In trying to predict the winner of an election, a key part is predicting how the preferences will flow to the two leading candidates. The Australian Electoral Commission provides updates on first preference counts initially, and then twocandidatepreferred counts as they are completed. Because the twocandidatepreferred counts lag behind the first preference counts, it would be useful to predict preference flows. If preferences have been counted for a sample of booths, it is possible to model the flow of preferences – here is one way to do that.
Let’s look at the first preference counts and twocandidatepreferred counts for a few booths in the seat of Indi from the 2013 election:

Booth 




Candidate 
Alex
andra 
Baddaginnie 
Barra
duda 
Barnawartha 
Beech
worth 
R. Dudley (Rise Up Aust) 
13 
6 
6 
8 
12 
C. McGowan (Independent) 
216 
39 
381 
248 
785 
R. Leeworthy (Family First) 
30 
5 
19 
11 
11 
S. Mirabella (Liberal) 
715 
52 
366 
230 
420 
H Aschenbrenner (Sex Party) 
23 
3 
11 
5 
12 
W. Hayes (Bullet Train) 
6 
0 
5 
7 
1 
R. Walsh (ALP) 
251 
7 
76 
52 
145 
J. O’Connor (The Greens) 
63 
2 
23 
8 
105 
P. Rourke (Katter) 
2 
2 
3 
2 
8 
R. Murphy (PUP) 
54 
4 
31 
14 
16 
J. Podesta 
6 
0 
8 
5 
13 






2CP McGowan (Independent) 
567 
54 
519 
339 
1,069 
2CP Mirabella (Liberal) 
812 
66 
410 
251 
459 






Total preference flows 
448 
29 
182 
112 
323 
Fraction to McGowan 
0.783482 
0.517241 
0.758242 
0.8125 
0.879257 
We can see that in the Alexandra booth, Cathy McGowan only won 216 first preference votes, compared to Sophie Mirabella’s 715. But the 448 votes of remaining candidates flowed distinctly towards McGowan – on more than 78% of those ballot papers, McGowan was preferenced ahead of Mirabella, so she collected those preferences.
The flow of preferences was even stronger in Beechworth, where McGowan won almost 88% of the distributed preferences, but she got less than 52% of the preferences in Baddaginnie. You might notice a big difference between Beechworth and Baddaginnie in the first preferences. For example, the Greens won almost 7% of first preferences in Beechworth but less than 2% of first preferences in Baddaginnie.
We can model this flow of preferences as a function of the first preferences to predict the twocandidatepreferred vote from first preferences. Here, we are essentially aiming to predict the fraction of votes that flow from the first preferences of the other candidates to the two leading candidates.
We can build this model using linear regression, but we would like to constrain the model coefficients such that they are between zero and one; the coefficients estimate the proportion of voters whose preferenced one of the 2CP candidates ahead of the other.
If we take the data from all of Indi’s 103 booths (and also the postal, early, provisional, and absentee votes), then our model results look like this:
Observed preference flows to Cathy McGowan versus fitted preference flows for each of the booths (and the nonordinary votes) based on the 2013 federal election results.
Let’s look at the model coefficients:
0.623 : DUDLEY, Robert ( Rise Up Australia Party ), 985 votes
0.662 : LEEWORTHY, Rick ( Family First Party ), 1330 votes
0.383 : ASCHENBRENNER, Helma ( Sex Party ), 1402 votes
0.000 : HAYES, William ( Bullet Train For Australia ), 489 votes
0.992 : WALSH, Robyn ( Australian Labor Party ), 10375 votes
0.700 : O’CONNOR, Jenny ( The Greens ), 3041
0.008 : ROURKE, Phil ( Katter’s Australian Party ), 615 votes
0.680 : MURPHY, Robert Denis ( Palmer United Party ), 2417 votes
1.000 : PODESTA, Jennifer ( Independent ), 841 votes
These coefficients estimate that McGowan was preferenced behind Mirabella by almost all voters who put Hayes first (the estimated coefficient is zero), but she was placed ahead of Mirabella by almost everyone who put Walsh first (the estimated coefficient is 0.992).
The graph shows that this pattern of preference flows as a function of first preferences is very consistent, at least in Indi. In some other electorates, it is not so consistent. Here are the model results for the seat of Batman, which will be hotly contested in 2016 between the Greens and the ALP:
Observed preference flows to Alexandra Bhathal (a Greens candidate) versus fitted preference flows for each of the booths in the seat of Batman (and the nonordinary votes) based on the 2013 federal election results.
The model for Batman doesn’t work quite as well, largely because Bhathal received a greater flow of preferences from the nonordinary votes (orange symbols in the figure) than from the ordinary votes. These nonordinary votes are the postal votes (Bhathal received almost 1700 of the flowing preferences), absent votes (Bhathal received over 1000 of the flowing preferences), early votes (Bhathal received just under 1000 of the flowing preferences), and provisional votes (there were very few of these).
Interestingly, a similar pattern occurred in Wills, which is another inner Melbourne seat with a Greens candidate – it seems the Greens garnered strong preference flows from the nonordinary votes in 2013. Whether that will be borne out in 2016 remains to be seen, but strong preference flows will be needed by the Greens if they are to prevail in Batman.
If you’d like to look at preference flows for yourself for different seats in the 2013 election, then you are welcome to use my R code that I wrote – it scrapes the data from the AEC website, runs the model and prints out the result.
The code is best run using the source command in R so that you are prompted to select the seat from the list of lower house seats (or you can just specify the seat number directly from within the code). And please excuse my R coding – I know it is clumsy in places, I am learning, and am yet to figure out R’s data structures properly to do vectorized operations (among other things I don’t understand).
Also, I haven’t checked that this works on all seats – there might be some anomalies that I haven’t accounted for.
suppressWarnings(library(XML))
suppressWarnings(library(nnls))
seatsite="http://results.aec.gov.au/17496/Website/HouseDivisionMenu17496NAT.htm" # seats listed here
# seatsite="http://results.aec.gov.au/15508/Website/HouseDivisionMenu15508NAT.htm" # 2010 seat
seat.table = readHTMLTable(seatsite, header=F, which=1, stringsAsFactors=F, as.data.frame=TRUE) # extract the seats
# arranged in 3 columns, with various white spaces, so trim
x < gsub("\t", "", seat.table[6,]$V1, fixed = TRUE)
x < gsub("\r", "", x, fixed = TRUE)
x < gsub("\n\n\n\n\n", "\n", x, fixed = TRUE)
V1 < strsplit(x, "\n")
x < gsub("\t", "", seat.table[6,]$V2, fixed = TRUE)
x < gsub("\r", "", x, fixed = TRUE)
x < gsub("\n\n\n\n\n", "\n", x, fixed = TRUE)
V2 < strsplit(x, "\n")
x < gsub("\t", "", seat.table[6,]$V3, fixed = TRUE)
x < gsub("\r", "", x, fixed = TRUE)
x < gsub("\n\n\n\n\n", "\n", x, fixed = TRUE)
V3 < strsplit(x, "\n")
seat.names < c(V1[[1]], V2[[1]], V3[[1]]) # combine the three columns into one
seatlinks < getHTMLLinks(seatsite, relative=FALSE) # get the links to the sites with seat specific data
prefix < "http://results.aec.gov.au/17496/Website/"
# prefix < "http://results.aec.gov.au/15508/Website/" #2010 results
seatlinks < paste(prefix, seatlinks[12:161], sep="") # paste on the rest of the website
seatlinks < gsub("FirstPrefs", "PollingPlaces", seatlinks) # we will need polling booth data, so change names a little
nseats < length(seat.names)
print(seat.names) # print out the seat names, and prompt user to select one
prompt < paste("Enter number of seat (between 1 and ", nseats, "): ", sep="")
chosen < as.numeric(readline(prompt))
cat("Selected seat is ", seat.names[chosen], "\n")
places = seatlinks[chosen] # this is the link for teh chosen seat
places.table = readHTMLTable(places, header=F, which=1, stringsAsFactors=F, as.data.frame=TRUE, skip.rows=c(1,2,3,4,5,6))
places.names < places.table$V1 # gets the list of booths
placeslinks < getHTMLLinks(places, relative=FALSE) # get links to booth data
placeslinks < placeslinks[grep("HousePollingPlaceFirstPrefs", placeslinks)] # trim off redundant info
nplaces < length(placeslinks)
places.names < places.names[1:nplaces]
placeslinks < paste(prefix, placeslinks, sep="") # paste on the prefix to booth links
skippedrows < 1:8 # Need 9 for 2010, header=F; 8 for 2013, header=T
# get info for first booth
firstpref.table = readHTMLTable(placeslinks[1], header=T, which=1, stringsAsFactors=F, skip.rows=skippedrows)
# find number of candidates
ncandidates < pmatch("......", firstpref.table$V1)1
if(is.na(ncandidates))
{
ncandidates < pmatch("FORMAL", firstpref.table$V1)1
}
# get candidate names and parties
candidate.names < firstpref.table$V1[1:ncandidates]
candidate.parties < firstpref.table$V2[1:ncandidates]
# get two candidate preferred names
twopp.names < firstpref.table$V1[(nrow(firstpref.table)2):(nrow(firstpref.table)1)]
# get arrays ready for data scraping
firstpref < array(999, dim=c(nplaces+4, ncandidates))
twopp < array(999, dim=c(nplaces+4, 2))
places.total < array(dim=nplaces+4)
for(i in 1:nplaces) # for each booth
{
firstpref.table = readHTMLTable(placeslinks[i], header=T, which=1, stringsAsFactors=F, skip.rows=skippedrows)
for(j in 1:ncandidates) # get first preference count for each candidate
{
firstpref[i, j] < as.numeric(gsub(",","", firstpref.table[j, 3])) # gsub removes commas from string
}
places.total[i] < sum(firstpref[i, 1:ncandidates]) # get total first prefs for each booth
twopp[i, 1] < as.numeric(gsub(",","", firstpref.table[(nrow(firstpref.table)2), 3])) # get 2CP data
twopp[i, 2] < as.numeric(gsub(",","", firstpref.table[(nrow(firstpref.table)1), 3]))
}
# Now get nonordinary votes
othervotesite < gsub("HouseDivisionPollingPlaces", "HouseDivisionFirstPrefsByVoteType", places)
othervotes.table = readHTMLTable(othervotesite, header=T, which=1, stringsAsFactors=F, as.data.frame=TRUE, skip.rows=c(1,2,3,4,5,6))
absent < as.numeric(gsub(",","", othervotes.table$V5[1:ncandidates]))
provisional < as.numeric(gsub(",","", othervotes.table$V7[1:ncandidates]))
early < as.numeric(gsub(",","", othervotes.table$V9[1:ncandidates]))
postal < as.numeric(gsub(",","", othervotes.table$V11[1:ncandidates]))
firstpref[nplaces+1, ] < absent
firstpref[nplaces+2, ] < provisional
firstpref[nplaces+3, ] < early
firstpref[nplaces+4, ] < postal
twopp[nplaces+1, 1] < as.numeric(gsub(",","", othervotes.table[(nrow(othervotes.table)2), 5])) # absent
twopp[nplaces+2, 1] < as.numeric(gsub(",","", othervotes.table[(nrow(othervotes.table)2), 7])) # provisional
twopp[nplaces+3, 1] < as.numeric(gsub(",","", othervotes.table[(nrow(othervotes.table)2), 9])) # early
twopp[nplaces+4, 1] < as.numeric(gsub(",","", othervotes.table[(nrow(othervotes.table)2), 11])) # postal
twopp[nplaces+1, 2] < as.numeric(gsub(",","", othervotes.table[(nrow(othervotes.table)1), 5]))
twopp[nplaces+2, 2] < as.numeric(gsub(",","", othervotes.table[(nrow(othervotes.table)1), 7]))
twopp[nplaces+3, 2] < as.numeric(gsub(",","", othervotes.table[(nrow(othervotes.table)1), 9]))
twopp[nplaces+4, 2] < as.numeric(gsub(",","", othervotes.table[(nrow(othervotes.table)1), 11]))
totalfirstprefs < array(999, dim=ncandidates)
for(j in 1:ncandidates)
{
totalfirstprefs[j] < sum(firstpref[, j]) # count total first prefs for each candidate across all booths
}
twopp.id < pmatch(twopp.names, candidate.names) # get id's of 2CP people
twopp.parties < candidate.parties[twopp.id] # and their parties
nflowed < twopp[,1]  firstpref[, twopp.id[1]] # number of preferences flowing to 2pp candidate number 1
otherfirst < firstpref[, twopp.id] # get first pref votes for candidates other than 2pp candidates (we know where they go)
othernames < candidate.names[twopp.id]
otherparties < candidate.parties[twopp.id]
othertotalfirstprefs < totalfirstprefs[twopp.id]
# set up model specification for flow of preferences
starter < structure(rep(0.5,(ncandidates2)), names=letters[1:(ncandidates2)])
lowers < structure(rep(0,(ncandidates2)), names=letters[1:(ncandidates2)])
uppers < structure(rep(1,(ncandidates2)), names=letters[1:(ncandidates2)])
formula < "nflowed ~ a*otherfirst[, 1]"
for(i in 2:(ncandidates2))
{
formula < paste(formula, " + ", letters[i], "*otherfirst[, ", i, "]", sep="")
}
model < nls(formula, algorithm="port", start=starter, lower=lowers, upper=uppers)
modelsum < summary(model)
cat("Estimated flow to: ", twopp.names[1], "(", twopp.parties[1], ")\n")
for(i in 1:(ncandidates2))
{
cat(modelsum$parameters[i,1], " of ", othertotalfirstprefs[i], "votes from ", othernames[i], "(", otherparties[i], ")\n")
}
flows < modelsum$parameters[,1] * othertotalfirstprefs
totflow < sum(flows)
totflow2 < sum(othertotalfirstprefs)  totflow
cat("\nEstimated votes to: ", twopp.names[1], "(", twopp.parties[1], "): ", totalfirstprefs[twopp.id[1]]+totflow, "\n")
cat("\nEstimated votes to: ", twopp.names[2], "(", twopp.parties[2], "): ", totalfirstprefs[twopp.id[2]]+totflow2, "\n")
RSS < sum(residuals(model)^2) # residual sum of squares
TSS < sum((nflowed  mean(nflowed ))^2) # total sum of squares
RSq < 1  (RSS/TSS) # Rsquared measure
fitted < fitted(model)
maxi < max(nflowed, max(fitted))
plot(fitted[1:nplaces], nflowed[1:nplaces], col="blue", xlab="Fitted preference flow", ylab="Observed preference flow", xlim=c(0,maxi), ylim=c(0,maxi))
points(fitted[(nplaces+1):(nplaces+4)], nflowed[(nplaces+1):(nplaces+4)], col="orange", pch=8)
abline(a=0, b=1)
rsqtxt < paste("Flow to ", twopp.names[1], "\n", seat.names[chosen], "\nRsquared = ", round(RSq, digits = 4), "\nBlue are ordinary votes for each booth\nOrange are nonordinary votes", sep="")
text(x=0, y=0.9*(max(nflowed)), labels=rsqtxt, pos=4)
Well, I hope you found that interesting. We’ll see what happens in the 2016 election… I might do something about swings in a second post if I have time.
Like this:
Like Loading...