Monthly Archives: March 2014

Guardian data blog — UK general election analysis in R

The Guardian newspaper has been running a data blog for a few years and has built up a massive repository of (often) well-curated datasets on a huge number of topics. They even have an indexed list of all the datasets they’ve put together or reused in their articles.

It’s a great repository of interesting data for exploratory analysis, and there’s a low barrier to entry in terms of getting the data into a useful form. Here’s an example using UK election polling data collected over the last thirty years.

ICM polling data

The Guardian and ICM Research have conducted monthly polls on voting intentions since 1984, usually with a sample size of between 1,000 and 1,500 people. It’s not made obvious how these polls are conducted (cold-calling?) but, for what it’s worth, ICM is a member of the British Polling Council, and so hopefully tries to monitor and correct for things like the “Shy Tory Factor”: the observation that Conservative voters supposedly have (or had prior to ’92) a greater tendency to conceal their voting intentions than Labour supporters.

Preprocessing

The data is made available by The Guardian as a .csv file via Google spreadsheets here and requires minimal cleanup: cut the source information from the end of the file and you can open it up in R.


sop <- read.csv("StateOfTheParties.csv", stringsAsFactors=FALSE)

## Data cleanup
sop[,2:5] <- apply(sop[,2:5], 2, function(x) as.numeric(gsub("%", "", x)))
sop[,1] <- as.Date(sop[,1], format="%d-%m-%Y")
colnames(sop)[1] <- "Date"

# correct for some rounding errors leading to 101/99 %
sop$rsum <- rowSums(sop[,2:5])
table(sop$rsum)
# rescale each row so that the four party shares sum to 1
sop[,2:5] <- sop[,2:5] / sop$rsum

Then after melting the data.frame down (full code at the end of the post), you can get a quick overview with ggplot2.
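As a rough illustration of the melting step, here is a minimal base-R sketch that reshapes a wide table into long format (one row per date and party), which is the shape ggplot2 works with most easily. The toy values and the column names CON, LAB, LIBDEM and OTHER are invented for illustration; reshape2's melt() would do the same job in one call, and the full script may well differ.

```r
# Toy stand-in for the cleaned `sop` data.frame; values and column names are invented
sop_toy <- data.frame(
  Date   = as.Date(c("2014-01-12", "2014-02-09")),
  CON    = c(0.32, 0.34),
  LAB    = c(0.35, 0.38),
  LIBDEM = c(0.14, 0.10),
  OTHER  = c(0.19, 0.18)
)

parties <- setdiff(colnames(sop_toy), "Date")

# "Melt" by hand: repeat the dates once per party and stack the share columns
sop_long <- data.frame(
  Date  = rep(sop_toy$Date, times = length(parties)),
  party = rep(parties, each = nrow(sop_toy)),
  share = unlist(sop_toy[parties], use.names = FALSE)
)

print(sop_long)
```

Each row of sop_long is now a single (Date, party, share) observation, so party can be mapped straight to a fill or colour aesthetic.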

UK general election overview 1984-2014

Outlines (stacked bars) represent general election results

Election breakdown

The area plot is a nice overview but not that useful quantitatively. Given that the dataset includes general election results as well as opinion polling, it’s straightforward to split the above plot by this important factor. I also found it useful to convert absolute dates to be relative to the election they precede. R has an object class, difftime, which makes this easy to accomplish, and calling as.numeric() on a difftime object converts it to a raw number of days (handily accounting for things like leap years).
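The date conversion can be sketched roughly as follows. The two election dates are real, but the poll dates are invented, and using findInterval() to pick out the next election is just one possible approach rather than necessarily the one in the full script.

```r
# General election dates (sorted); the poll dates below are invented for illustration
elections <- as.Date(c("2005-05-05", "2010-05-06"))
polls     <- as.Date(c("2005-04-05", "2009-05-06", "2010-05-06"))

# Index of the first election falling on or after each poll date
idx <- findInterval(as.numeric(polls), as.numeric(elections), left.open = TRUE) + 1
next_election <- elections[idx]

# Subtracting Dates gives a difftime; as.numeric() turns it into a plain day count
days_before <- as.numeric(next_election - polls, units = "days")
print(days_before)

# Leap years are handled by the Date arithmetic itself
feb_2011 <- as.numeric(as.Date("2011-03-01") - as.Date("2011-02-01"), units = "days")
feb_2012 <- as.numeric(as.Date("2012-03-01") - as.Date("2012-02-01"), units = "days")
```

A poll dated after the last election in the vector gets an NA here, which is a convenient flag for polls that belong to an election that hasn't happened yet.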

These processing steps lead to a clearer graph with more obvious stories, such as the gradual and monotonic decline of support for Labour during the Blair years.

UK general election data split by election period

NB: Facet headers show the year and result of the election relative to which the (preceding) points are plotted.

Next election’s result

I originally wanted to look at this data to get a feel for how things are looking before next year’s (2015) general election, maybe even running some predictive models (obviously I’m no fivethirtyeight.com).

However, graphing the trends of public support for the two main UK parties hints that it’s unlikely to be a fruitful endeavour at this point. With the above graphs showing an ominous increase in support for “other” parties (not accidentally coloured purple), it looks like, with about 400 days to go, the 2015 general election is still all to play for.

Labour and Conservative support trends


Filed under R

What are the most common RNG seeds used in R scripts on Github?

In the R programming language, the random number generator (RNG) is seeded each session using the current time and process ID. Via the magic of the popular Mersenne Twister PRNG, the values stored in .Random.seed are used sequentially each time “randomness” is invoked in a function. This means, of course, that the same function run in different R sessions can produce varying results, and in the case of modelling a system sensitive to initial conditions the observed differences could be huge.

For this reason it’s common to manually set the PRNG seed (using set.seed() in R), thereby creating the same .Random.seed vector which can be drawn from in your analysis to produce reproducible results. The actual value passed to this function is irrelevant for practical purposes (for whatever reason I generally use the same number across projects, 42), so this made me wonder: what values do the major R developers tend to pick?
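A minimal sketch of what seeding buys you, using nothing beyond base R (the variable names are arbitrary):

```r
# Without a fixed seed, two sessions would generally draw different numbers
set.seed(42)
first <- runif(3)

set.seed(42)          # resetting the seed rebuilds the same .Random.seed state
second <- runif(3)

print(identical(first, second))   # same seed, same draws

# A different seed gives a different (but equally reproducible) stream
set.seed(1)
third <- runif(3)
print(identical(first, third))
```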

To Github

The Github API is currently in a transitional period between versions 2 and 3 and has (annoyingly) limited code search results to specific users or organisations. This means to perform a code search programmatically, I’ll need a list of R users.

Google BigQuery

One way of building a list is through Github archive. The dev (Ilya Grigorik) has put up a public dataset with Google BigQuery, which is a neat cloud-based platform for querying huge datasets. My SQL isn’t all that, but the Google BigQuery interface is really functional (e.g. it autocompletes table fields) and makes it easy to get the data you’re looking for.

Fishing for prolific R users via Google BigQuery.

In this case I pulled out R code repositories ordered by repo pushes (as a heuristic for codebase size and activity, I guess) with their owner’s username. It was this list of names I then used for the API query.

Github API

It looks like there’s a decent set of R bindings for the Github API, but it’s not clear how they work with code search, so I opted for the messier option of calling curl through system(). To build the command to search the API per user:

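getCMD <- function(user){
  # oauth is assumed to be an API token string defined in the calling environment
  # each HTTP header needs its own -H flag; the API caps per_page at 100
  cmd <- paste0("curl -H 'Accept: application/vnd.github.v3.text-match+json' ",
                   "-H 'Authorization: token ", oauth, "' ",
                   "'https://api.github.com/search/code?q=set.seed+in:file+language:R+user:",
                   user,"&page=1&per_page=100' | grep 'fragment' -")
  return(cmd)
}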

As you can see this is pretty rough and ready; there may be a pagination issue if someone sets PRNG seeds everywhere, but it’ll do.
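One way around the pagination issue would be to request several pages per user. The sketch below only builds the URLs and makes no requests; search_urls() and its n_pages argument are hypothetical names, and it assumes the search endpoint returns at most 100 results per page, so asking for more in a single request would not help.

```r
# Hypothetical helper: build one code-search URL per results page for a user
search_urls <- function(user, n_pages = 3L, per_page = 100L) {
  sprintf(
    "https://api.github.com/search/code?q=set.seed+in:file+language:R+user:%s&page=%d&per_page=%d",
    user, seq_len(n_pages), as.integer(per_page)
  )
}

urls <- search_urls("some-user", n_pages = 2L)  # "some-user" is a placeholder
print(urls)
```

Each URL could then be dropped into the curl command in place of the hard-coded page=1 query.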

You can then run the command and pull out the returned string matches for the query. In this case I searched for “set.seed” and then used Hadley Wickham’s stringr R package to regexp out the number (if any) passed to the function.

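library(stringr)

getPRNGseeds <- function(user){
  print(user)
  api.result <- system(getCMD(user), intern=TRUE)
  # a non-zero exit status (including grep finding no matches) sets the "status" attribute
  if(is.null(attr(api.result, "status"))) {
    seeds <- cbind(user, 
        str_match(api.result, "set\\.seed\\((\\d+)\\)")[,2])
    Sys.sleep(10)  # pause between requests to avoid hammering the API
    return(seeds)
  } else {
    cat(user, "failed\n")
    return(cbind(user, NA))
  }
}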

The Github API spits out JSON data (ignored and just grepped out in the above) so I looked into a couple of smarter ways of parsing it. Firstly there’s the jsonlite R package which offers the fromJSON() function to import JSON data into what resembles a sometimes-hard-to-work-with nested R object. It seems like the Github API query results return too many nested layers to produce a useful object in this case. Another option is jq, a command-line program which has a neat syntax for dealing specifically with the JSON data structure — I’ll definitely be using it for more complex JSON wrangling in the future.
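For comparison, here is a minimal sketch of pulling the matched fragments out with jsonlite rather than grep. The JSON is a hand-written, heavily trimmed mock of a code search response (real responses carry many more fields per item), and switching off simplification with simplifyVector = FALSE is just one way of keeping the nesting predictable.

```r
library(jsonlite)

# Hand-written mock of a search response; file names and fragments are invented
mock_json <- '{
  "total_count": 2,
  "items": [
    {"name": "a.R", "text_matches": [{"fragment": "set.seed(1)"}]},
    {"name": "b.R", "text_matches": [{"fragment": "set.seed(42)"},
                                     {"fragment": "set.seed(seed)"}]}
  ]
}'

# Keep everything as nested lists rather than letting jsonlite guess at data.frames
res <- fromJSON(mock_json, simplifyVector = FALSE)

# Walk items -> text_matches -> fragment and flatten to a character vector
fragments <- unlist(lapply(res$items, function(item) {
  vapply(item$text_matches, function(tm) tm$fragment, character(1))
}))

print(fragments)
```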

The data

Despite the harsh search limitations I ended up with 27 users who owned the top 100 R repositories, and of those 15 used set.seed() somewhere (or at least something like it). However the regex fails in some cases — where a variable is being passed to the function instead of an integer, for one. Long story short, I scraped together 187 lines of set.seed(\d+) from some of the big names in the R community and here’s how the counts looked:
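To make the regex behaviour concrete, here is a small sketch run on invented fragments. It uses base R's regexec() and regmatches() in place of stringr's str_match() purely to stay dependency-free; the pattern is the same one used above.

```r
# Invented example fragments, including one where a variable is passed
fragments <- c(
  "set.seed(42)",
  "set.seed(seed)",
  "  set.seed(1234)  # for reproducibility",
  "set.seed(42); x <- rnorm(10)"
)

# Same pattern as above: capture the integer literal passed to set.seed()
m <- regmatches(fragments, regexec("set\\.seed\\((\\d+)\\)", fragments, perl = TRUE))

# Non-matching fragments come back empty, so map them to NA
seeds <- vapply(m, function(x) if (length(x) >= 2) x[2] else NA_character_,
                character(1))

# Tally the seeds (table() drops the NAs by default)
counts <- sort(table(seeds), decreasing = TRUE)
print(counts)
```

The second fragment shows the failure case: set.seed(seed) contains no digits to capture, so it ends up as NA and drops out of the counts.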

RNG seeds

So plain old 1 is the stand-out winner!

There are a few sequences in there (123, 321, 1234, 12345) and some date references (2011, 20051028), but surprisingly few programmer in-jokes or web-culture references, save a lone 1337 and I guess some binary.

1410 (or 1014 in less-sensible countries) and 141079 look like they could be a certain R developer’s birthday and birth year, but that’s pure speculation 🙂

Here’s one of those awful wordle / wordcloud things too.

Word cloud of RNG seeds

Hopefully, as the roll-out of the v3 Github API progresses, the current search restriction will be lifted. Still, this was a fun glimpse at other programmers’ conventions!


R script

Filed under R