Tag Archives: ggplot2

Recreating the vaccination heatmaps in R

In February the WSJ graphics team put together a series of interactive visualisations on the impact of vaccination that blew up on twitter and facebook, and were roundly lauded as great-looking and effective dataviz. Some of these had enough data available to look particularly good, such as for the measles vaccine:

Credit to the WSJ and creators: Tynan DeBold and Dov Friedman

How hard would it be to recreate an R version?

Base R version

Quite recently Mick Watson, a computational biologist based here in Edinburgh, put together a base R version of this figure using heatmap.2 from the gplots package.

If you’re interested in the code for this, I suggest you check out his blog post where he walks the reader through creating the figure, beginning from heatmap defaults.

However, it didn’t take long for someone to pipe up asking for a ggplot2 version (3 minutes in fact…) and that’s my preference too, so I decided to have a go at putting one together.

ggplot2 version

Thankfully the hard work of tracking down the data had already been done for me; to get at it, follow these steps:

  1. Register and log in to “Project Tycho”
  2. Go to level 1 data, then Search and retrieve data
  3. Now change a couple of options: geographic level := state; disease outcome := incidence
  4. Add all states (highlight all at once with Ctrl+A, or Cmd+A on Macs)
  5. Hit submit and scroll down to Click here to download results to excel
  6. Open in Excel and export to CSV

Simple, right?

Now all that’s left to do is a bit of tidying. The data comes in wide format, so can be melted to our ggplot2-friendly long format with:

library("reshape2")
measles <- melt(measles, id.var=c("YEAR", "WEEK"))

After that we can clean up the column names and use dplyr to aggregate weekly incidence rates into an annual measure:

colnames(measles) <- c("year", "week", "state", "cases")
library("dplyr")
mdf <- measles %>% group_by(state, year) %>% 
       summarise(c=if(all(is.na(cases))) NA else 
                 sum(cases, na.rm=T))
mdf$state <- factor(mdf$state, levels=rev(levels(mdf$state)))

It’s a bit crude, but what I’m doing is summing the weekly incidence rates and leaving NAs if there’s no data for a whole year. This seems to match what’s been done in the WSJ article, though a more interpretable measure could be something like average weekly incidence, as used by Robert Allison in his SAS version.
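
For comparison, here’s a minimal sketch of that alternative summary (mean weekly incidence per year) using the same melted data frame; this is my reading of the idea, not a copy of Robert Allison’s SAS code:

# Mean weekly incidence per state-year, keeping NA where a whole year is missing
mdf_mean <- measles %>% group_by(state, year) %>%
       summarise(c=if(all(is.na(cases))) NA else
                 mean(cases, na.rm=T))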

After trying to match colours via the OS X utility “digital colour meter” without much success, I instead grabbed the colours and breaks from the original plot’s javascript to make them as close as possible.
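
One thing to note: the cols vector used in the plot below isn’t defined in the snippets above. A minimal sketch of how it could be set up is shown here, with placeholder hex values standing in for the ones actually pulled from the chart’s javascript:

# Placeholder palette running from low to high incidence;
# swap in the exact colours and breaks extracted from the WSJ plot's javascript
cols <- colorRampPalette(c("#e7f0fa", "#95cbee", "#0099dc", "#4ab04a",
                           "#ffd73e", "#e29421", "#f05336", "#ce472e"))(16)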

In full, the actual ggplot2 command took a fair bit of tweaking:

ggplot(mdf, aes(y=state, x=year, fill=c)) + 
  geom_tile(colour="white", linewidth=2, 
            width=.9, height=.9) + theme_minimal() +
    scale_fill_gradientn(colours=cols, limits=c(0, 4000),
                        breaks=seq(0, 4e3, by=1e3), 
                        na.value=rgb(246, 246, 246, max=255),
                        labels=c("0k", "1k", "2k", "3k", "4k"),
                        guide=guide_colourbar(ticks=T, nbin=50,
                               barheight=.5, label=T, 
                               barwidth=10)) +
  scale_x_continuous(expand=c(0,0), 
                     breaks=seq(1930, 2010, by=10)) +
  geom_segment(x=1963, xend=1963, y=0, yend=51.5, size=.9) +
  labs(x="", y="", fill="") +
  ggtitle("Measles") +
  theme(legend.position=c(.5, -.13),
        legend.direction="horizontal",
        legend.text=element_text(colour="grey20"),
        plot.margin=grid::unit(c(.5,.5,1.5,.5), "cm"),
        axis.text.y=element_text(size=6, family="Helvetica", 
                                 hjust=1),
        axis.text.x=element_text(size=8),
        axis.ticks.y=element_blank(),
        panel.grid=element_blank(),
        title=element_text(hjust=-.07, face="bold", vjust=1, 
                           family="Helvetica"),
        text=element_text(family="URWHelvetica")) +
  annotate("text", label="Vaccine introduced", x=1963, y=53, 
           vjust=1, hjust=0, size=I(3), family="Helvetica")

Result

I’m pretty happy with the outcome, but there are a few differences: the ordering is off (someone pointed out the original is ordered by two-letter state code rather than full state name) and the fonts are off (as far as I can tell the WSJ uses “Whitney ScreenSmart” among others).

Obviously the original is an interactive chart which works great with this data. It turns out it was built with the highcharts library, which actually has R bindings via the rCharts package, so in theory the original chart could be entirely recreated in R! However, for now at least, that’ll be left as an exercise for the reader…


Full code to reproduce this graphic is on github.


Filed under R

Celebrity twitter followers by gender

The most popular accounts on twitter have millions of followers, but what are their demographics like? Twitter doesn’t collect or release this kind of information, and even things like name and location are only voluntarily added to people’s profiles. Unlike Google+ and Facebook, twitter has no real-name policy; they don’t care what you call yourself, because they can still divine useful information from your account activity.

For example, you can optionally set your location on your twitter profile. Should you choose not to, twitter can still just geolocate your IP. If you use an anonymiser or VPN, they could use the timing of your account activity to infer a timezone. This could then be refined to a city or town using the topics you tweet about and the locations of friends and services you mention most.

I chose to look at one small aspect of demographics: gender, and used a cheap heuristic based on stated first name to estimate the male:female ratios in a sample of followers from these very popular accounts.

Top 100 twitter accounts by followers

A top 100 list is made available by Twitter Counter. It’s not clear that they have made this list available through their API, but thanks to the markup, a quick hack is to scrape the usernames using RCurl and some regex:

require("RCurl")
top.100 <- getURL("http://twittercounter.com/pages/100")

# split into lines
top.100 <- unlist(strsplit(top.100, "\n"))
# Get only those lines with an @
top.100 <- top.100[sapply(top.100, grepl, pattern="@")]

# Grep out anchored usernames: <a ...>@username</a>
top.100 <- gsub(".*>@(.+)<.*", "\\1", top.100)[2:101]
head(top.100)
# [1] "katyperry"  "justinbieber"  "BarackObama"  ...

R package twitteR

Getting data from the twitter API is made simple by the twitteR package. I made use of Dave Tang’s worked example for the initial OAuth setup; once that’s complete, the twitteR package is really easy to use.

The difficulty getting data from the API, as ever, is to do with rate limits. Twitter allows 15 requests for follower information per 15-minute window. (Number of followers can be queried with a much more generous 180 requests per window.) This means that to get a sample of followers for each of the top 100 twitter accounts, it’ll take a minimum of 1 hour 40 minutes to stay on the right side of the rate limit. I ended up using 90-second sleeps between requests to be safe, making a total query time of two and a half hours!

Another issue is possibly to do with strange characters being returned and breaking the JSON import. This error crops up a lot and meant that I had to lower the sample size of followers to avoid including these problem accounts. After some highly unscientific tests, I settled on about 1000 followers per account which seemed a good trade-off between maximising sample size but minimising failure rate.

# Try to sample 3000 followers for a user:
username$getFollowers(n=3000)
# Error in twFromJSON(out) :
#  Error: Malformed response from server, was not JSON.
# The most likely cause of this error is Twitter returning
# a character which can't be properly parsed by R. Generally
# the only remedy is to wait long enough for the offending
# character to disappear from searches.
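
Putting the rate limit, the 90-second sleep and the reduced sample size together, the collection loop might look something like this minimal sketch (the tryCatch to skip problem accounts is my addition, and OAuth setup as per Dave Tang’s post is assumed):

require("twitteR")
# top.100: usernames scraped above
follower.samples <- list()
for (u in top.100) {
  follower.samples[[u]] <- tryCatch(
    getUser(u)$getFollowers(n=1000),  # one follower request per account
    error=function(e) NULL)           # skip accounts breaking the JSON import
  Sys.sleep(90)  # stay under 15 follower requests per 15-minute window
}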

Gender inference

Here I used a relatively new R package, rOpenSci’s gender (kudos for resisting gendR and the like). This uses U.S. social security data to probabilistically link first names with genders, e.g.:

devtools::install_github("ropensci/gender")
require("gender")
gender("ben")
#   name proportion_male proportion_female gender
# 1  ben          0.9962            0.0038   male

So chances are good that I’m male. But the package also returns proportional data based on the frequency of appearances in the SSA database. Naively these can be interpreted as the probability a given name is either male or female. So in terms of converting a list of 1000 first names to genders, there are a few options:

  1. Threshold: if  >.98 male or female, assign gender, else ignore.
  2. Probabilistically: use random number generation to assign each case; if a name is .95 male and .05 female, on average assign that name to female 5% of the time.
  3. Bayesian-ish: threshold for almost certain genders (e.g. .99+) and use this as a prior belief of gender ratios when assigning gender to the other followers for a given user. This would probably lower bias when working with heavily skewed accounts.

I went with #2 here. Anecdotal evidence suggests it’s reasonably accurate anyway, with twitter analytics (using bag of words, sentiment analysis and all sorts of clever tricks to unearth gender) estimating my account has 83% male followers (!), with probabilistic first name assignment estimating 79% (and that’s with a smaller sample). Method #3 may correct this further but the implementation tripped me up.
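
For illustration, here’s a minimal sketch of option #2, assuming a character vector first.names holding the followers’ first names (the variable name is hypothetical; newer versions of the gender package return one row per unique name, hence the merge back to the full list):

require("gender")
g <- gender(unique(first.names))  # name, proportion_male, proportion_female, ...
m <- merge(data.frame(name=first.names), g, by="name")
# Assign each follower a gender with probability equal to the name's male proportion
m$assigned <- ifelse(runif(nrow(m)) < m$proportion_male, "male", "female")
prop.table(table(m$assigned))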

Results

Celebrity twitter followers by gender

So boys prefer football (soccer) and girls prefer One Direction, who knew? Interestingly, Barack Obama appears to have a predominantly male following (59%), as does Bill Gates (67%).

At the other end of the spectrum, below One Direction, Simon Cowell is a hit with predominantly female twitter users (70%), as are Kanye West (67%) and Khloe Kardashian (72%).

Another surprise is that Justin Bieber, famed as a teen girl heartthrob, actually has broader gender appeal, with a 41:59 male:female split.

Interactive charts

Click for an interactive version.

Using the fantastic rCharts library, I’ve put together some interactive graphics that let you explore the above results further. These use the NVD3 graphing library, as opposed to my previous effort which used dimple.js.

The first of these is ordered by number of followers, and the second by gender split. The eagle-eyed among you will see that one account from the top 100 is missing from all these charts due to the JSON error I discuss above; thankfully it’s a boring one (sorry @TwitPic).

Where would your account be on these graphs? Somehow I end up alongside Wayne Rooney in terms of gender diversity :s

Caveats

  • A lot of the time genders can’t be called from an account’s first name. Maybe they haven’t given a first name, maybe it’s a business account or some pretty unicode symbols, maybe it’s a spammy egg account. This means my realised sample size is <<1000; sometimes the majority of usernames had no gender (e.g. @UberSoc, fake followers?).

    This (big) chart includes % for those that couldn’t be assigned (NA)

  • The SSA data is heavily biased towards Western (especially US) names, and non-English names are unlikely to be assigned a gender at all. This is a shame; if you know of a more international gender DB please let me know.
  • I’m sampling most recent followers, so maybe accounts like Justin Bieber have a much higher female ratio in earlier followers than those which have only just hit the follow button.
  • The sample size of 1000 followers per account is smaller than I’d like, especially for accounts with 50 million followers.

If you have other ideas of what to do with demographics data, or have noticed additional caveats of this study, please let me know in the comments!


Full code to reproduce this analysis is available on Github.


Filed under R, Unrelated

What are the most overrated films?

“Overrated” and “underrated” are slippery terms to try to quantify. An interesting way of looking at this, I thought, would be to compare the reviews of film critics with those of Joe Public, reasoning that a film which is roundly-lauded by the Hollywood press but proved disappointing for the real audience would be “overrated” and vice versa.

To get some data for this I turned to the most prominent review aggregator: Rotten Tomatoes. All this analysis was done in the R programming language, and full code to reproduce it will be attached at the end.

Rotten Tomatoes API

This API is nicely documented, easy to access and permissive with rate limits, as well as being cripplingly restrictive in what data it presents. Want a list of all films in the database? Nope. Most reviewed? Top rated? Highest box-office takings? Nope.

The related forum is full of what seem like simple requests that should be available through the API but aren’t: top 100 lists? Search using multiple IDs at once? Get audience reviews? All are unanswered or not currently implemented.

So the starting point (a big list of films) is actually kinda hard to get at. The Rube Goldbergian method I eventually used was this:

  1. Get the “Top Rentals” list of movie details (max: 50)
  2. Search each one for “Similar films” (max: 5)
  3. Get the unique film IDs from step 2 and iterate

(N.B. This wasn’t my idea but one from a post in the API forums; unfortunately I didn’t save the link.)

In theory this grows your set of films at a reasonable pace (with at most five similar titles per film, round x can return up to 5^x new IDs), but in reality the number of unique films being returned was significantly lower (shown below). I guess this was due to pulling in “walled gardens” to my dataset, e.g. if a Harry Potter film was hit, each further round would pull in the 5 others in the series as most similar.

Films returned

Results

Here’s an overview of the critic and audience scores I collected through the Rotten Tomatoes API, with some outliers labelled.

Most over- and underrated films

On the whole it should be noted that critics and audience agree most of the time, as shown by the Pearson correlation coefficient between the two scores (0.71 across >1200 films).

Click for interactive version.

Update:

I’ve put together an interactive version of the same plot here using the rCharts R package. It’ll show film title and review scores when you hover over a point so you know what you’re looking at. Also I’ve more than doubled the size of the film dataset by repeating the above method for a couple more iterations — take a look!

Most underrated films

Using our earlier definition it’s easy to build a table of those films where the audience ended up really liking a film that was panned by critics.

Scores are shown out of 100 for both aggregated critics and members of Rotten Tomatoes.

Somewhat surprisingly, the top of the table is Facing the Giants (2006), an evangelical Christian film. I guess non-Christians might have stayed away, and presumably it struck a chord within its target demographic — but after watching the trailer, I’d probably agree with the critics on this one.

This showed that some weighting of the difference might be needed, at the very least weighting by number of reviews, but the Rotten Tomatoes API doesn’t provide that data.

In addition, the Rotten Tomatoes page for the film shows a “want to see” percentage rather than an audience score. This came up a few times and I’ve seen no explanation for it; presumably the “want to see” rating is for unreleased films, but the API returns a separate (and undisclosed?) audience score for these films too.

Above shows a “want to see” rating, different to the “liked it” rating returned by the API and shown below. Note: these screenshots from RottenTomatoes.com are not CC licensed and are shown here under a claim of Fair Use, reproduced for comment/criticism.

Looking over the rest of the table, it seems the public is fonder of gross-out or slapstick comedies (such as Diary of a Mad Black Woman (2005) and Grandma’s Boy (2006)) than the critics. Again, not films I’d jump to defend as underrated. Bad Boys II however…

Most overrated films

Here we’re looking at those films which the critics loved, but paying audiences were then less enthused.

As before, scores are out of 100 and they’re ranked by difference between audience and critic scores.

Strangely the top 15 (by difference) contains both the original 2001 Spy Kids and the sequel Spy Kids 2: The Island of Lost Dreams (2002). What did critics see in these films that the public didn’t? A possibility is bias in the audience reviews collected: the target audience for these films is young children, who are probably underrepresented amongst Rotten Tomatoes reviewers. Maybe there’s even an enrichment for disgruntled parent chaperones.

Thankfully, though, in this table there’s the type of film we might more readily associate with being “overrated” by critics. Momma’s Man (2008) is an indie drama that debuted at the 26th Torino Film Festival. Essential Killing is a 2010 drama and political thriller from Polish director and screenwriter Jerzy Skolimowski.

There’s also a smattering of rom-coms (Friends with Money (2006), Splash (1984)) — if the API returned genre information it would be interesting to look for overall trends, but alas, it doesn’t. Additional interesting variables to consider might be budget, the lead, reviews of the producer’s previous films… There’s a lot of scope for interesting analysis here, but it’s currently just not possible with the Rotten Tomatoes API.

Caveats / Extensions

The full code will be posted below, so if you want to do a better job with this analysis, please do so and send me a link! 🙂

  • Difference is too simple a metric. A better measure might be weighted by (e.g.) critic score: a film critics give 95% but audiences 75% might be more interesting than the same 20-point gap on a film rated 60/40 (see the sketch after this list).
  • There’s something akin to a “founder effect” from my initial chosen films that makes it hard to diversify the dataset, especially towards films from previous decades and classics.
  • The Rotten Tomatoes API provides an IMDB id for cross-referencing, maybe that’s a path to getting more data and building a better film list.
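
As a rough illustration of the first point, one option (a toy weighting of my own, not something provided by the API) is to scale the difference by how highly the critics rated the film, reusing the ru data frame from the full code below:

# Toy weighted-difference metric: the same point gap counts for more
# when the critic score is high (the weighting choice is arbitrary)
ru$wdiff <- (ru$crit - ru$aud) * (ru$crit / 100)
head(ru[order(ru$wdiff, decreasing=T), c("title", "crit", "aud", "wdiff")], 15)
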
Full code to reproduce analysis

Note: If you’re viewing this on r-bloggers, you may need to visit the Benomics version to see the attached gist.


library("RCurl")
library("jsonlite")
library("ggplot2")
library("RColorBrewer")
library("scales")
library("gridExtra")
api.key <- "yourAPIkey"
rt <- getURI(paste0("http://api.rottentomatoes.com/api/public/v1.0/lists/dvds/top_rentals.json?apikey=", api.key, "&limit=50"))
rt <- fromJSON(rt)
title <- rt$movies$title
critics <- rt$movies$ratings$critics_score
audience <- rt$movies$ratings$audience_score
df <- data.frame(title=title, critic.score=critics,
audience.score=audience)
# Top 50 rentals, max returnable
ggplot(df, aes(x=critic.score, y=audience.score)) +
geom_text(aes(label=title, col=1/(critic.score - audience.score)))
# how can we get more? similar chaining
# STILL at most 5 per film (sigh)
getRatings <- function(id){
sim.1 <- getURI(paste0("http://api.rottentomatoes.com/api/public/v1.0/movies/",
id, "/similar.json?apikey=",
api.key, "&limit=5"))
sim <- fromJSON(sim.1)
#cat(sim$movies$title)
d <- data.frame(id = sim$movies$id,
title = sim$movies$title,
crit = sim$movies$ratings$critics_score,
aud = sim$movies$ratings$audience_score)
return(d)
}
rt.results <- function(idlist){
r <- sapply(unique(as.character(idlist)), getRatings, simplify=F)
r <- do.call(rbind, r)
return(r)
}
# Maybe this could be done via a cool recursion using Recall
r1 <- rt.results(rt$movies$id)
r2 <- rt.results(r1$id)
r3 <- rt.results(r2$id)
r4 <- rt.results(r3$id)
r5 <- rt.results(r4$id)
r6 <- rt.results(r5$id)
r7 <- rt.results(r6$id)
f <- function(x)
(5**x)-1
# Fig. 1: Number of films gathered via recursive descent
# of 'similar films' lists.
dev.off()
options(scipen=99)
pdf(4, 4, file="rottenTomatoHits.pdf")
par(cex.axis=.7, pch=20, mar=c(4,3,1,1), mgp=c(1.5,.3,0), tck=-.02)
plot(1:7, f(1:7), type="b", xlab="Recursions", ylab="Number of hits",
log="y", col=muted("blue"), lwd=2, ylim=c(4, 1e5))
lines(1:7, c(nrow(r1), nrow(r2), nrow(r3), nrow(r4), nrow(r5),
nrow(r6), nrow(r7)), type="b", col=muted("red"), lwd=2)
legend("bottomright", col=c(muted("blue"), muted("red")), pch=20, lwd=2,
legend=c(expression(Max~(5^x)), "Realised"), bty="n", lty="47")
dev.off()
r <- rbind(r1, r2, r3, r4, r5, r6, r7)
# 1279 unique films
ru <- r[!duplicated(as.character(r$id)),]
# Films with insufficient critics reviews get -1 score
ru[which(ru$crit == -1),]
ru <- ru[ru$crit != -1,]
ru$diff <- ru$crit - ru$aud
pcc <- cor(ru$crit, ru$aud)
# Overview figure: Show all critics vs. audience scores
# and highlight the titles of outliers
svg(7, 6, file="overview.svg")
ggplot(ru, aes(x=crit, y=aud, col=diff)) +
geom_point() +
coord_cartesian(xlim=c(-10,110), ylim=c(-10,110)) +
scale_color_gradientn(colours=brewer.pal(11, "RdYlBu"),
breaks=seq(-60,40, length.out=11),
labels=c("Underrated", rep("", 4),
"Agree", rep("", 4),
"Overrated")) +
geom_text(aes(label=ifelse(diff < quantile(diff, .005) |
diff > quantile(diff, .995), as.character(title), ""),
size=abs(diff)),
hjust=0, vjust=0, angle=45) +
scale_size_continuous(range=c(2,4), guide="none") +
labs(list(x="Critic's score", y="Audience score",
col="")) +
annotate("text", 3, 3,
label=paste0("rho ==", format(pcc, digits=2)),
parse=T)
dev.off()
tab <- ru
colnames(tab) <- c("id", "Title", "Critics", "Audience", "Difference")
# Most underrated films:
grid.newpage()
grid.draw(tableGrob(tab[order(tab$Difference),][1:15,-1], show.rownames=F))
# Most overrated:
grid.newpage()
grid.draw(tableGrob(tab[order(tab$Difference, decreasing=T),][1:15,-1], show.rownames=F))

 


Filed under R, Unrelated

Author inflation in academic literature

There seems to be a general consensus that author lists in academic articles are growing. Wikipedia says so, and I’ve also come across a published letter and short Nature article which accept this is the case and discuss ways of mitigating the issue. Recently there was an interesting discussion on academia.stackexchange on the subject but again without much quantification. Luckily given the array of literature database APIs and language bindings available, it should be pretty easy to investigate with some statistical analysis in R.

rplos

ROpenSci offers nice R language bindings for the PLOS (I’m more used to PLoS but I’ll go with it) API, called rplos. There’s no particular reason to limit the search to PLOS journals, but rplos seems significantly more straightforward to work with than PubMed API packages I’ve used in the past, like RISmed.

Additionally the PLOS group contains two journals of particular interest to me:

  • PLOS Computational Biology — a respectable specialist journal in my field; have bioinformatics articles been particularly susceptible to author inflation?
  • PLOS ONE — the original mega-journal. I wonder if the huge number of articles published here show different trends in authorship over time.

The only strange part of the search was at the PLOS API end. To search by the publication_date field you need to supply both a beginning and end date range, in the form:

publication_date:[2001-01-01T00:00:00Z TO 2013-12-31T23:59:59Z]

A tad verbose, right? I can’t imagine wanting to search for things published at a particular time of day. A full PLOS API query using the rplos package looks something like this:

searchplos(
  # Query: publication date in 2012
  q  = 'publication_date:[2012-01-01T00:00:00Z TO 2012-12-31T23:59:59Z]', 

  # Fields to return: id (doi) and author list
  fl = "id,author", 

  # Filter: only actual articles in journal PLOS ONE
  fq = list("doc_type:full",
            "cross_published_journal_key:PLoSONE"), 

  # 500 results (max 1000 per query)
  start=0, limit=500, sleep=6)

A downside of using the PLOS API is that the set of journals is quite recent: PLOS ONE started in 2006 and PLOS Biology only a few years before in 2003, so it’ll only give us a limited window into any long-term trends.

Distribution of author counts

Before looking at inflation we can compare the distribution of author counts per paper between the journals:

Distribution of author counts
ECDF per journal

Possibly more usefully — but less pretty — the same data can be plotted as empirical cumulative distribution functions (ECDF). From these we can see that PLOS Biology had the highest proportion of single-author papers in my sample (n = ~22,500 articles overall), followed by PLOS Medicine, with PLOS Genetics showing more high-author papers in the long tail of the distribution, including the paper with the most authors in the sample (Couch et al., 2013 with 270 authors).

Author inflation

So, in these 6 different journals published by PLOS, how has the mean number of authors per paper varied across the past 7 years?

PLOS author inflation

Above I’ve shown yearly means plus their 95% confidence intervals, as estimated by a non-parametric bootstrap method implemented in ggplot2. Generally from this graph it does look like there’s a slight upward trend on average, though arguably the mean is not the best summary statistic to use for this data, which I’ve shown is not normally distributed, and may better fit an extreme value distribution.

The relationship between publication date and number of authors per paper becomes clearer if it’s broken down by journal:

Author inflation regression

Here linear regression reveals a significant positive coefficient for year against mean author count per paper, as high as .52 extra authors per year on average, down to just .06 a year for PLOS ONE. Surprisingly, the mega-journal, which published around 80,000 papers over this time period, seems the least susceptible to author inflation.

Author inflation per journal

The explained variance in mean number of authors per paper (per year) ranges from .28 (PLOS ONE) up to an impressive .87 for PLOS Medicine, with PLOS Computational Biology not far behind on .83. However, PLOS Computational Biology had the second-lowest regression coefficient, and the lowest average number of authors of the six journals — maybe us introverted computer types should be collaborating more widely!
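
For reference, a minimal sketch of one of those per-journal fits on the yearly means (reusing the means data frame of per-journal, per-year averages built in the full code below):

fit <- lm(counts ~ year, data=subset(means, journal == "PLoSCompBiol"))
coef(fit)["year"]       # slope: estimated extra authors per paper per year
summary(fit)$r.squared  # explained variance of the yearly means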

Journal effects

It’s interesting to speculate on what drives the observed differences in author inflation between journals. A possible covariate is the roundly-condemned “Impact Factor” journal-level metric — are “high impact” journals seeing more author creep than lesser publications?

Correlation of author inflation and impact factor

If the estimate of author inflation is plotted against each journal’s recent impact factor, there does indeed appear to be a positive correlation. However, this comparison only has 6 data points, so there’s not enough evidence to reject the null hypothesis that there is no relationship between these two variables (p = 0.18).

Conclusion

Is author inflation occurring?

Yes, it certainly appears to be on average.

Is it a problem?

I don’t know, but I’d lean towards probably not.

The average trends could reflect the proliferation of “Big Science”, with huge collaborative consortiums like ENCODE and FANTOM (though the main papers of those examples were targeted to Nature), and don’t necessarily support a conclusion that Publish-or-Perish culture is forcing lots of token authorships and backhanders between scientists.

Maybe instead (as the original discussion hypothesised), people who traditionally may not have been credited with authorship (bioinformaticians doing end-point analysis, and lab technicians) are now getting recognised for their input more often, or conceivably advances in cloud computing, distributed data storage and video conferencing have better enabled larger collaborations between scientists across the globe!

Either way, evidence for author inflation is not evidence of a problem per se 🙂

Caveats

  • Means used for regression — while we get a surprisingly high R² when regressing the mean number of authors per paper per year, the predictive power for individual papers unsurprisingly vanishes (R² plummets to between .02 and 4.6 × 10⁻⁴ per journal — significant non-zero coefficients remain). Author inflation wouldn’t be expected to exhibit consistent and pervasive effects in all papers; for example, reviews, letters and opinion pieces presumably have consistently lower author counts than research articles, and not all science can work in a collaborative, multi-author framework.
  • Search limits — rplos returns a maximum of 1000 results at a time (but they can be returned sequentially using the start and limit parameters). They seem to be drawn in reverse order of time, so the results here probably aren’t fully representative of the year in some cases. This has also meant my sample is unevenly split between journals: PLoSBiology: 2487; PLoSCompBiol: 3403; PLoSGenetics: 4013; PLoSMedicine: 2094; PLoSONE: 7176; PLoSPathogens: 3647; Total: 22,460.
  • Resolution — this could be done in a more fine-grained way, say with monthly bins. As mentioned above, for high-volume journals like PLOS ONE, the sample is likely coming from the end of the years from ~2010 onwards.

Full code to reproduce analysis


options(PlosApiKey = "<insert your API key here!>")
#install_github("rplos", "ropensci")
library("rplos")
library("ggplot2")
require("dplyr")
# Convert author strings to counts
countAuths <- function(cell)
length(unlist(strsplit(cell, ";")))
countAuths <- Vectorize(countAuths)
# Query PLoS API for 1k papers per journal per year,
# count the number of authors and return a data.frame
getAuths <- function(j, lim=1000, start.year=2006){
cat("Getting results for journal: ", j, "\n")
# seem to be in reverse order by year?
results <- sapply(start.year:2013, function(i) data.frame(year = i,
auths = searchplos(
q = paste0('publication_date:[', i,
'-01-01T00:00:00Z TO ', i,
'-12-31T23:59:59Z]'),
fl = "author",
fq = list("doc_type:full",
paste0("cross_published_journal_key:", j)),
start=0, limit=lim, sleep=6),
year=i), simplify=F)
results <- do.call(rbind, results)
results$counts <- countAuths(results$author)
results$journal <- j
results
}
journals <- journalnamekey()
plos.all <- sapply(journals[c(1:5, 7)], getAuths, simplify=F)
plos <- do.call(rbind, plos.all)
# Fig. 1: Bean plot showing distribution of author counts
# per journal overall
ggplot(plos, aes(x=journal, y=counts, fill=journal)) +
geom_violin(scale="width") +
geom_boxplot(width=.12, fill=I("black"), notch=T,
outlier.size=NA, col="grey40") +
stat_summary(fun.y="median", geom="point", shape=20, col="white") +
scale_y_log10(breaks=c(1:5, seq(10, 50, by=10), 100, 200, 300)) +
coord_flip() + labs(x="", y="Number of authors per paper") +
theme_classic() + theme(legend.position="none") +
scale_fill_brewer()
# Fig 2. ECDFs of the author count distributions
# 5in x 5in
ggplot(plos, aes(x=counts, col=journal)) +
stat_ecdf(geom="smooth", se=F, size=1.2) + theme_bw() +
scale_x_log10(breaks=c(1:5, seq(10, 50, by=10), 100, 200, 300)) +
theme(legend.position=c(.75,.33)) +
labs(x="Number of authors per paper", y="ECDF",
col="") + coord_cartesian(xlim=c(1,300)) +
scale_color_brewer(type="qual", palette=6)
# Fig 3. Trends in author counts over time with
# confidence limits on the means
# 7 x 7
ggplot(plos, aes(x=year, y=counts, col=journal, fill=journal)) +
stat_summary(fun.data="mean_cl_boot", geom="ribbon",
width=.2, alpha=I(.5)) +
stat_summary(fun.y="mean", geom="line") +
labs(list(x="Year", y="Mean number of authors per paper")) +
theme_bw() + theme(legend.position=c(.2,.85)) +
scale_fill_brewer(type="qual", palette=2,
guide=guide_legend(direction="vertical",
label.position="bottom",
title=NULL, ncol=2,
label.hjust=0.5)) +
scale_color_brewer(type="qual", palette=2, guide="none")
# from http://stackoverflow.com/a/17024184/1274516
# show regression equation on each graph facet
lm_eqn <- function(df){
m <- summary(lm(counts ~ year, df))
eq <- substitute(~~y~"="~beta*x+i~(R^2==r2),
list(beta = format(m$coefficients[2,"Estimate"],
digits = 3),
i = format(m$coefficients[1,"Estimate"], digits=3),
r2 = format(m$r.squared, digits=2)))
as.character(as.expression(eq))
}
means <- group_by(plos, journal, year) %>% summarise(counts=mean(counts))
b <- by(means, means$journal, lm_eqn)
df <- data.frame(beta=unclass(b), journal=names(b))
summary(lm(counts ~ year + journal, data=means))
means <- group_by(means, journal) %>% summarise(m=max(counts))
df$top <- means$m * 1.2
# Fig 4. Facetted linear regression of author inflation per journal
# 6 x 8.5
ggplot(plos, aes(x=year, y=counts, col=journal, fill=journal)) +
stat_summary(fun.data="mean_cl_boot", geom="errorbar",
width=.2, alpha=I(.5)) +
stat_summary(fun.y="mean", geom="point") +
#stat_summary(fun.y="median", geom="point", shape=4) +
facet_wrap(~journal, scales="free_y") +
geom_smooth(method="lm") +
scale_x_continuous(breaks=2006:2013) +
labs(list(x="", y="Mean number of authors per paper")) +
theme_bw() + theme(axis.text.x=element_text(angle=45, hjust=1)) +
scale_fill_brewer(type="qual", palette=2, guide="none") +
scale_color_brewer(type="qual", palette=2, guide="none") +
geom_text(data=df, aes(x=2009.5, y=top, label=beta), size=3, parse=T)
# Overall estimate of author inflation?
# .21 extra authors per paper per year, on average
s <- summary(lm(counts ~ year + journal, data=plos))
# Summary barchart data:
bc <- data.frame(journal = unique(means$journal),
trend = c(0.2490979,
0.1211823,
0.5201688,
0.4088536,
0.05894102,
0.1828939),
std.err = c(0.08224567,
0.02213142,
0.1493662,
0.06361849,
0.03891493,
0.03798822),
IF = c(12.690,
4.867,
8.517,
15.253,
3.730,
8.136))
bc$journal <- factor(bc$journal, levels=bc$journal[order(bc$trend)])
# Fig 5. Barchart of author inflation estimate per journal.
# 7 x 5
ggplot(bc, aes(x=journal, y=trend, fill=journal, ymin=trend-std.err,
ymax=trend+std.err)) +
geom_bar(stat="identity") +
geom_errorbar(width=.2) +
scale_y_continuous(expand=c(0,0)) +
theme_classic() +
labs(x="",
y="Estimate of annual author inflation (additional mean authors per paper)") +
theme(axis.text.x=element_text(angle=45, hjust=1)) +
scale_fill_brewer(palette="Blues", guide="none")
pcc <- cor(bc$trend, bc$IF)
# Fig 6. Correlation of author inflation and journal impact factors.
# 5 x 5
ggplot(bc, aes(x=trend, y=IF, col=journal)) +
geom_text(aes(label=journal)) + xlim(0,.6) +
labs(x="Author inflation estimate",
y="Journal impact factor (2012)") +
scale_color_brewer(type="qual", palette=2, guide="none") +
annotate("text", x=.05, y=15,
label=paste0("rho == ", format(pcc, digits=2)), parse=T)
# N.S. (p = 0.18)
cor.test(bc$trend, bc$IF)


Filed under R

Guardian data blog — UK general election analysis in R

The Guardian newspaper has for a few years been running a data blog and has built up a massive repository of (often) well-curated datasets on a huge number of topics. They even have an indexed list of all data sets they’ve put together or reused in their articles.

It’s a great repository of interesting data for exploratory analysis, and there’s a low barrier to entry in terms of getting the data into a useful form. Here’s an example using UK election polling data collected over the last thirty years.

ICM polling data

The Guardian and ICM research have conducted monthly polls on voting intentions since 1984, usually with a sample size of between 1,000 and 1,500 people. It’s not made obvious how these polls are conducted (cold-calling?) but for what it’s worth ICM is a member of the British Polling Council, and so hopefully tries to monitor and correct for things like the “Shy Tory Factor“—the observation that Conservative voters supposedly have (or had prior to ’92)  a greater tendency to conceal their voting intentions than Labour supporters.

Preprocessing

The data is made available from The Guardian as a .csv file via Google spreadsheets here and requires minimal cleanup: cut the source information from the end of the file and you can open it up in R.


sop <- read.csv("StateOfTheParties.csv", stringsAsFactors=F)

## Data cleanup
sop[,2:5] <- apply(sop[,2:5], 2, function(x) as.numeric(gsub("%", "", x)))
sop[,1] <- as.Date(sop[,1], format="%d-%m-%Y")
colnames(sop)[1] <- "Date"

# correct for some rounding errors leading to 101/99 %
sop$rsum <- apply(sop[,2:5], 1, sum)
table(sop$rsum)
sop[,2:5] <- sop[,2:5] / sop$rsum

Then after melting the data.frame down (full code at the end of the post), you can get a quick overview with ggplot2.

UK general election overview 1984-2014

Outlines (stacked bars) represent general election results

Election breakdown

The area plot is a nice overview but not that useful quantitatively. Given that the dataset includes general election results as well as opinion polling, it’s straightforward to split the above plot by this important factor. I also found it useful to convert absolute dates to be relative to the election they precede. R has an object class, difftime, which makes this easy to accomplish, and calling as.numeric() on a difftime object converts it to a raw number of days (handily accounting for things like leap years).
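
A quick illustration of that conversion (the dates here are arbitrary examples):

d <- as.Date("1997-05-01") - as.Date("1996-11-15")
class(d)       # "difftime"
as.numeric(d)  # 167 (days)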

These processing steps lead to a clearer graph with more obvious stories, such as the gradual and monotonic decline of support for Labour during the Blair years.

UK general election data split by election period

NB: Facet headers show the year and result of the election relative to which the (preceding) points are plotted.

Next election’s result

I originally wanted to look at this data to get a feel for how things are looking before next year’s (2015) general election, maybe even running some predictive models (obviously I’m no fivethirtyeight.com).

However, graphing the trends of public support for the two main UK parties hints it’s unlikely to be a fruitful endeavour at this point, and with the above graphs showing ominously increasing support for “other” parties (not accidentally coloured purple), it looks like with about 400 days to go the 2015 general election is still all to play for.

Labour vs. Conservative support trends



library("ggplot2")
library("scales")
## Data: http://www.theguardian.com/news/datablog/2009/oct/21/icm-poll-data-labour-conservatives#data (State of the parties spreadsheet)
# ICM poll results of ~1000 people taken every month or so
# (more often before an election) as well as election
# results dating from June 1984.
sop <- read.csv("StateOfTheParties.csv", stringsAsFactors=F)
## Data cleanup
sop[,2:5] <- apply(sop[,2:5], 2, function(x) as.numeric(gsub("%", "", x)))
sop[,1] <- as.Date(sop[,1], format="%d-%m-%Y")
colnames(sop)[1] <- "Date"
# correct for some rounding errors leading to 101/99 %
sop$rsum <- apply(sop[,2:5], 1, sum)
table(sop$rsum)
sop[,2:5] <- sop[,2:5] / sop$rsum
# Melt for ggplot
melt.fun <- function(x)
melt(x, measure.vars=c("CON", "LAB", "LIB.DEM", "OTHER"), id.vars="Date")
props <- melt.fun(sop)
# Grab election data for additional layer
elections <- sop[which(grepl("ELECTION", sop$Sample)),]
elections <- melt.fun(elections)
# Party colours (ish)
cols <- brewer.pal(8, "Pastel1")[c(2,1,6,4)]
# Area plot overview of public support for main parties over 30 years
ggplot(props, aes(x=Date, y=value, fill=variable, group=variable)) +
geom_area() + scale_fill_manual(values=cols) +
geom_bar(data=elections, stat="identity", position=position_stack(.1),
width=I(100), col="grey40") +
scale_y_continuous(expand=c(0,0), labels=percent_format()) +
scale_x_date(expand=c(0,0)) +
labs(list(y="Public support", fill="Party", x="")) #+
ggtitle("UK general elections 1984-2014")
## Look at the time between elections
e.dates <- c(unique(elections$Date), as.Date("2015-05-07"))
results <- c(rep("CON",2), rep("LAB", 3), "CON.LIB", "?")
sop.rows <- which(grepl("ELECTION", sop$Sample))
e.years <- c(sub(".*(\\d{4})$", "\\1", sop[sop.rows, "Sample"], perl=T), "2015")
sop$runup <- rep(c(1:7), c(sop.rows[1], diff(c(sop.rows, nrow(sop)))))
sop$win <- rep(results, c(sop.rows[1], diff(c(sop.rows, nrow(sop)))))
e <- sop
e <- melt(e, measure.vars=c("CON", "LAB", "LIB.DEM", "OTHER"),
id.vars=c("runup", "Date", "win"))
e$prior <- with(e, as.numeric(Date - e.dates[runup]))
e$header <- with(e, paste0(e.years[e$runup], " (", win, ")"))
# Fn to make colours legible on white bg
darken <- function(hex.col, amount=.15)
return(rgb(t(col2rgb(hex.col)*(1-amount)), max=255))
darken <- Vectorize(darken)
dcols <- unname(darken(cols))
# Previous 6 election support history
ggplot(subset(e, header != "2015 (?)"),
aes(x=prior, y=value, col=variable, fill=variable)) +
facet_wrap(~header, scales="free_x") +
scale_color_manual(values=dcols) +
scale_fill_manual(values=dcols) +
geom_smooth(method="loess") +
geom_line() + geom_point() + theme_bw() +
scale_y_continuous(labels=percent_format()) +
labs(list(y="Public support", x="Days relative to election",
col="Party", fill="Party"))
# boxlots of election runups
ggplot(e, aes(x=header, y=value, fill=variable)) +
theme_bw() + geom_boxplot(position=position_dodge(.8)) +
scale_fill_manual(values=cols) +
labs(list(y="Public support", x="Days relative to election",
col="Party", fill="Party")) +
ggtitle("UK general elections 1984-2014") +
theme(legend.position=c(.9,.85), legend.background=element_blank())
## Which prior results are most predictive of end result
ggplot(subset(e, variable %in% c("LAB", "CON")),
aes(x=prior, y=value, col=variable, fill=variable)) +
facet_grid(header~., scales="free_y") +
scale_color_manual(values=dcols) +
scale_fill_manual(values=dcols) +
geom_smooth(method="loess") +
scale_x_continuous(expand=c(0.01,0.01)) +
theme_bw() + scale_y_continuous(labels=percent_format()) +
labs(list(y="Public support", x="Days relative to election",
col="Party", fill="Party"))



Filed under R

Meticulously recreating bitmap plots in R

There’s a hard-fought drive on Wikimedia Commons to convert those images that should be in vector format (i.e. graphs, diagrams) from their current bitmap form. At the time of writing, the category “Images that should use vector graphics” has over 7,000 images.

The usual way people move between the two is by tracing over the raster, and great tools like Inkscape (free open-source software) can help a lot with this. But in the case of graphs I thought it’d be fun to try and rebuild a carbon copy from scratch in R.

The original

This is the original bitmap plot I wanted to recreate. (Courtesy of Wikimedia Commons)

The file that first caught my eye was this nice graph of US employment stats, currently used on the highly-trafficked Obama article. I’m not sure what drew this originally; it doesn’t look like Excel, given the broken axis and annotations, but maybe it is. It’s currently a PNG at about 700 × 500, so it should be an easy target for improvement.

Figure 2.0

The two raw data files are available here and here as Excel spreadsheets. They have some weird unnecessary formatting, so the various xls parsers for R won’t work; save the tables from Excel as CSV. I won’t talk through the code as it wasn’t too taxing (or clean), but it’ll be at the end of the post. Here’s what I came up with:

I realise the irony in having to upload a bitmap version for wordpress, but click for the SVG.

I expanded my plot to include the 2013 data, so it inescapably has slightly different proportions to the original. And I was working on a single monitor at the time, so I didn’t have a constant comparison. I can see now that a few things are still off (the fonts are differently sized, for one, and I ditched the broken axis), but overall I think it’s a decent likeness!

ggplot2 version

Two y-axes on the same graph is bad, bad, bad and unsurprisingly forbidden with ggplot2, but I did come across this method of dummy-facetting and then plotting separate layers per facet. An obvious problem is that the y-axes now represent different things and you only have one label. A hacky fix is to write your y-labels into the facet headers (I’m 100% confident Hadley Wickham and Leland Wilkinson would not be impressed with this). Another alternative would be to map a colour aesthetic to your y-axis values and label it in the legend (again, pretty far from recommended practice).
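
In outline, the trick looks something like this (a condensed sketch of what the full code at the end of the post does):

# Tag each data frame with a "panel" label that doubles as the y-axis title
u2$panel <- "Unemployment rate (%)"
n2$panel <- "Net change in number of jobs (000s)"
both <- rbind(u2, n2)

ggplot(both, aes(x=as.Date(date), y=rate)) +
  facet_grid(panel~., scale="free_y") +   # one dummy facet per measure
  geom_point(data=u2) +                   # each layer only supplies data for its own facet
  geom_bar(data=n2, stat="identity")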

This is what I ended up with. I still think it’s a reasonable alternative to the above, and the loess-fitted model nicely shows the unemployment rate trend without the seasonality effects:

ggplot2 version of the employment statistics chart

Article version

While mimicking the original exactly was fun (for me at least), I tried to improve upon it for the actual final figure for use on Wikipedia. For instance, it now uses unambiguous month abbreviations, and I swapped the legend for colour-coded text labels. It still has some of the original’s charm though. Looks like after a bit of a rough patch, your employment statistics are starting to look pretty good, Mr. President.

Updated version of the figure for use on Wikipedia

Next up, the other much less attractive figures on that page ([1], [2]).



require("zoo")
require("ggplot2")
# unemployment
u <- read.csv("Employment.csv", header=T)
u <- melt(u, "Year")
unemployment <- data.frame(date=as.yearmon(do.call(paste, u[,1:2]), "%Y %b"),
rate=u$value)
unemployment <- unemployment[unemployment$date > as.yearmon("2008-12"),]
u2 <- unemployment
colnames(u2) <- c("date", "rate")
n <- read.csv("netChange.csv", header=T)
n <- melt(n, "Year")
# Check (!) dates are the same in each input, else repeat parse
all.equal(net.change$date, unemployment$date) # TRUE
net.change <- data.frame(date=unemployment$date,
change=n$value)
net.change <- net.change[net.change$date > as.yearmon("2008-12"),]
n2 <- net.change
colnames(n2) <- c("date", "rate")
n2$panel <- "Net change in number of jobs (000s)"
u2$panel <- "Unemployment rate (%)"
both <- rbind(u2, n2)
# First: hacky ggplot2 plot
p <- ggplot(both, aes(x=as.Date(date), y=rate)) +
facet_grid(panel~., scale="free_y") +
layer(data=u2, geom="smooth", method="loess") +
layer(data=u2, geom="point") +
layer(data=n2, geom="bar", stat="identity") +
theme_bw() + labs(y="", x="") +
ggtitle("United States employment statistics (2009 – 2013)")
ggsave(p, file="ggplot_svg.svg")
# Or separate plots, but mis-aligned
require("gridExtra")
grid.arrange(
ggplot(unemployment, aes(x=as.Date(date), y=rate)) +
geom_smooth(method="loess") + geom_point() +
labs(x="", y="Unemployment rate (%)") +
theme_classic(),
ggplot(net.change, aes(x=as.Date(date), y=change)) +
geom_bar(stat="identity") +
labs(y="Net change per month (000s of employees)",
x="") + theme_classic()
, ncol=1)
## base R version
dev.off()
## This recreates the original figure as close as possible (well, ish)
pdf("recreated.pdf", 7, 5)
par(mar=c(3,4.5,5,4.2), mgp=c(1.8,.65,0))
# sort by date
net.change <- net.change[order(net.change$date),]
# original starts at 09
net.change <- net.change[net.change$date > as.yearmon("2008-12"),]
#par(yaxs="i")
bpos <- barplot(net.change$change, plot=F)
x <- as.Date(net.change$date)
# scale will be integer (== month)
plot(1:length(x), net.change$change, type="n", ylim=c(-1000, 600),
frame=F, axes=F, xlab="", ylab="", xlim=c(2.7, length(x)-1.7))
rect(xleft=1:length(x)-.3, xright=1:length(x)+.3,
ybottom=0, ytop=net.change$change, col="#4F81BD", lend=2)
abline(h=0, lwd=1)
## Awful code, avert your eyes
labs <- format(as.Date(as.character(
cut(as.Date(seq(x[1], x[length(x)], length.out=5),
format= "%m/%Y"), breaks="years"))), "%m/%d%/%y")
labs <- c(gsub("01","1", labs), "")
axis(1, at=seq(1, length(x), length.out=6), tick=F,
labels=labs, las = 1)
axis(2, at=seq(-1000, 600, by=200), las=1, tck=-0.015)
mtext("Number jobs lost/created", side=2, col="#1F497D", line=3)
## now unemployment
unemployment <- unemployment[order(unemployment$date),]
unemployment <- unemployment[unemployment$date > as.yearmon("2008-12"),]
par(new=T)
# set up same plot
plot(1:length(x), unemployment$rate, type="n", ylim=c(6.5, 10),
frame=F, axes=F, xlab="", ylab="", xlim=c(2.7, length(x)-1.7))
lines(1:length(x), unemployment$rate, col="#C0504D", lwd=3.5)
axis(4, las=1, tck=-.015)
#mtext("Percent unemployed", side=4, col="#4F81BD", line=1, las=0, adj=0)
legend("bottom", legend=c("Unemployment rate", "1 month Net Change (000's)"),
col=c("#C0504D", "#4F81BD"), lty=c(1, NA), lwd=c(3.5, NA),
fill=c(0, "#4F81BD"), merge=T, border=NA, text.font=2)
text(length(x)*1.12, (par("usr")[4] + par("usr")[3])/2, "Percent unemployed",
srt = 270, xpd = TRUE, col="#1F497D")
mtext("United States Employment Statistics\n Jan 2009 – Dec 2013", side=3,
col="#1F497D", cex=1.4, line=2, font=2)
mtext("Monthly change, seasonally adjusted", side=3, cex=1.1, line=1)
dev.off()
## This is the same figure but with a few minor improvements
dev.off()
svg("new.svg", 7.5, 5)
par(mar=c(5.5,4.5,5,3.8), mgp=c(1.8,.7,0))
net.change <- net.change[order(net.change$date),]
net.change <- net.change[net.change$date > as.yearmon("2008-12"),]
bpos <- barplot(net.change$change, plot=F)
x <- as.Date(net.change$date)
# scale will be integer (== month)
plot(1:length(x), net.change$change, type="n", ylim=c(-1000, 600),
frame=F, axes=F, xlab="", ylab="", xlim=c(2.7, length(x)-1.7))
rect(xleft=1:length(x)-.3, xright=1:length(x)+.3,
ybottom=0, ytop=net.change$change, col="#4F81BD", lend=2)
abline(h=0, lwd=1)
labs <- format(as.Date(as.character(cut(as.Date(seq(x[1], x[length(x)], length.out=10),
format= "%M %Y"), breaks="months"))), "%b %Y")
axis(1, at=seq(1, length(x), length.out=length(labs)),
labels=labs, las=3)
axis(2, at=seq(-1000, 600, by=200), las=1)
mtext("Net change in employment per month (1000s of jobs)",
side=2, col="#1F497D", line=3)
## now unemployment
unemployment <- unemployment[order(unemployment$date),]
unemployment <- unemployment[unemployment$date > as.yearmon("2008-12"),]
par(new=T)
# set up same plot
plot(1:length(x), unemployment$rate, type="n", ylim=c(6.5, 10),
frame=F, axes=F, xlab="", ylab="", xlim=c(2.7, length(x)-1.7))
lines(1:length(x), unemployment$rate, col="#C0504D", lwd=3.5)
axis(4, las=1)
text(length(x)*1.1, (par("usr")[4] + par("usr")[3])/2, "Percent unemployed",
srt = 270, xpd = TRUE, col="#C0504D")
text(length(x)*.85, ((par("usr")[4] + par("usr")[3])/4) *1.65, "Unemployment rate",
xpd = TRUE, col="#C0504D")
text(length(x)*.215, ((par("usr")[4] + par("usr")[3])/4) *1.75, "Jobs created or lost",
xpd = TRUE, col="#4F81BD")
mtext("United States Employment Statistics\n Jan 2009 – Dec 2013", side=3,
col="#1F497D", cex=1.4, line=2, font=2)
mtext("Monthly change, seasonally adjusted", side=3, cex=1.1, line=1)
dev.off()


Filed under R, Wikipedia

Analyse your bank statements using R

Online banking has made reviewing statements and transferring money more convenient than ever before, but most people still rely on external methods for looking at their personal finances. However, many banks will happily give you access to long-term transaction logs, and these provide a great opportunity to take a DIY approach.

I’ll be walking through a bit of analysis I tried on my own account (repeated here with dummy data) to look for long-term trends in outgoing expenses. Incidentally, the reason I did this analysis was the combination of a long train journey and just 15 minutes of free Wi-Fi (in C21?!), ergo a short time to get hold of some interesting data and a considerably longer time to stare at it.

Getting the data

First you need to grab the raw data from your online banking system. My account is with Natwest (UK), so it’s their format of output I’ll be working with, but the principles should be easy enough to apply to the data from other banks.

Natwest offers a pretty straightforward Download Transactions dialogue sequence that’ll let you get a maximum of 12 months of transactions as a comma-separated value (CSV) flat file; it’s this we can download and analyse.

Download transaction history for the previous year as CSV.

Read this file you’ve downloaded into a data.frame:

s <- read.csv("<filename.csv>", sep=",", row.names=NULL)
colnames(s) <- c("date", "type", "desc", "value", 
                 "balance", "acc")
s$date <- as.Date(s$date, format="%d/%m/%Y")

# Only keep the useful fields
s <- s[,1:5]

This will give you a 5-column table containing these fields:

  1. Date
  2. Type
  3. Description
  4. Value
  5. Balance

It should go without saying that the CSV contains sensitive personal data, and should be treated as such — your account number and sort code are present on each line of the file!

Parsing the statement

The most important stage of processing your transaction log is to classify each one into some meaningful group. A single line in your transaction file may look like this:

07/01/2013,POS,"'0000 06JAN13 , SAINSBURYS S/MKTS , J GROATS GB",-15.90,600.00,"'BOND J","'XXXXXX-XXXXXXXX",

Given the headers above, we can see that most of the useful information is contained within the quoted Description field, which is also full of junk. To get at the good stuff we need the power of regular expressions (regexp), but thankfully some pretty simple ones.

In fact, given the diversity of labels in the description field, our regular expressions end up essentially as lists of related terms. For example, maybe we want to group cash machine withdrawals; by inspecting the description fields we can pick out some useful words, in this case bank names like NATWEST, BARCLAYS and CO-OPERATIVE BANK. Our “cash withdrawal” regexp could then be:

"NATWEST|BARCLAYS|BANK"

And we can test this on our data to make sure only relevant rows are captured:

s[grepl("NATWEST|BARCLAYS|BANK", s$desc),]

Now you can rinse and repeat this strategy for any and all meaningful classes you can think of.

# Build simple regexp strings
coffee <- "PRET|STARBUCKS|NERO|COSTA"
cash <- "NATWEST|BARCLAYS|BANK"
food <- "TESCO|SAINSBURY|WAITROSE"
flights <- "EASYJET|RYANAIR|AIRWAYS"
trains <- "EC MAINLINE|TRAINLINE|GREATER ANGLIA"
# Do this for as many useful classes as you can think of

# Add a class field to the data, default "other"
s$class <- "Other"

# Apply the regexp and return their class
s$class <- ifelse(grepl(coffee, s$desc), "Coffee",
  ifelse(grepl(cash, s$desc), "Cash",
    ifelse(grepl(food, s$desc), "Food",
      ifelse(grepl(flights, s$desc), "Flights",
        ifelse(grepl(trains, s$desc), "Trains", "Other")))))

Aggregating and plotting the data

Now we’ve got through some pre-processing, we can build useful plots in R using the ggplot2 package. It’ll also be useful to aggregate transactions per month, and to do this we can employ another powerful R package from Hadley Wickham, plyr.

# Add a month field for aggregation
s$month <- as.Date(cut(s$date, breaks="month"))

# NB. remove incoming funds to look at expenses!
s <- subset(s, s$value < 0)

# Build summary table of monthly spend per class
library(plyr)
smr <- ddply(s, .(month, class), summarise, 
             cost=abs(sum(value)))

Now we can plot these monthly values and look for trends over the year by fitting a statistical model to the observed data. In this example, I’ll use the loess non-linear, local regression technique which is one of the available methods in the geom_smooth layer.

library(ggplot2)
ggplot(smr, aes(month, cost, col=class)) +
  facet_wrap(~class, ncol=2, scale="free_y") +
  geom_smooth(method="loess", se=F) + geom_point() +
  theme(axis.text.x=element_text(angle=45, hjust=1),
        legend.position="none") +
  labs(x="", y="Monthly total (£)")

Monthly totals for each class of expense are shown over 12 months for example data.

In this example, it seems the person has possibly stopped paying for things in cash as much, and has swapped trains for flying! However, a significant amount of the transaction log remains classified as “Other” — these transactions could be split into several more useful classes with more judicious use of regexp. This becomes pretty obvious when you look at the mean monthly spend per class:

yl <- ddply(smr, .(class), summarise, m=mean(cost))

ggplot(yl, aes(x=class, y=m)) +
  geom_bar(stat="identity") +
  labs(y="Average monthly expense (£)", x="")

Overwhelmingly “other” — needs more work!

Hopefully this gives you some ideas of how to investigate your own personal finance over the past year!


Here’s the full code to run the above analysis, which should work as-is on any CSV format transaction history downloaded for a single Natwest account.


library(plyr)
library(ggplot2)
s <- read.csv("<filename.csv>", sep=",", row.names=NULL)
colnames(s) <- c("date", "type", "desc", "value", "balance", "acc")
s$date <- as.Date(s$date, format="%d/%m/%Y")
s <- s[,1:5]
s$month <- as.Date(cut(s$date, breaks="month"))
s <- subset(s, s$value < 0)
# Build simple regexp strings
coffee <- "PRET|STARBUCKS|NERO|COSTA"
cash <- "NATWEST|BARCLAYS|BANK"
food <- "TESCO|SAINSBURY|WAITROSE"
flights <- "EASYJET|RYANAIR|AIRWAYS"
trains <- "EC MAINLINE|TRAINLINE|GREATER ANGLIA"
# Do this for as many useful classes as you can think of
# Add a class field to the data, default "other"
s$class <- "Other"
# Apply the regexp and return their class
s$class <- ifelse(grepl(coffee, s$desc), "Coffee",
ifelse(grepl(cash, s$desc), "Cash",
ifelse(grepl(food, s$desc), "Food",
ifelse(grepl(flights, s$desc), "Flights",
ifelse(grepl(trains, s$desc), "Trains", "Other")))))
smr <- ddply(s, .(month, class), summarise, cost=abs(sum(value)))
ggplot(smr, aes(month, cost, col=class)) +
facet_wrap(~class, ncol=2, scale="free_y") +
geom_smooth(method="loess", se=F) + geom_point() +
theme(axis.text.x=element_text(angle=45, hjust=1),
legend.position="none") +
labs(x="", y="Monthly total (£)")
yl <- ddply(smr, .(class), summarise, m=mean(cost))
ggplot(yl, aes(x=class, y=m)) +
geom_bar(stat="identity") +
labs(y="Average monthly expense (£)", x="")



Filed under R