Author inflation in academic literature

There seems to be a general consensus that author lists in academic articles are growing. Wikipedia says so, and I’ve also come across a published letter and short Nature article which accept this is the case and discuss ways of mitigating the issue. Recently there was an interesting discussion on academia.stackexchange on the subject but again without much quantification. Luckily given the array of literature database APIs and language bindings available, it should be pretty easy to investigate with some statistical analysis in R.

rplos

rOpenSci offers nice R language bindings for the PLOS API (I’m more used to “PLoS” but I’ll go with it), called rplos. There’s no particular reason to limit the search to PLOS journals, but rplos seems significantly more straightforward to work with than PubMed API packages I’ve used in the past, like RISmed.

Additionally the PLOS group contains two journals of particular interest to me:

  • PLOS Computational Biology — a respectable specialist journal in my field; have bioinformatics articles been particularly susceptible to author inflation?
  • PLOS ONE — the original mega-journal. I wonder if the huge number of articles published here show different trends in authorship over time.

The only strange part of the search was at the PLOS API end. To search by publication year you need to supply the publication_date field with both the beginning and end of a date range, in the form:

publication_date:[2001-01-01T00:00:00Z TO 2013-12-31T23:59:59Z]

A tad verbose, right? I can’t imagine wanting to search for things published at a particular time of day. A full PLOS API query using the rplos package looks something like this:

searchplos(
  # Query: publication date in 2012
  q  = 'publication_date:[2012-01-01T00:00:00Z TO 2012-12-31T23:59:59Z]', 

  # Fields to return: id (doi) and author list
  fl = "id,author", 

  # Filter: only actual articles in journal PLOS ONE
  fq = list("doc_type:full",
            "cross_published_journal_key:PLoSONE"), 

  # 500 results (max 1000 per query)
  start=0, limit=500, sleep=6)

A downside of using the PLOS API is that the set of journals is quite recent: PLOS ONE started in 2006 and PLOS Biology only a few years earlier, in 2003, so it’ll only give us a limited window into any long-term trends.

Distribution of author counts

Before looking at inflation we can compare the distribution of author counts per paper between the journals:

Distribution of author counts
ECDF per journal

Possibly more usefully — but less pretty — the same data can be plotted as empirical cumulative distribution functions (ECDF). From these we can see that PLOS Biology had the highest proportion of single-author papers in my sample (n = ~22,500 articles overall), followed by PLOS Medicine, with PLOS Genetics showing more high-author papers in the long tail of the distribution, including the paper with the most authors in the sample (Couch et al., 2013, with 270 authors).
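For reference, here’s a minimal sketch of how such an ECDF comparison can be drawn with ggplot2, assuming a data frame authors (a hypothetical name) with one row per paper and columns journal and n_authors:

library(ggplot2)

# One ECDF curve per journal; a log scale spreads out the long tail
ggplot(authors, aes(x = n_authors, colour = journal)) +
  stat_ecdf() +
  scale_x_log10() +
  labs(x = "Authors per paper", y = "Empirical cumulative density")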

Author inflation

So, in these 6 different journals published by PLOS, how has the mean number of authors per paper varied across the past 7 years?

PLOS author inflation

Above I’ve shown yearly means plus their 95% confidence intervals, as estimated by a non-parametric bootstrap method implemented in ggplot2. Generally from this graph it does look like there’s a slight upward trend on average, though arguably the mean is not the best summary statistic to use for this data, which I’ve shown is not normally distributed, and may better fit an extreme value distribution.
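A sketch of that summary plot, again assuming the hypothetical per-paper data frame authors with a numeric year column; mean_cl_boot is ggplot2’s wrapper around Hmisc’s bootstrap confidence interval, so Hmisc needs to be installed:

library(ggplot2)

# Yearly mean author count with 95% bootstrap confidence intervals
ggplot(authors, aes(x = year, y = n_authors)) +
  stat_summary(fun.data = mean_cl_boot, geom = "pointrange") +
  labs(x = "Publication year", y = "Mean authors per paper")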

The relationship between publication date and number of authors per paper becomes clearer if it’s broken down by journal:

Author inflation regression

Here linear regression reveals a significant positive coefficient for year against mean author count per paper in each journal, ranging from as high as 0.52 extra authors per year on average down to just 0.06 a year for PLOS ONE. Surprisingly, the mega-journal, which published around 80,000 papers over this time period, seems least susceptible to author inflation.

Author inflation per journal

The explained variance in mean number of authors per paper (per year) ranges from 0.28 (PLOS ONE) up to an impressive 0.87 for PLOS Medicine, with PLOS Computational Biology not far behind on 0.83. However, PLOS Computational Biology had the second-lowest regression coefficient, and the lowest average number of authors of the six journals — maybe us introverted computer types should be collaborating more widely!
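A hedged sketch of how those per-journal fits might be produced from the same hypothetical authors data frame, using plyr to get yearly means and then lm() per journal:

library(plyr)

# Mean authors per paper for each journal-year
yearly <- ddply(authors, .(journal, year), summarise,
                mean_authors = mean(n_authors))

# One linear model per journal: mean authors ~ year
fits <- dlply(yearly, .(journal), function(d) lm(mean_authors ~ year, data = d))

# Slope (extra authors per year) and R squared for each journal
ldply(fits, function(m)
  data.frame(slope = coef(m)["year"], r.squared = summary(m)$r.squared))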

Journal effects

It’s interesting to speculate on what drives the observed differences in author inflation between journals. A possible covariate is the roundly-condemned “Impact Factor” journal-level metric — are “high impact” journals seeing more author creep than lesser publications?

Correlation of author inflation and impact factor

If the estimate of author inflation is plotted against each journal’s recent impact factor, there does indeed appear to be a positive correlation. However, this comparison only has 6 data points, so there’s not enough evidence to reject the null hypothesis that there is no relationship between these two variables (p = 0.18).
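The test behind that p-value is just a correlation on six journal-level points; a minimal sketch, assuming a hypothetical data frame journals with columns slope (the author-inflation estimate) and impact (recent impact factor):

# Pearson correlation between author-inflation slope and impact factor;
# with n = 6 the test has little power, hence the non-significant result
cor.test(journals$slope, journals$impact)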

Conclusion

Is author inflation occurring?

Yes, it certainly appears to be on average.

Is it a problem?

I don’t know, but I’d lean towards probably not.

The average trends could be reflecting the proliferation of “Big Science”, with huge collaborative consortiums like ENCODE and FANTOM (though the main papers from those examples were targeted at Nature), and don’t necessarily support the conclusion that publish-or-perish culture is forcing lots of token authorships and backhanders between scientists.

Maybe instead (as the original discussion hypothesised), people who traditionally may not have been credited with authorship (bioinformaticians doing end-point analysis and lab technicians) are now getting recognised for their input more often; or conceivably advances in cloud computing, distributed data storage and video conferencing have better enabled larger collaborations between scientists across the globe!

Either way, evidence for author inflation is not evidence of a problem per se :)

Caveats

  • Means used for regression — while we get a surprisingly high R² when regressing the mean number of authors per paper per year, the predictive power for individual papers unsurprisingly vanishes (R² plummets to between 0.02 and 4.6 × 10⁻⁴ per journal, though significant non-zero coefficients remain). Author inflation wouldn’t be expected to exhibit consistent and pervasive effects in all papers; for example, reviews, letters and opinion pieces presumably have consistently lower author counts than research articles, and not all science can work in a collaborative, multi-author framework.
  • Search limits — rplos returns a maximum of 1000 results at a time (but they can be returned sequentially using the start and limit parameters). They seem to be drawn in reverse chronological order, so the results here probably aren’t fully representative of the year in some cases. This has also meant my sample is unevenly split between journals: PLoSBiology: 2487; PLoSCompBiol: 3403; PLoSGenetics: 4013; PLoSMedicine: 2094; PLoSONE: 7176; PLoSPathogens: 3647; Total: 22,460.
  • Resolution — this could be done in a more fine-grained way, say with monthly bins. As mentioned above, for high-volume journals like PLOS ONE, the sample likely comes from the end of each year from ~2010 onwards.

Full code to reproduce analysis


Guardian data blog — UK general election analysis in R

The Guardian newspaper has for a few years been running a data blog and has built up a massive repository of (often) well-curated datasets on a huge number of topics. They even have an indexed list of all data sets they’ve put together or reused in their articles.

It’s a great repository of interesting data for exploratory analysis, and there’s a low barrier to entry in terms of getting the data into a useful form. Here’s an example using UK election polling data collected over the last thirty years.

ICM polling data

The Guardian and ICM Research have conducted monthly polls on voting intentions since 1984, usually with a sample size of between 1,000 and 1,500 people. It’s not made obvious how these polls are conducted (cold-calling?), but for what it’s worth ICM is a member of the British Polling Council, and so hopefully tries to monitor and correct for things like the “Shy Tory Factor” — the observation that Conservative voters supposedly have (or had prior to ’92) a greater tendency to conceal their voting intentions than Labour supporters.

Preprocessing

The data is made available from The Guardian as a .csv file via Google spreadsheets here and requires minimal cleanup: cut the source information from the end of the file and you can open it up in R.


sop <- read.csv("StateOfTheParties.csv", stringsAsFactors=F)

## Data cleanup
sop[,2:5] <- apply(sop[,2:5], 2, function(x) as.numeric(gsub("%", "", x)))
sop[,1] <- as.Date(sop[,1], format="%d-%m-%Y")
colnames(sop)[1] <- "Date"

# correct for some rounding errors leading to 101/99 %
sop$rsum <- apply(sop[,2:5], 1, sum)
table(sop$rsum)
sop[,2:5] <- sop[,2:5] / sop$rsum

Then after melting the data.frame down (full code at the end of the post), you can get a quick overview with ggplot2.
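A rough sketch of that step (the melt and the area overview), assuming the cleaned sop data frame from above, with party shares in columns 2 to 5:

library(reshape2)
library(ggplot2)

# Long format: one row per date/party combination
sopm <- melt(sop[, 1:5], id.vars = "Date",
             variable.name = "party", value.name = "share")

# Stacked area chart of voting-intention share over time
ggplot(sopm, aes(x = Date, y = share, fill = party)) +
  geom_area() +
  labs(x = "", y = "Share of voting intention")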

UK general election overview 1984-2014

Outlines (stacked bars) represent general election results

Election breakdown

The area plot is a nice overview but not that useful quantitatively. Given that the dataset includes general election results as well as opinion polling, it’s straightforward to split the above plot by this important factor. I also found it useful to convert absolute dates to be relative to the election they precede. R has an object class, difftime, which makes this easy to accomplish, and calling as.numeric() on a difftime object converts it to a raw number of days (handily accounting for things like leap years).
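As a small self-contained example of the difftime arithmetic, with a made-up poll date and the date of the 1997 general election:

poll     <- as.Date("1996-06-01")
election <- as.Date("1997-05-01")   # the 1997 general election

d <- election - poll   # a difftime object, measured in days
as.numeric(d)          # 334: raw number of days, leap years handled for you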

These processing steps lead to a clearer graph with more obvious stories, such as the gradual and monotonic decline of support for Labour during the Blair years.

UK general election data split by election period

NB: facet headers show the year and result of the election relative to which the (preceding) polling points are plotted.

Next election’s result

I originally wanted to look at this data to get a feel for how things are looking before next year’s (2015) general election, maybe even running some predictive models (obviously I’m no fivethirtyeight.com).

However, graphing the trends in public support for the two main UK parties hints that it’s unlikely to be a fruitful endeavour at this point, and with the above graphs showing ominously increasing support for “other” parties (not accidentally coloured purple), it looks like, with about 400 days to go, the 2015 general election is still all to play for.

lab_con



What are the most common RNG seeds used in R scripts on Github?

In the R programming language, the random number generator (RNG) is seeded each session using the current time and process ID. Via the magic of the popular Mersenne Twister PRNG, the values stored in .Random.seed are used sequentially each time “randomness” is invoked in a function. This means, of course, that the same function run in different R sessions can produce varying results, and in the case of modelling a system sensitive to initial conditions the observed differences could be huge.

For this reason it’s common to manually set the PRNG seed (using set.seed() in R), thereby creating the same .Random.seed vector which can be drawn from in your analysis to produce reproducible results. The actual value passed to this function is irrelevant for practical purposes — for whatever reason I generally use the same number across projects (42) — so this made me wonder: what values do the major R developers tend to pick?
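As a quick aside, the reproducibility that set.seed() buys you is easy to demonstrate:

set.seed(42)
x <- runif(3)   # three "random" numbers

set.seed(42)
y <- runif(3)   # reset the seed and draw again

identical(x, y)   # TRUE: the same draws every time, hence reproducible analyses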

To Github

The Github API is currently in a transitional period between versions 2 and 3 and has (annoyingly) limited code search results to specific users or organisations. This means to perform a code search programmatically, I’ll need a list of R users.

Google BigQuery

One way of building a list is through Github archive. The dev (Ilya Grigorik) has put up a public dataset with Google BigQuery, which is a neat cloud-based platform for querying huge datasets. My SQL isn’t all that, but the Google BigQuery interface is really functional (e.g. it autocompletes table fields) and makes it easy to get the data you’re looking for.

Fishing for prolific R users via Google BigQuery.

In this case I pulled out R code repositories ordered by repo pushes (as a heuristic for codebase size and activity, I guess) with their owner’s username. It was this list of names I then used for the API query.

Github API

It looks like there’s a decent set of R bindings for the Github API, but it’s not clear how they work with code search, so I opted for the messier option of calling curl through system(). To build the command to search the API per user:

# Build a curl call to the code-search API for a given user
# (assumes `oauth` holds a GitHub personal access token)
getCMD <- function(user){
  cmd <- paste0("curl -H 'Accept: application/vnd.github.v3.text-match+json'",
                " -H 'Authorization: token ", oauth, "'",
                " 'https://api.github.com/search/code?q=set.seed+in:file+language:R+user:",
                user, "&page=1&per_page=500' | grep 'fragment' -")
  return(cmd)
}

As you can see this is pretty rough and ready; there may be a pagination issue if someone sets PRNG seeds everywhere, but it’ll do.

You can then run the command and pull out the returned string matches for the query; in this case I searched for “set.seed” and then used Hadley Wickham’s stringr R package to regexp out the number (if any) passed to the function.

getPRNGseeds <- function(user){
  print(user)
  api.result <- system(getCMD(user), intern=T)
  if(is.null(attr(api.result, "status"))) {
    seeds <- cbind(user, 
        str_match(api.result, "set\\.seed\\((\\d+)\\)")[,2])
    Sys.sleep(10)
    return(seeds)
  } else {
    cat(user, " failed")
    return(cbind(user, NA))
  }
}

The Github API spits out JSON data (ignored and just grepped out in the above) so I looked into a couple of smarter ways of parsing it. Firstly there’s the jsonlite R package which offers the fromJSON() function to import JSON data into what resembles a sometimes-hard-to-work-with nested R object. It seems like the Github API query results return too many nested layers to produce a useful object in this case. Another option is jq, a command-line program which has a neat syntax for dealing specifically with the JSON data structure — I’ll definitely be using it for more complex JSON wrangling in the future.
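For completeness, a hedged sketch of the jsonlite route: feed the raw response (i.e. the curl call above without the grep stage) into fromJSON(), which is where the awkward nesting shows up.

library(jsonlite)

# Assumes raw.response holds the ungrepped JSON text returned by the API call
res <- fromJSON(paste(raw.response, collapse = ""))
str(res, max.level = 2)   # several nested layers of lists and data frames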

The data

Despite the harsh search limitations I ended up with 27 users who owned the top 100 R repositories, and of those 15 used set.seed() somewhere (or at least something like it). However the regex fails in some cases — where a variable is being passed to the function instead of an integer, for one. Long story short, I scraped together 187 lines of set.seed(\d+) from some of the big names in the R community and here’s how the counts looked:

RNG seeds

So plain old 1 is the stand-out winner!
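For reference, a tally like this needs little more than table(); seeds here is a hypothetical data frame with one captured number per row, in a column called seed:

# Count each distinct seed value and order by popularity
seed.counts <- sort(table(seeds$seed), decreasing = TRUE)
head(seed.counts, 10)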

There are a few sequences in there (123, 321, 1234, 12345) and some date references (2011, 20051028), but surprisingly few programmer in-jokes or web-culture references, save a lone 1337 and, I guess, some binary.

1410 (or 1014 in less-sensible countries) and 141079 look like they could be a certain R developer’s birthday and birth year, but that’s pure speculation :)

Here’s one of those awful wordle / wordcloud things too.

wordle_crop

Hopefully as the roll-out of the v3 Github API progresses the current search restriction will be lifted; still, this was a fun glimpse at other programmers’ conventions!


R script


Slidify: Modern, simple presentations written in R Markdown

As a LaTeX fan I’m used to using Beamer for presentations, but the built-in themes are definitely starting to show their age — and writing a custom .sty file looks like a nightmare — so for a while I’ve been looking at trying out an HTML5 framework.

Reveal.js is a great looking HTML presentation framework from Hakim El Hattab.

The first nice option I noticed was reveal.js which seems to find a solid balance between looking sleek and modern, but not generating a prezi-like rollercoaster of a talk. Another project I came across, impress.js, probably leans towards the latter, and needs a decent array of web-dev skills to really customise.

These are both nice solutions but require decent web-development skills to take full advantage of, or else offer limited web-UI front-ends. An ideal solution for me would be simple to write and look great from the outset, needing only minor CSS, JavaScript and HTML tweaks to build a good-looking and functional slide deck.

Slidify

Enter slidify, written by Ramnath Vaidyanathan (github), a wrapper around several libraries which lets you go from simple R Markdown to slick HTML presentations. The introductory presentation gives a nice overview of what’s possible and how simple these slide decks can be to write.

As the author modestly points out, slidify really is a go-between to other R packages:

  • knitr (link) — (a replacement for Sweave) think IPython notebook for R and other languages
  • whisker (link) — for making use of mustache (geddit?) “logic-less templating”, which reminds me a lot of MediaWiki markup templates with extended functionality
  • R Markdown — the markdown extension introduced by the RStudio team

Together with slidify these packages make writing and customising presentations a breeze, so install the library from github (using Hadley Wickham’s devtools) per the instructions here. It also comes with some great default themes, like Google’s io2012 (my favourite) and deck.js. The video below shows how to get started authoring presentations much better than I could:
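For reference, the install-and-scaffold steps look something like this (GitHub repository names assumed from the slidify documentation):

library(devtools)
install_github("ramnathv/slidify")
install_github("ramnathv/slidifyLibraries")

library(slidify)
author("mydeck")   # creates a skeleton deck with an index.Rmd ready to edit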

Features

There’s a tonne of cool things slidify can do that I haven’t even explored yet, but that look great. Of course, through knitr you can embed R code, including analysis and plot generation, in your presentation, bringing together reproducible analysis and neat presentation of your results. Even cooler, it plays nice with rCharts — from the same author — allowing interactive charts to be embedded in presentations; oh and Shiny applications can be added too, according to this.

Slidify enables straightforward github publishing (just author()) and RStudio allows quick upload to RPubs; both make it trivial to keep an online and archived copy of your presentation.

Convert to PDF

A PDF is a nice security blanket to have if you’re worried about unforeseeable display issues on presentation day — it’s a format designed to be environment independent after all. With the io2012 theme, this can be done natively from Chrome using the print dialogue; however, I consistently found that, for my presentation at least, the active slide you are viewing and sometimes an adjacent slide are glitched in the PDF output.

Chrome print PDF reproducibly bugs out on the active slide, for my presentation at least.

The hacky fix for this was to go to the final slide of the talk, print all but the last slide to PDF, then go back to an earlier slide and only print to file that remaining last slide. Then stitch these files together in Preview (assuming OS X), or Imagemagick or whatever.

Other than this active-slide glitch the PDF conversion worked surprisingly well and the output is passable as a decent presentation, albeit without the finesse of the subtle default transitions.

Update:

On twitter Ramnath points out that this is a recent problem with the Chrome browser, and Safari or Firefox should be able to export to PDF without issue. A quick check with Safari confirms that’s the case for my presentation.

Issues

There were a couple of things that either can’t be done (without digging deep into the js) or at least things I couldn’t figure out after some (non-extensive) googling.

Image features

First, making things appear sequentially (like PowerPoint bullet points) is achieved simply with:

---

## Slide title

> * First bullet point to appear

> * Second...

But for an image to then appear in the same way seems to require a continuation of the ul, i.e. your image needs a bullet point (?). Maybe I’m wrong on this but when included without the bullet point, the image seems to then precede the other bulleted content.

Another issue was resizing and centering images. I made use of the code from this answer on SO to add the quick and dirty CSS / jQuery to auto-centre / reduce oversized images for the slideshow. For me it would be nice for this to be default behaviour, but I suppose for a web developer this is trivial everyday stuff:

<!-- Limit image width and height -->
<style type="text/css">
img {     
  max-height: 560px;     
  max-width: 964px; 
}
</style>

<!-- Center image on slide -->
<script type="text/javascript" src="http://ajax.aspnetcdn.com/ajax/jQuery/jquery-1.7.min.js"></script>
<script type="text/javascript">
$(function() {     
  $("p:has(img)").addClass('centered'); 
});
</script>

RStudio integration

This caused some confusion for me, but RStudio actually has its own presentation framework and uses slightly different markdown syntax to create it. On reviewing the two, it doesn’t seem as developed as Slidify yet, and the defaults aren’t as polished as the io2012 deck. The confusing part is, somewhere between the packages slidify and slidifyLibraries a function overloads RStudio’s Knit HTML button, faking seamless integration.

The result is that the IDE is a great place to write the presentation, but I can’t help thinking, as was mentioned on twitter recently, that the slidify framework would make a nice alternative or replacement for RStudio’s current offering.

Customising the title slide

One of the problems I had was editing the theme’s title slide. Most of the presentation is amenable to CSS hacking but the title slide is stubbornly hardcoded. (Well, it’s not of course but the file is buried in the library install.)

The way I got at this file was by changing the presentation mode (within the YAML frontmatter) to selfcontained and running slidify, which copies the libraries folder to the same directory as your .Rmd file. Then, for the io2012 deck, the title slide template is at (thanks Ramnath):

libraries/frameworks/io2012/partials/titleslide.html
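For reference, the frontmatter change described above looks something like this (field names as I understand the slidify conventions):

---
title     : My presentation
framework : io2012
mode      : selfcontained
---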

Conclusion

I found slidify to be a great package and I ended up with what I consider to be the cleanest and nicest-looking presentation I’ve made to date. Also I’ve learnt a bit of web programming along the way! I expect I’ll be switching from beamer to slidify for future talks too.

The final(ish) presentation in all its glory. Link may not work for long.


Meticulously recreating bitmap plots in R

There’s a hard-fought drive on Wikimedia Commons to convert those images that should be in vector format (i.e. graphs, diagrams) from their current bitmap form. At the time of writing, the relevant category, “Images that should use vector graphics”, contains over 7,000 images.

The usual way people move between the two is by tracing over the raster, and great tools like Inkscape (free open-source software) can help a lot with this. But in the case of graphs I thought it’d be fun to try and rebuild a carbon copy from scratch in R.

The original

This is the original bitmap plot I wanted to recreate. (Courtesy of Wikimedia Commons)

The file that first caught my eye was this nice graph of US employment stats, currently used on the highly-trafficked Obama article. I’m not sure what drew this originally; it doesn’t look like Excel, because of the broken axis and annotations, but maybe it is. It’s currently a PNG at about 700 × 500 pixels, so it should be an easy target for improvement.

Figure 2.0

The two raw data files are available here and here as Excel spreadsheets. They have some weird unnecessary formatting so the various xls parsers for R won’t work; save the tables from Excel as csv. I won’t talk through the code as it wasn’t too taxing (or clean) but it’ll be at the end of the post. Here’s what I came up with:

I realise the irony in having to upload a bitmap version for wordpress, but click for the SVG.

I expanded my plot to include the 2013 data, so it inescapably has slightly different proportions to the original. And I was working on a single monitor at the time, so I didn’t have a constant comparison. I can see now that a few things are still off (the fonts are differently sized, for one, and I ditched the broken axis), but overall I think it’s a decent likeness!

ggplot2 version

Two y-axes on the same graph is bad, bad, bad and unsurprisingly forbidden in ggplot2, but I did come across this method of dummy-facetting and then plotting separate layers per facet. An obvious problem is that the y-axes now represent different things and you only have one label. A hacky fix is to write your y labels into the facet headers (I’m 100% confident Hadley Wickham and Leland Wilkinson would not be impressed with this). Another alternative would be to map a colour aesthetic to your y-axis values and label it in the legend (again, pretty far from recommended practice).
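A hedged sketch of the dummy-facet trick, with made-up data frame and column names (jobs and rate, each with date and value columns): give each series its own facet label, then add layers whose data are restricted to one facet, with free y scales standing in for the second axis.

library(ggplot2)

# Label each series; the label doubles as the facet header (and so the "y label")
jobs$panel <- "Total nonfarm payrolls (thousands)"
rate$panel <- "Unemployment rate (%)"

ggplot() +
  geom_line(data = jobs, aes(x = date, y = value)) +
  geom_point(data = rate, aes(x = date, y = value), size = 1) +
  geom_smooth(data = rate, aes(x = date, y = value), method = "loess") +
  facet_wrap(~ panel, ncol = 1, scales = "free_y") +
  labs(x = "", y = "")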

This is what I ended up with; I still think it’s a reasonable alternative to the above, and the loess-fitted model nicely shows the unemployment-rate trend without the seasonality effects:

ggplot_bitmap

Article version

While mimicking the original exactly was fun (for me at least), I tried to improve upon it for the actual final figure for use on Wikipedia. For instance, it now uses unambiguous month abbreviations, and I swapped the legend for colour-coded text labels. It still has some of the original’s charm though. Looks like after a bit of a rough patch, your employment statistics are starting to look pretty good Mr. President.

new_v2

Next up, the other much less attractive figures on that page ([1], [2]).



Wikipedia is dead, long live the ‘pedia

I was a bit surprised when looking at the Wikipedia pageviews for 2013 (nicely presented here). After 5 years of consistent and reasonably stable growth, monthly pageviews actually dropped over 2013, to the tune of 2 *billion* views (10%) from their peak early in the year.

pviews

This was surprising to me. The problem Wikipedia has attracting new editors has been well-publicised, but it’s never had trouble with PageRank or increasing its reach to casual viewers.

Well, it turns out one area seeing consistent and healthy growth is, as you would guess, mobile views, which are showing gains of about 150k pageviews a month on English Wikipedia. This makes up for almost half a billion of those lost over 2013 in the graph above, but still leaves some explaining to do.

mobile

Interestingly, another useful metric of web traffic, unique visitors per month, continues to grow considerably. Maybe this reflects how mobile visitors use the site differently: just looking something up (e.g. to settle an argument) and closing their browser, as opposed to spending a few hours going from topic to topic and ending up admiring a list of Eiffel Tower replicas.

mvvu

A quick graph of mean monthly pageviews per visitor gives this theory some support, but the data seems pretty noisy and has varied a lot over the past few years.

Another possibility is that this data is telling us what we already know: the unique visitors with the highest total page views must be the article writers and the Wikignomes that built the place — and they’ve been in precipitous decline for nearly 6 years now. I’m speculating of course, but maybe that’s starting to show through on the page views site-wide, emphasising how much work a small group of people have been putting in, and the dent they’re leaving in Wikipedia as they leave.


Have I missed something, do you have a better idea of why en.wiki pageviews fell over 2013? Let me know!

Full R code to reproduce the graphs shown in this post is here:


Analyse your bank statements using R

Online banking has made reviewing statements and transferring money more convenient than ever before, but most people still rely on external methods for looking at their personal finances. However, many banks will happily give you access to long-term transaction logs, and these provide a great opportunity to take a DIY approach.

I’ll be walking through a bit of analysis I tried on my own account (repeated here with dummy data) to look for long-term trends in outgoing expenses. Incidentally, the reason I did this analysis was the combination of a long train journey and just 15 minutes of free Wi-Fi (in C21 ?!), ergo a short time to get hold of some interesting data and a considerably longer time to stare at it.

Getting the data

First you need to grab the raw data from your online banking system. My account is with Natwest (UK), so it’s their format of output I’ll be working with, but the principles should be easy enough to apply to the data from other banks.

Natwest offers a pretty straightforward Download Transactions dialogue sequence that’ll let you get a maximum of 12 months of transactions as a comma-separated value (CSV) flat file; it’s this we can download and analyse.

Download transaction history for the previous year as CSV.

Read this file you’ve downloaded into a data.frame:

s <- read.csv("<filename.csv>", sep=",", row.names=NULL)
colnames(s) <- c("date", "type", "desc", "value", 
                 "balance", "acc")
s$date <- as.Date(s$date, format="%d/%m/%Y")

# Only keep the useful fields
s <- s[,1:5]

This will give you a 5-column table containing these fields:

  1. Date
  2. Type
  3. Description
  4. Value
  5. Balance

It should go without saying that the CSV contains sensitive personal data, and should be treated as such — your account number and sort code are present on each line of the file!

Parsing the statement

The most important stage of processing your transaction log is to classify each one into some meaningful group. A single line in your transaction file may look like this:

07/01/2013,POS,"'0000 06JAN13 , SAINSBURYS S/MKTS , J GROATS GB",-15.90,600.00,"'BOND J","'XXXXXX-XXXXXXXX",

Given the headers above, we can see that most of the useful information is contained within the quoted Description field, which is also full of junk. To get at the good stuff we need the power of regular expressions (regexp), but thankfully some pretty simple ones.

In fact, given the diversity of labels in the description field, our regular expressions end up essentially as lists of related terms. For example, maybe we want to group cash machine withdrawals; by inspecting the description fields we can pick out some useful words, in this case bank names like NATWEST, BARCLAYS and CO-OPERATIVE BANK. Our “cash withdrawal” regexp could then be:

"NATWEST|BARCLAYS|BANK"

And we can test this on our data to make sure only relevant rows are captured:

s[grepl("NATWEST|BARCLAYS|BANK", s$desc),]

Now you can rinse and repeat this strategy for any and all meaningful classes you can think of.

# Build simple regexp strings
coffee <- "PRET|STARBUCKS|NERO|COSTA"
cash <- "NATWEST|BARCLAYS|BANK"
food <- "TESCO|SAINSBURY|WAITROSE"
flights <- "EASYJET|RYANAIR|AIRWAYS"
trains <- "EC MAINLINE|TRAINLINE|GREATER ANGLIA"
# Do this for as many useful classes as you can think of

# Add a class field to the data, default "other"
s$class <- "Other"

# Apply the regexps and assign the matching class
s$class <- ifelse(grepl(coffee, s$desc), "Coffee",
             ifelse(grepl(cash, s$desc), "Cash",
               ifelse(grepl(food, s$desc), "Food",
                 ifelse(grepl(flights, s$desc), "Flights",
                   ifelse(grepl(trains, s$desc), "Trains", "Other")))))

Aggregating and plotting the data

Now we’ve got through some pre-processing we can build useful plots in R using the ggplot2 package. It’ll also be useful to aggregate transactions per month, and to do this we can employ another powerful R package from Hadley Wickham, plyr.

# Add a month field for aggregation
s$month <- as.Date(cut(s$date, breaks="month"))

# NB. remove incoming funds to look at expenses!
s <- subset(s, s$value < 0)

# Build summary table of monthly spend per class
library(plyr)
smr <- ddply(s, .(month, class), summarise, 
             cost=abs(sum(value)))

Now we can plot these monthly values and look for trends over the year by fitting a statistical model to the observed data. In this example, I’ll use loess, a non-linear local regression technique, which is one of the methods available to the geom_smooth layer.

library(ggplot2)
ggplot(smr, aes(month, cost, col=class)) +
  facet_wrap(~class, ncol=2, scales="free_y") +
  geom_smooth(method="loess", se=F) + geom_point() +
  theme(axis.text.x=element_text(angle=45, hjust=1),
        legend.position="none") +
  labs(x="", y="Monthly total (£)")

Monthly totals for each class of expense are shown over 12 months for example data.

In this example, it seems the person has possibly stopped paying for things in cash as much, and has swapped trains for flying! However, a significant amount of the transaction log remains classified as “Other” — these transactions could be split into several more useful classes with more judicious use of regexps. This becomes pretty obvious when you look at the mean monthly spend per class:

yl <- ddply(smr, .(class), summarise, m=mean(cost))

ggplot(yl, aes(x=class, y=m)) +
  geom_bar(stat="identity") +
  labs(y="Average monthly expense (£)", x="")
Overwhelmingly “other” — needs more work!

Hopefully this gives you some ideas of how to investigate your own personal finances over the past year!


Here’s the full code to run the above analysis, which should work as-is on any CSV format transaction history downloaded for a single Natwest account.
