Monthly Archives: January 2014

Wikipedia is dead, long live the ‘pedia

I was a bit surprised when looking at the Wikipedia pageviews for 2013 (nicely presented here). After 5 years of consistent and reasonably stable growth, over 2013 monthly pageviews actually dropped, and to the tune of 2 *billion* views  (10%) from their peak early in the year.

pviewsThis was surprising to me. The problem Wikipedia has attracting new editors has been well-publicised, but it’s never had trouble with PageRank or increasing its reach to casual viewers.

Well, it turns out one area seeing consistent and healthy growth is, as you would guess, mobile views, which are showing gains of about 150k pageviews a month on English Wikipedia. This makes up for almost half a billion of those lost over 2013 in the graph above, but still leaves some explaining to do.


Interestingly, another useful metric of web traffic, unique visitors per month, continues to grow considerably. Maybe this reflects how mobile visitors use the site differently, just looking something up (e.g. to settle an argument) and closing their browser as opposed to a few hours going from topic to topic and ending up admiring a list of Eiffel tower replicates.


A quick graph of mean monthly pageviews per visitor gives this theory some support, but the data seems pretty noisy and has varied a lot over the past few years.

Another possibility is that this data is telling us what we already know: the unique visitors with the highest total page views must be the article writers and the Wikignomes that built the place — and they’ve been in precipitous decline for nearly 6 years now. I’m speculating of course, but maybe that’s starting to show through on the page views site-wide, emphasising how much work a small group of people have been putting in, and the dent they’re leaving in Wikipedia as they leave.

Have I missed something, do you have a better idea of why pageviews fell over 2013? Let me know!

Full R code to reproduce the graphs shown in this post is here:


Filed under Wikipedia

Analyse your bank statements using R

Online banking has made reviewing statements and transferring money more convenient than ever before, but most still rely on external methods for looking at their personal finances. However, many banks will happily give you access to long-term transaction logs, and these provide a great opportunity to take a DIY approach.

I’ll be walking through a bit of analysis I tried on my own account (repeated here with dummy data) to look for long-term trends on outgoing expenses. Incidentally, the reason I did this analysis was the combination of a long train journey and just 15 minutes free Wi-Fi (in C21 ?!), ergo a short time to get hold of some interesting data and a considerably longer time to stare at it.

Getting the data

First you need to grab the raw data from your online banking system. My account is with Natwest (UK), so it’s their format output I’ll be working with, but the principals should be easy enough to apply to the data from other banks.

Natwest offers a pretty straightforward Download Transactions dialogue sequence that’ll let you get a maximum of 12 months of transactions as a comma-separated value (CSV) flat file, it’s this we can download and analyse.

Download transaction history for the previous year as CSV.

Download transaction history for the previous year as CSV.

Read this file you’ve downloaded into a data.frame:

s <- read.csv("<filename.csv>", sep=",", row.names=NULL)
colnames(s) <- c("date", "type", "desc", "value", 
                 "balance", "acc")
s$date <- as.Date(s$date, format="%d/%m/%Y")

# Only keep the useful fields
s <- s[,1:5]

This will give you a 5-column table containing these fields:

  1. Date
  2. Type
  3. Description
  4. Value
  5. Balance

It should go without saying that the CSV contains sensitive personal data, and should be treated as such — your account number and sort code are present on each line of the file!

Parsing the statement

The most important stage of processing your transaction log is to classify each one into some meaningful group. A single line in your transaction file may look like this:

07/01/2013,POS,"'0000 06JAN13 , SAINSBURYS S/MKTS , J GROATS GB",-15.90,600.00,"'BOND J","'XXXXXX-XXXXXXXX",

Given the headers above, we can see that most of the useful information is contained within the quoted Description field, which is also full of junk. To get at the good stuff we need the power of regular expressions (regexp), but thankfully some pretty simple ones.

In fact, given the diversity of labels in the description field, our regular expressions end up essentially as lists of related terms. For example, maybe we want to group cash machine withdrawals; by inspecting the description fields we can pick out some useful words, in this case bank names like NATWEST, BARCLAYS and CO-OPERATIVE BANK. Our “cash withdrawal” regexp could then be:


And we can test this on our data to make sure only relevant rows are captured:

s[grepl("NATWEST|BARCLAYS|BANK", s$desc),]

Now you can rinse and repeat this strategy for any and all meaningful classes you can think of.

# Build simple regexp strings
# Do this for as many useful classes as you can think of

# Add a class field to the data, default "other"
s$class <- "Other"

# Apply the regexp and return their class
s$class   ifelse(grepl(food, s$desc), "Food",
    ifelse(grepl(flights, s$desc), "Flights",
      ifelse(grepl(trains, s$desc), "Trains", "Other")))))

Aggregating and plotting the data

Now we’ve got through some pre-processing we can build useful plots in R using the ggplot2 package. It’ll also be useful to aggregate transactions per month, and to do this we can employ another powerful R package from Hadley Wickham, plyr.

# Add a month field for aggregation
s$month <- as.Date(cut(s$date, breaks="month"))

# NB. remove incoming funds to look at expenses!
s <- subset(s, s$value < 0)

# Build summary table of monthly spend per class
smr <- ddply(s, .(month, class), summarise, 

Now we can plot these monthly values and look for trends over the year by fitting a statistical model to the observed data. In this example, I’ll use the loess non-linear, local regression technique which is one of the available methods in the geom_smooth layer.

ggplot(smr, aes(month, cost, col=class)) +
  facet_wrap(~class, ncol=2, scale="free_y") +
  geom_smooth(method="loess", se=F) + geom_point() +
  theme(axis.text.x=element_text(angle=45, hjust=1),
        legend.position="none") +
  labs(x="", y="Monthly total (£)")
Monthly totals for each class of expense are shown over 12 months.

Monthly totals for each class of expense are shown over 12 months for example data.

In this example, it seems the person has possibly stopped paying for things in cash as much, and has swapped trains for flying! However a significant amount of the transaction log remain classified as “other” — these transactions could be split into several more useful classes with more judicious use of regexp. This becomes pretty obvious when you look at the mean monthly spend per class:

yl <- ddply(smr, .(class), summarise, m=mean(cost))

ggplot(yl, aes(x=class, y=m)) +
  geom_bar(stat="identity") +
  labs(y="Average monthly expense (£)", x="")
Overwhelmingly "other".

Overwhelmingly “other” — needs more work!

Hopefully this gives you some ideas of how to investigate your own personal finance over the past year!

Here’s the full code to run the above analysis, which should work as-is on any CSV format transaction history downloaded for a single Natwest account.


Filed under R