Analyse your bank statements using R

Online banking has made reviewing statements and transferring money more convenient than ever before, but most still rely on external methods for looking at their personal finances. However, many banks will happily give you access to long-term transaction logs, and these provide a great opportunity to take a DIY approach.

I’ll be walking through a bit of analysis I tried on my own account (repeated here with dummy data) to look for long-term trends on outgoing expenses. Incidentally, the reason I did this analysis was the combination of a long train journey and just 15 minutes free Wi-Fi (in C21 ?!), ergo a short time to get hold of some interesting data and a considerably longer time to stare at it.

Getting the data

First you need to grab the raw data from your online banking system. My account is with Natwest (UK), so it’s their format output I’ll be working with, but the principals should be easy enough to apply to the data from other banks.

Natwest offers a pretty straightforward Download Transactions dialogue sequence that’ll let you get a maximum of 12 months of transactions as a comma-separated value (CSV) flat file, it’s this we can download and analyse.

Download transaction history for the previous year as CSV.

Download transaction history for the previous year as CSV.

Read this file you’ve downloaded into a data.frame:

s <- read.csv("<filename.csv>", sep=",", row.names=NULL)
colnames(s) <- c("date", "type", "desc", "value", 
                 "balance", "acc")
s$date <- as.Date(s$date, format="%d/%m/%Y")

# Only keep the useful fields
s <- s[,1:5]

This will give you a 5-column table containing these fields:

  1. Date
  2. Type
  3. Description
  4. Value
  5. Balance

It should go without saying that the CSV contains sensitive personal data, and should be treated as such — your account number and sort code are present on each line of the file!

Parsing the statement

The most important stage of processing your transaction log is to classify each one into some meaningful group. A single line in your transaction file may look like this:

07/01/2013,POS,"'0000 06JAN13 , SAINSBURYS S/MKTS , J GROATS GB",-15.90,600.00,"'BOND J","'XXXXXX-XXXXXXXX",

Given the headers above, we can see that most of the useful information is contained within the quoted Description field, which is also full of junk. To get at the good stuff we need the power of regular expressions (regexp), but thankfully some pretty simple ones.

In fact, given the diversity of labels in the description field, our regular expressions end up essentially as lists of related terms. For example, maybe we want to group cash machine withdrawals; by inspecting the description fields we can pick out some useful words, in this case bank names like NATWEST, BARCLAYS and CO-OPERATIVE BANK. Our “cash withdrawal” regexp could then be:

"NATWEST|BARCLAYS|BANK"

And we can test this on our data to make sure only relevant rows are captured:

s[grepl("NATWEST|BARCLAYS|BANK", s$desc),]

Now you can rinse and repeat this strategy for any and all meaningful classes you can think of.

# Build simple regexp strings
coffee <- "PRET|STARBUCKS|NERO|COSTA"
cash <- "NATWEST|BARCLAYS|BANK"
food <- "TESCO|SAINSBURY|WAITROSE"
flights <- "EASYJET|RYANAIR|AIRWAYS"
trains <- "EC MAINLINE|TRAINLINE|GREATER ANGLIA"
# Do this for as many useful classes as you can think of

# Add a class field to the data, default "other"
s$class <- "Other"

# Apply the regexp and return their class
s$class   ifelse(grepl(food, s$desc), "Food",
    ifelse(grepl(flights, s$desc), "Flights",
      ifelse(grepl(trains, s$desc), "Trains", "Other")))))

Aggregating and plotting the data

Now we’ve got through some pre-processing we can build useful plots in R using the ggplot2 package. It’ll also be useful to aggregate transactions per month, and to do this we can employ another powerful R package from Hadley Wickham, plyr.

# Add a month field for aggregation
s$month <- as.Date(cut(s$date, breaks="month"))

# NB. remove incoming funds to look at expenses!
s <- subset(s, s$value < 0)

# Build summary table of monthly spend per class
library(plyr)
smr <- ddply(s, .(month, class), summarise, 
             cost=abs(sum(value)))

Now we can plot these monthly values and look for trends over the year by fitting a statistical model to the observed data. In this example, I’ll use the loess non-linear, local regression technique which is one of the available methods in the geom_smooth layer.

library(ggplot2)
ggplot(smr, aes(month, cost, col=class)) +
  facet_wrap(~class, ncol=2, scale="free_y") +
  geom_smooth(method="loess", se=F) + geom_point() +
  theme(axis.text.x=element_text(angle=45, hjust=1),
        legend.position="none") +
  labs(x="", y="Monthly total (£)")
Monthly totals for each class of expense are shown over 12 months.

Monthly totals for each class of expense are shown over 12 months for example data.

In this example, it seems the person has possibly stopped paying for things in cash as much, and has swapped trains for flying! However a significant amount of the transaction log remain classified as “other” — these transactions could be split into several more useful classes with more judicious use of regexp. This becomes pretty obvious when you look at the mean monthly spend per class:

yl <- ddply(smr, .(class), summarise, m=mean(cost))

ggplot(yl, aes(x=class, y=m)) +
  geom_bar(stat="identity") +
  labs(y="Average monthly expense (£)", x="")
Overwhelmingly "other".

Overwhelmingly “other” — needs more work!

Hopefully this gives you some ideas of how to investigate your own personal finance over the past year!


Here’s the full code to run the above analysis, which should work as-is on any CSV format transaction history downloaded for a single Natwest account.

Advertisements

8 Comments

Filed under R

8 responses to “Analyse your bank statements using R

  1. That´s great! Keep up your good work!

  2. Jason

    I just want to say how grateful I am for your effort. I had the same idea in my head after trying out every budgeting program out there. The main issue (as you’re probably aware) was categorizing each transaction after importing the data. While services link yodlee do it automatically, I wasn’t comfortable with providing login credentials and it doesn’t always categorize everything correctly. Every localized program package I’ve downloaded do not do it automatically. If I had of found just one program that did this, I would have paid for it. I knew R was capable of this kind of analysis (it’s very, very basic) and this will provide a springboard for my own. Thank you again.

  3. Jason

    Do you know if it’s possible to write an R script that retrieves the bank statements automatically from the website’s URL?

    • I think in the US some banks will let you get your data via OFX, which could then be read into R as XML. Here in the UK there are no such APIs as far as I can tell.

      • Jason

        Thanks Benjamin.

        On a side note, I ran into a problem with nested ifelse statements.
        I had more than 51 categories which ran into an error stating that ifelse statements couldn’t be nested any further. I tried to break down the categories into separate ifelse groups for “income” “expense” etc. however they just overwrite the categories already identified in the data frame. I tried doing a “for” loop to go through each row and match the pattern in the transaction description column however the “grep” function doesn’t recognise the placeholder (or temp) “i” as a regular expression.
        I overcame this issue by doing a nested “for” loop as follows for anyone wanted to categorise an account with many, many categories. note: the categories are structured as a data frame with category names as column names and each pattern listed in each element in the column:

        for (j in 1:number of categories){
        cat <- as.character(paste(category data frame[,j], collapse = “|”))
        for (i in 1:length(df)){
        df$category[i] <- ifelse(grepl(cat, df$Description[i]), colnames(cats[j]), df$category[i])
        }
        }

  4. Anonymous

    Hi. Do you know if it is possible to somehow avoid the ifelse chain when populating s$class?

    • Yes there are a few tidier ways, an obvious one is just swapping the nested ifelse for switch. Cleaner still might be to iterate over the regexs until you find a match (nowadays I’d use stringr regexs too).

  5. Thanks! This would be very handy as a web-based tool.

Comment

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s