Monthly Archives: October 2012

9 reasons to use RStudio

In no particular order, here are nine reasons why I really like the RStudio IDE for the R statistical programming language.

1) R benefits from an IDE – I accept that in some languages an IDE is unnecessary—Perl is the first example that comes to mind—and in some languages it’s near-essential (Java). A good case can be made that R is at the former end of the spectrum, but RStudio should be enough to convince you otherwise:

The usefulness of many of the visible panes is clear from the outset (more obviously so than in Eclipse, in my experience). For beginners, the help pane displays the answers to your ?queries, and the upper-right workspace pane lists the data, values and functions in your current workspace. The import dataset button helps write your read command for parsing CSVs or files with other delimiters, and gives a handy preview of the resulting R object. But advanced users will also benefit from the source editor / console combination (admittedly a universal feature of IDEs), tabbed editing, execute from source and possibly also the…

2) Plot device – Often when you’re doing exploratory data analysis in R, you’ll churn out a pile of graphs, only a minority of which will be of interest. Any plot in the Plot pane graphics device can be exported after the fact to pdf/svg/postscript (as well as png, jpeg etc.) at any reasonable size you choose. One niggle has been the differences between plots in the Plot pane and those after export, but this is constantly being improved (v0.97 sets the default export size to the dimensions of the current pane), and it’s fair to say that when you transition from exploratory data analysis to producing high-quality graphics, you should probably be setting up your graphics device explicitly anyway.

3) It runs on your OS – (probably) and there’s a server version too.

4) It’s open source – I don’t just mean it’s free software (though it is), I mean you can grab a copy of its source code from GitHub and, C++ permitting, edit the src and even redistribute it under AGPL.

5) The devs are, well, awesome 

The above tweet references one specific issue but by browsing their support forum it’s easy to find other examples of fast and efficient bug fixes from devs who obviously care about the feedback they receive and improving the functionality of RStudio (for more examples see: [1], [2], [3]).

Additionally, the people now working on RStudio include some big names from the R world, including Hadley Wickham, creator of ggplot2, which can only mean good things (see their blog post).

6) Features, features, features – RStudio supports version control and codebase organisation in the form of projects. Another cool feature which might be of use to some is the manipulate package for dynamically changing plot parameters; I can’t say I’ve ever used this package but I spotted it in the documentation and it seems like a nice tool for last-minute data massage (“Hmm… that does look better without outliers”).

7) Code completion – always handy, though I did feel the need to turn off the automatic matching of quote marks.

8) RPubs and R Markdown – with the push for openness in science, sites like figshare are starting to gain traction as nice ways to share supporting material and preprints, but what better way to share figures and analysis done in R than directly from your RStudio IDE to a site like RPubs? You can craft an R Markdown document in the source editor in RStudio via the knitr package, and then by clicking Publish you can immediately share figures and code with collaborators, tweeps and the blogosphere. The site and corresponding RStudio functionality were added earlier this year, and although I can’t honestly say I’ve written an R Markdown document yet, I’m sure it’s only a matter of time before I and other R users start to engage with this new method of sharing our analyses.

9) It’s actively under development yet stable – well, most of the time. It is still v0 after all…


Filed under R

Oxbridgewich and shaky tables

The Oxbridge application deadline for 2013 entry is tomorrow (!) and aspiring students can soon begin the long wait to find out if they’ll be attending the UK’s most prestigious institutions. But according to UCAS application data from 2011, Oxford and Cambridge combined received fewer applicants than the University of Greenwich, and this is despite a difference of around 100 places in one of the UK league tables. Greenwich even got more applicants per place, approximately 7:1 compared with Cambridge’s 5:1 and Oxford’s 6:1. So, overall, the UoG must be more popular and more selective than these world-renowned institutions. Should we now be referring to Oxbridgewich?

Well, probably not. I was cheating by combining Cambridge and Oxford, as you can only apply to one or the other unless you already have a degree, and due to Greenwich’s lower entry standards it was likely used as a safety application by some who would go on to accept a higher-grade offer (I don’t mean to single out UoG, by the way). My point is that if seemingly obvious metrics like applications per place aren’t really indicators of an institution’s standards, what can we use instead? Well, there are several well-known league tables, but each uses a different methodology; QS, for example, relies heavily on surveying academics, whereas the Guardian’s effort makes more use of the National Student Survey. Unsurprisingly, they give varying results. The Telegraph’s Andrew Marszal wrote on this subject last week, saying:

“Although nominally answering the same question, they don’t share a methodology, a data set or indeed a winner […] in fact the wildly differing outcomes of these tables make them more, not less, useful.”

University rankings: which world university rankings should we trust? (Oct 4 2012)

He justifies the latter argument by referring to expected strengths or weaknesses of the different table methods, implying that if you’re interested in x, then you want table y. I wasn’t completely convinced by this, and browsing the existing tables shows remarkable year-on-year fluctuations:

The above distributions show the changes in rank from 2012 to 2013 for over a hundred UK universities. The largest of these jumps was a rise of 38 places (from 82→44 for Brunel University in the Guardian’s table), and that in a single year! Other big leaps include a fall of 29 places for Leeds Trinity University College and a rise of 30 places for Birmingham City (both from the Guardian’s table). These aren’t small universities either; BCU took on over 5,000 new undergraduates in 2011. Further, there has been no significant change in the analytics that produced the rankings (for the Guardian: “The methodology employed in the tables has generally remained very constant since 2008” [doc]). So how could academic standards, quality of research, job prospects or any other considered metric vary so widely in 12 months at these universities? Are the student surveys that fickle?

In a related vein, I wanted to look at the correlation of university rankings across different tables but discovered it has already been done in some detail by Sawyer, K. et al. (2012) in “Measuring the unmeasurable” (also, the inconsistent naming amongst tables is probably too big an ask for my regex abilities). The authors found that while high-ranking institutions were well-correlated, those lower down the tables were not. They go on to analogise with financial markets and make somewhat fluffy generalisations about the validity of inference… but nevertheless the correlation analysis seems valid. To see if this differential treatment of lower-ranked institutions held in the 2012-13 change data, I did a simple linear regression analysis:

As you can see, while the regression line itself looks an unconvincing fit, it had a significantly non-zero coefficient (0.064 ± 0.012, p = 1.18 × 10⁻⁶). The amount of variance explained by this trend would be a not-uninteresting 19%, so it does generally seem more unstable down there, or at least it is in this snapshot of the THE table. As evidence in favour of the usefulness of tables, a principal components analysis, again using world rankings [pdf], concluded that the variable with the highest explanatory power was indeed academic performance (R² = 0.48), though this study didn’t stratify high and low-ranking educational bodies. In light of the above result, it seems likely that a subset of lower-ranked universities may have a different principal component.
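For anyone wanting to try a similar regression on their own table data, here is a minimal Python sketch of a simple least-squares fit. It is not the analysis behind the numbers above: the function name `simple_linear_regression`, the choice of absolute rank change as the response and the randomly generated data are all assumptions made purely for illustration.

```python
import math
import random


def simple_linear_regression(x, y):
    """Ordinary least-squares fit of y = intercept + slope * x.

    Returns (slope, intercept, slope_std_err, r_squared).
    """
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need two equal-length sequences of at least 3 points")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of squares and cross-products about the means
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    syy = sum((yi - mean_y) ** 2 for yi in y)
    if sxx == 0:
        raise ValueError("x must not be constant")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual sum of squares (clamped at zero to guard against rounding)
    rss = max(0.0, syy - slope * sxy)
    slope_std_err = math.sqrt(rss / (n - 2) / sxx)
    # Proportion of variance in y explained by the fitted line
    r_squared = 1.0 - rss / syy if syy > 0 else 0.0
    return slope, intercept, slope_std_err, r_squared


# Hypothetical data, NOT real league-table figures: 100 institutions whose
# year-on-year rank change gets noisier further down the table.
rng = random.Random(42)
rank_2012 = list(range(1, 101))
abs_rank_change = [abs(rng.gauss(0, 2 + 0.1 * r)) for r in rank_2012]

slope, intercept, std_err, r2 = simple_linear_regression(rank_2012, abs_rank_change)
print(f"slope = {slope:.3f} ± {std_err:.3f}, R² = {r2:.2f}")
```

A positive slope here would mean that larger rank changes tend to occur lower down the table, and R² gives the share of variance in the changes that this trend accounts for.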

Overall, I’m not making an argument against the use of these tables; I know I relied on them when picking out my UCAS choices. But it seems likely that while Oxbridge may have a gentlemanly back-and-forth over the top spots for years to come, the University of Greenwich and its ilk will probably be flying all over the place, and proclamations of ‘this year’s most improved rank’ (e.g. [1], [2], [3]) should be viewed with particular scepticism.


Filed under Musings

Hello, world

Having just edged up to (an admittedly meagre) 100 followers on Twitter, and as I’ve just started my PhD, now seems a good time to finally start a blog. The hope is that by writing about my research and other (un)related topics, I’ll have a lasting record of what I’ve been up to, which might even interest some wandering internet folk. At worst it should help improve my informal writing.

Expected topics include: recent advances in computational biology, particularly regarding higher-order chromatin and gene regulation; programming bits and pieces, focusing on R and Python; life as a PhD student; maybe even whimsical tales of life in Edinburgh—who knows!


Filed under Musings