Category Archives: R

Ligature fonts for R

Ligature fonts are fonts which sometimes map multiple characters to a single glyph, either for readability or just because it looks neat. Importantly, this only affects the rendering of the text with said font, while the distinct characters remain in the source.

The Apple Chancery font with and without ligatures enabled.

Ligatures are an interesting topic in themselves if you’re into typography, but it’s the relatively modern monospaced variants that are most useful in the context of R programming.

Two of the most popular fonts in this category are:

  • Fira Code — an extension of Fira Mono which really goes all out providing a wide range of ligatures for obscure Haskell operators, as well as the more standard set which will be used when writing R
  • Hasklig — a fork of Source Code Pro (in my opinion a nicer base font) which is more conservative with the ligatures it introduces

Here’s some code to try out with these ligature fonts, first rendered in a bog-standard monospace font:

library(magrittr)
library(tidyverse)

filtered_storms <- dplyr::storms %>%
  filter(category == 5, year >= 2000) %>%
  unite("date", year:day, sep = "-") %>%
  group_by(name) %>%
  filter(pressure == max(pressure)) %>%
  mutate(date = as.Date(date)) %>%
  arrange(desc(date)) %>%
  ungroup() %T>%
  print()

Here’s the same code rendered with Hasklig:

[Figure: the code above rendered in Hasklig]

Some of the glyphs on show here are:

  • A single arrow glyph for less-than hyphen (<-)
  • Altered spacing around two colons (::)
  • Joined-up double equals (==)

Fira Code takes this a bit further and also converts >= to a single glyph:

[Figure: the code above rendered in Fira Code]

In my opinion these fonts are a nice and clear way of reading and writing R. In particular the single arrow glyph harks back to the APL keyboards with real arrow keys, for which our modern two-character <- is a poor substitute.

One downside could be a bit of confusion when showing your IDE to someone else, or writing lines that are slightly longer than they appear, but personally I’m a fan and my RStudio is now set to Hasklig.

Filed under R

The Mandelbrot Set in R

The Mandelbrot set is iconic and countless beautiful visualisations have been born from its deceptively simple recursive equation. R’s plotting ecosystem should be the perfect setting for generating these eye-catching visualisations, but to date the package support has been lacking.

Googling for Mandelbrot set implementations in R didn’t immediately strike pay dirt — for sure there are a few scripts and maybe one dusty package out there but nothing definitive. One of the more useful search results was an age-old academic’s page (presumably pre-dating CSS) with a zip archive of an R wrapper around a C implementation of a Mandelbrot set generator. What’s more, the accompanying README bore the epitaph:

Eventually, perhaps in 50 years or so, I’ll put everything together in a
proper R package.

—a siren song to any R developer with too much time on their hands!

Expect a few of these plots in this post (view in shiny)

Mandelbrot R package

The first output was an R package, mandelbrot, which re-wraps the original C code by Mario dos Reis. This provides two interfaces to the underlying set generation algorithm:

  • mandelbrot() for generating an object for use with base R image (see the sketch just after this list)
  • mandelbrot0() for generating a tidy data.frame for use with ggplot2 (equivalent to as.data.frame(mandelbrot()) but faster)
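
For the base graphics route, usage can be as simple as the following sketch. I’m assuming the object returned by mandelbrot() is a list with x, y and z components, which base image() accepts directly; the palette is an arbitrary stand-in.

library(mandelbrot)

# defaults give a view of the whole set
mb <- mandelbrot()

# image() takes a list with x, y and z components
image(mb, col = c(colorRampPalette(c("navy", "white"))(50), "black"))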

It also has a few other helper functions and utilities (see the docs for info). One of the examples in the README reuses a weird uneven palette made for a previous blog post to pretty good effect:

library(ggplot2)
library(mandelbrot)

mb <- mandelbrot(xlim = c(-0.8335, -0.8325),
                 ylim = c(0.205, 0.206),
                 resolution = 1200L,
                 iterations = 1000)

# vaccination heatmap palette
cols <- c(
  colorRampPalette(c("#e7f0fa", "#c9e2f6", "#95cbee",
                     "#0099dc", "#4ab04a", "#ffd73e"))(10),
  colorRampPalette(c("#eec73a", "#e29421", "#e29421",
                     "#f05336","#ce472e"), bias=2)(90),
  "black")

df <- as.data.frame(mb)
ggplot(df, aes(x = x, y = y, fill = value)) +
  geom_raster(interpolate = TRUE) + theme_void() +
  scale_fill_gradientn(colours = cols, guide = "none")

[Figure: output of the ggplot2 code above]

This is great for single views, but you pretty quickly want to explore and zoom interactively. Mario dos Reis and Jason Turner (via r-help) implemented this in R using locator to read the cursor position and zoom. 14 years ago that was a pretty neat solution but today we can take this idea a bit further with the R web framework Shiny.
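
The idea translates to just a few lines today (a sketch of the approach, not the original r-help code; it needs an interactive graphics device):

mb <- mandelbrot()
image(mb)

# click two opposite corners of the region to zoom into
corners <- locator(2)

zoomed <- mandelbrot(xlim = range(corners$x), ylim = range(corners$y))
image(zoomed)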

Shinybrot Shiny app

Shinybrot is a pretty simple Shiny app for exploring views generated by the mandelbrot package. It uses “brushing” for plot interaction, allowing the user to drag a rectangular selection which is then set as the x and y limits for the subsequent plot. This can be recursed to go deeper and deeper into the fractal and get some appreciation of the set’s complexity.
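
Stripped down, that interaction looks something like this (a minimal sketch rather than the actual shinybrot code; the zoom_brush and lims names are my own):

library(shiny)
library(mandelbrot)

ui <- fluidPage(
  # brushing: drag a rectangle on the plot to select the next view
  plotOutput("frac", brush = brushOpts(id = "zoom_brush"))
)

server <- function(input, output, session) {
  lims <- reactiveValues(x = c(-2, 1), y = c(-1.5, 1.5))

  # recurse: the brushed rectangle becomes the new x/y limits
  observeEvent(input$zoom_brush, {
    b <- input$zoom_brush
    lims$x <- c(b$xmin, b$xmax)
    lims$y <- c(b$ymin, b$ymax)
  })

  output$frac <- renderPlot({
    image(mandelbrot(xlim = lims$x, ylim = lims$y))
  })
}

shinyApp(ui, server)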

[Figure: white grid lines appearing at deep zoom]

Not sure why this happens…

Eventually when you go deep enough you bump up against some ggplot2 hard limit where the view is obscured by mysterious white grid lines. (Hadley points out that these aren’t grid lines but gaps between tiles — I think a problem that comes with approaching the limits of R’s numeric precision.) Still, you’re good for a fair few recursions.

Parameters from URI query string

Seahorse valley (view in shiny)

One feature I wanted was static URIs which resolve to a given view. For example, if you want to link to some tiny but interesting region of “Seahorse Valley” (the crevasse between the two primary bulbs), you should be able to direct link to the view you’re looking at.

As usual, Shiny has great support for this out of the box. parseQueryString parses URI parameters of the form /?param=value into a named list. Then, as parameters change, they can be pushed back to the user’s address bar using updateQueryString. Using mode = "push" with updateQueryString pushes the parameterised URI onto the browser’s history stack, meaning users get working forward and back buttons for free!
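
In shinybrot the two halves might look roughly like this (a sketch continuing the brushing example above; the parameter names are mine, not necessarily the app’s):

# restore a view from URI parameters like /?xmin=-0.83&xmax=-0.82
observe({
  query <- parseQueryString(session$clientData$url_search)
  if (!is.null(query$xmin)) {
    lims$x <- as.numeric(c(query$xmin, query$xmax))
    lims$y <- as.numeric(c(query$ymin, query$ymax))
  }
})

# push the current view to the address bar and the history stack
observe({
  updateQueryString(
    sprintf("?xmin=%s&xmax=%s&ymin=%s&ymax=%s",
            lims$x[1], lims$x[2], lims$y[1], lims$y[2]),
    mode = "push")
})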

Try it out in the shinyapps hosted version:

The shinybrot app, view at shinyapps.io or serve locally with shiny::runGitHub("blmoore/shinybrot")

ToDo

Possible extensions would be Julia sets and maybe even more exotic fractal equations (Newton fractals, magnetic fractals, Barnsley ferns?). Apparently there are also some reasonably straightforward optimisations of the Mandelbrot set algorithm, for example shortcuts for blocking out the primary bulbs rather than iterating over each point; however, the simple existing C code Mario put together is already blazingly fast, to the point where generating a view is mostly rasterisation-bound rather than bound by the set calculations themselves.
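
For reference, the standard bulb shortcut is only a few lines (a sketch of the well-known check, not code from the mandelbrot package): points inside the main cardioid or the period-2 bulb never escape, so they can be coloured immediately without iterating.

in_main_bulbs <- function(x, y) {
  # main cardioid test: q * (q + (x - 1/4)) <= y^2 / 4
  q <- (x - 0.25)^2 + y^2
  cardioid <- q * (q + (x - 0.25)) <= 0.25 * y^2
  # period-2 bulb: a circle of radius 1/4 centred on (-1, 0)
  bulb2 <- (x + 1)^2 + y^2 <= 1/16
  cardioid | bulb2
}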

Code

Issues and pull requests are welcome on the mandelbrot and shinybrot GitHub repositories.


After speaking to the author, it turns out he did recently get around to packaging his code: see mariodosreis/fractal

Filed under R, R package

Interactive charts in R

I’m giving a talk tomorrow at the Edinburgh R usergroup (EdinbR) on how to get started building interactive charts in R. I’ll talk about rCharts as a great general entry point to quickly generating interactive charts, and also the newer htmlwidgets movement, which allows interactive charts to be more easily integrated with RMarkdown and Shiny. I’ve also tried to throw in a decent number of Edinburgh-related examples along the way.

Current slides are here:

Click through for HTML slide deck.

I’ve since spun out what started as a simple example for the talk into a live web app, viewable at blackspot.org.uk. Here I’m looking at open data from Edinburgh’s council on vehicle collisions in the city. It’s still under development and will be my first real project in Shiny, but it has already started to come together quite nicely.

Blackspot Shiny web app. Code available on github. NB the UI currently retains a lot of code borrowed from Joe Cheng’s beautiful SuperZip shiny example.

The other speaker for the session is Alastair Kerr (head of bioinformatics at the Wellcome Trust Centre for Cell Biology here in Edinburgh), who will be giving a beginner’s guide to the Shiny web framework. All in all it should be a great meeting; if you’re nearby, do come along!

Filed under R

Recreating the vaccination heatmaps in R

In February the WSJ graphics team put together a series of interactive visualisations on the impact of vaccination that blew up on twitter and facebook, and were roundly lauded as great-looking and effective dataviz. Some of these had enough data available to look particularly good, such as for the measles vaccine:

Credit to the WSJ and creators: Tynan DeBold and Dov Friedman

How hard would it be to recreate an R version?

Base R version

Quite recently Mick Watson, a computational biologist based here in Edinburgh, put together a base R version of this figure using heatmap.2 from the gplots package.

If you’re interested in the code for this, I suggest you check out his blog post where he walks the reader through creating the figure, beginning from heatmap defaults.

However, it didn’t take long for someone to pipe up asking for a ggplot2 version (3 minutes in fact…) and that’s my preference too, so I decided to have a go at putting one together.

ggplot2 version

Thankfully the hard work of tracking down the data had already been done for me. To get at it, follow these steps:

  1. Register and log in to “Project Tycho”
  2. Go to Level 1 data, then “Search and retrieve data”
  3. Now change a couple of options: geographic level := state; disease outcome := incidence
  4. Add all states (highlight all at once with Ctrl+A, or Cmd+A on Macs)
  5. Hit submit and scroll down to “Click here to download results to excel”
  6. Open in Excel and export to CSV

Simple, right?

Now all that’s left to do is a bit of tidying. The data comes in wide format, so it can be melted to our ggplot2-friendly long format with:

library(reshape2)

# measles: the wide-format table exported from Project Tycho
measles <- melt(measles, id.var=c("YEAR", "WEEK"))

After that we can clean up the column names and use dplyr to aggregate weekly incidence rates into an annual measure:

library(dplyr)

colnames(measles) <- c("year", "week", "state", "cases")

mdf <- measles %>%
  group_by(state, year) %>%
  summarise(c = if (all(is.na(cases))) NA else sum(cases, na.rm=T))

mdf$state <- factor(mdf$state, levels=rev(levels(mdf$state)))

It’s a bit crude, but what I’m doing is summing the weekly incidence rates and leaving NAs if there’s no data for a whole year. This seems to match what’s been done in the WSJ article, though a more interpretable measure could be something like average weekly incidence, as used by Robert Allison in his SAS version.
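
That alternative is a one-line change to the aggregation above (swapping the sum for a mean):

mdf_avg <- measles %>%
  group_by(state, year) %>%
  summarise(c = if (all(is.na(cases))) NA else mean(cases, na.rm=T))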

After trying to match colours via the OS X utility Digital Color Meter without much success, I instead grabbed the colours and breaks from the original plot’s javascript to make them as close as possible.

In full, the actual ggplot2 command took a fair bit of tweaking:

# colours and breaks grabbed from the original chart's javascript
cols <- c(
  colorRampPalette(c("#e7f0fa", "#c9e2f6", "#95cbee",
                     "#0099dc", "#4ab04a", "#ffd73e"))(10),
  colorRampPalette(c("#eec73a", "#e29421", "#e29421",
                     "#f05336", "#ce472e"), bias=2)(90),
  "black")

ggplot(mdf, aes(y=state, x=year, fill=c)) +
  geom_tile(colour="white", linewidth=2,
            width=.9, height=.9) +
  theme_minimal() +
  scale_fill_gradientn(colours=cols, limits=c(0, 4000),
                       breaks=seq(0, 4e3, by=1e3),
                       na.value=rgb(246, 246, 246, maxColorValue=255),
                       labels=c("0k", "1k", "2k", "3k", "4k"),
                       guide=guide_colourbar(ticks=T, nbin=50,
                                             barheight=.5, label=T,
                                             barwidth=10)) +
  scale_x_continuous(expand=c(0,0),
                     breaks=seq(1930, 2010, by=10)) +
  # vertical line marking the vaccine's introduction
  geom_segment(x=1963, xend=1963, y=0, yend=51.5, linewidth=.9) +
  labs(x="", y="", fill="") +
  ggtitle("Measles") +
  theme(legend.position=c(.5, -.13),
        legend.direction="horizontal",
        legend.text=element_text(colour="grey20"),
        plot.margin=grid::unit(c(.5,.5,1.5,.5), "cm"),
        axis.text.y=element_text(size=6, family="Helvetica",
                                 hjust=1),
        axis.text.x=element_text(size=8),
        axis.ticks.y=element_blank(),
        panel.grid=element_blank(),
        title=element_text(hjust=-.07, face="bold", vjust=1,
                           family="Helvetica"),
        text=element_text(family="URWHelvetica")) +
  annotate("text", label="Vaccine introduced", x=1963, y=53,
           vjust=1, hjust=0, size=I(3), family="Helvetica")

Result

[Figure: the recreated measles incidence heatmap]

I’m pretty happy with the outcome but there are a few differences: the ordering is out (someone pointed out the original is ordered by two-letter state code rather than full state name) and the fonts are off (as far as I can tell the original uses “Whitney ScreenSmart” among others).

Obviously the original is an interactive chart which works great with this data. It turns out it was built with the highcharts library, which actually has R bindings via the rCharts package, so in theory the original chart could be entirely recreated in R! However, for now at least, that’ll be left as an exercise for the reader…


Full code to reproduce this graphic is on github.

Filed under R

Planning an R usergroup meeting in R

The Edinburgh R usergroup (EdinbR) put together a survey a while back to figure out some of the logistical details for organising a successful meeting. We had 75 responses (and a few more after I grabbed the results), so here’s some quick analysis, all done in R of course. The code and data for these figures are available on the EdinbR github account.

Who’s coming to EdinbR meetings?

[Chart: attendance]

It looks like the majority of our audience would describe themselves as having an “intermediate” level of R knowledge, but there’s a good number of beginners too. The overall number of attendees is promising: 43 said they’d attend either every meeting or most meetings!

When’s the best time and day for meetings?

[Chart: best day]

[Chart: best time]

Taken together, the results for the best time and day were potentially biased by our very first meeting (which was held on a Wednesday at 5pm)…

[Chart: best time and day combined]

Regardless, this result means we’ll stick with Wednesdays at 5pm for now, without prejudice against lunchtime sessions in the future.

Where should meetings be held?

[Chart: venues]

Again, strong support for the status quo: George Square (and Bristo Square, Informatics etc.) are nice, central locations which are ideal for those based in the city centre and a decent compromise for those further afield, such as EdinbR attendees from the IGMM or King’s Buildings.

What do we want to hear about?

The organisers had already put together a very non-comprehensive list of potential topics for future meetings, and so we asked respondents to select those they were interested in hearing about:

[Chart: popular topics]

This gives a rough roadmap for upcoming EdinbR meetings: if you could give a talk about any of the above topics then do let us know; you can reach me by email or on twitter, and do join our mailing list to be notified of our next meeting!

This post is mirrored on edinbr.org

Filed under R

EdinbR: A new R usergroup for Edinburgh

Inspired by successful RUGs like LondonR and CambR, I’m pleased to announce a new R usergroup for those in and around Edinburgh: EdinbR!

Edinburgh has a large research community using R, spread across different campuses and even universities, so a centralised discussion group is long overdue. Many R packages have been developed by Edinburgh researchers, ranging from general parallelisation packages (SPRINT) to highly specific packages targeting cutting-edge genomics data (poRe). It seems like a great idea to get these people talking to each other: developers, users and interested newbies alike!

So without further ado, the details for our first meeting are:

  • Date: Wednesday 18th February 2015, 17:00
  • Venue: Room S1, 7 George Square Psychology building, Edinburgh
  • Topics: Refining plans for meeting format, timing, venues; selecting speakers for future meetings; advertising meetings, etc.

All are welcome regardless of R experience. We hope to run a range of meetings with some aimed at general or beginner topics and others delving more deeply into advanced areas.

If you might be interested in attending, we have a mailing list where meetings will be advertised, we’re on twitter (@edinb_r) and you can email the organisers at info@edinbr.org.

We hope to see you there!

Filed under R

Celebrity twitter followers by gender

The most popular accounts on twitter have millions of followers, but what are their demographics like? Twitter doesn’t collect or release this kind of information, and even things like name and location are only voluntarily added to people’s profiles. Unlike Google+ and Facebook, twitter has no real-name policy: they don’t care what you call yourself, because they can still divine useful information from your account activity.

For example, you can optionally set your location on your twitter profile. Should you choose not to, twitter can still just geolocate your IP. If you use an anonymiser or VPN, they could use the timing of your account activity to infer a timezone. This could then be refined to a city or town using the topics you tweet about and the locations of friends and services you mention most.

I chose to look at one small aspect of demographics, gender, and used a cheap heuristic based on stated first name to estimate the male:female ratio in a sample of followers from each of these very popular accounts.

Top 100 twitter accounts by followers

A top 100 list is made available by Twitter Counter. It’s not clear that they have made this list available through their API, but thanks to the markup, a quick hack is to scrape the usernames using RCurl and some regex:

require("RCurl")
top.100 <- getURL("http://twittercounter.com/pages/100")

# split into lines
top.100 <- unlist(strsplit(top.100, "\n"))
# Get only those lines with an @
top.100 <- top.100[sapply(top.100, grepl, pattern="@")]

# Grep out anchored usernames: <a ...>@username</a>
top.100 <- gsub(".*>@(.+)<.*", "\\1", top.100)[2:101]
head(top.100)
# [1] "katyperry"  "justinbieber"  "BarackObama"  ...

R package twitteR

Getting data from the twitter API is made simple by the twitteR package. I made use of Dave Tang’s worked example for the initial OAuth setup; once that’s complete, the twitteR package is really easy to use.

The difficulty in getting data from the API is, as ever, rate limits. Twitter allows 15 requests for follower information per 15-minute window. (Number of followers can be queried with a much more generous 180 requests per window.) This means that to get a sample of followers for each of the top 100 twitter accounts, it’ll take a minimum of 1 hour 40 minutes to stay on the right side of the rate limit (100 requests at 15 per 15-minute window). I ended up using 90-second sleeps between requests to be safe, making a total query time of two and a half hours!

Another issue is possibly to do with strange characters being returned and breaking the JSON import. This error crops up a lot, and meant I had to lower the sample size of followers to avoid including these problem accounts. After some highly unscientific tests, I settled on about 1000 followers per account, which seemed a good trade-off between maximising sample size and minimising failure rate.

# Try to sample 3000 followers for a user:
username$getFollowers(n=3000)
# Error in twFromJSON(out) :
#  Error: Malformed response from server, was not JSON.
# The most likely cause of this error is Twitter returning
# a character which can't be properly parsed by R. Generally
# the only remedy is to wait long enough for the offending
# character to disappear from searches.
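
Putting the pieces together, the collection loop looked something like this (a hedged reconstruction rather than my exact script; it assumes the OAuth setup above is complete):

library(twitteR)

followers <- list()
for (u in top.100) {
  user <- getUser(u)
  # skip accounts that trigger the JSON error above
  followers[[u]] <- tryCatch(user$getFollowers(n = 1000),
                             error = function(e) NULL)
  Sys.sleep(90)  # stay well within 15 follower requests per 15 minutes
}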

Gender inference

Here I used a relatively new R package, rOpenSci’s gender (kudos for resisting gendR and the like). This uses U.S. social security data to probabilistically link first names with genders, e.g.:

devtools::install_github("ropensci/gender")
require("gender")
gender("ben")
#   name proportion_male proportion_female gender
# 1  ben          0.9962            0.0038   male

So chances are good that I’m male. But the package also returns proportional data based on the frequency of appearances in the SSA database. Naively these can be interpreted as the probability a given name is either male or female. So in terms of converting a list of 1000 first names to genders, there are a few options:

  1. Threshold: if >.98 male or female, assign gender, else ignore.
  2. Probabilistically: use random number generation to assign each case; if a name is .95 male and .05 female, on average assign that name to females 5% of the time (sketched after this list).
  3. Bayesian-ish: threshold for almost certain genders (e.g. .99+) and use this as a prior belief of gender ratios when assigning gender to the other followers for a given user. This would probably lower bias when working with heavily skewed accounts.
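
A minimal sketch of option 2, assuming first_names holds a character vector of follower first names parsed from the sampled accounts:

library(gender)

assign_gender <- function(name) {
  res <- gender(tolower(name))
  if (nrow(res) == 0) return(NA_character_)  # name not in the SSA data
  # draw male/female with the name's observed proportions
  sample(c("male", "female"), size = 1,
         prob = c(res$proportion_male, res$proportion_female))
}

genders <- vapply(first_names, assign_gender, character(1))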

I went with #2 here. Anecdotal evidence suggests it’s reasonably accurate anyway, with twitter analytics (using bag of words, sentiment analysis and all sorts of clever tricks to unearth gender) estimating my account has 83% male followers (!), with probabilistic first name assignment estimating 79% (and that’s with a smaller sample). Method #3 may correct this further but the implementation tripped me up.

Results

Celebrity twitter followers by gender

So boys prefer football (soccer) and girls prefer One Direction, who knew? Interestingly Barack Obama appears to have a more male following (59%), as does Bill Gates with 67%.

At the other end of the spectrum, below One Direction, Simon Cowell is a hit with predominantly female twitter users (70%), as is Kanye West (67%) and Khloe Kardashian (72%).

Another surprise is that Justin Bieber, famed as a teen girl heartthrob, actually has broader gender appeal, with a 41/59 male-female split.

Interactive charts

Click for an interactive version.

Using the fantastic rCharts library, I’ve put together some interactive graphics that let you explore the above results further. These use the NVD3 graphing library, as opposed to my previous effort which used dimple.js.

The first of these is ordered by number of followers, and the second by gender split. The eagle-eyed among you will see that one account from the top 100 is missing from all these charts due to the JSON error I discuss above; thankfully it’s a boring one (sorry @TwitPic).

Where would your account be on these graphs? Somehow I end up alongside Wayne Rooney in terms of gender diversity :s

Caveats

  • A lot of the time genders can’t be called from an account’s first name. Maybe they haven’t given a first name, maybe it’s a business account or some pretty unicode symbols, maybe it’s a spammy egg account. This means my realised sample size is <<1000; sometimes the majority of usernames had no gender (e.g. @UberSoc, fake followers?).

    This (big) chart includes % for those that couldn’t be assigned (NA)

  • The SSA data is heavily biased towards Western (especially US) names, so non-English names are likely to go unassigned throughout. This is a shame; if you know of a more international gender DB, please let me know.
  • I’m sampling most recent followers, so maybe accounts like Justin Bieber have a much higher female ratio among earlier followers than among those who have only just hit the follow button.
  • The sample size of 1000 followers per account is smaller than I’d like, especially for accounts with 50 million followers.

If you have other ideas of what to do with demographics data, or have noticed additional caveats of this study, please let me know in the comments!


Full code to reproduce this analysis is available on Github.

Filed under R, Unrelated