Category Archives: Unrelated

Celebrity twitter followers by gender

The most popular accounts on twitter have millions of followers, but what are their demographics like? Twitter doesn’t collect or release this kind of information, and even things like name and location are only voluntarily added to people’s profiles. Unlike Google+ and Facebook, twitter has no real-name policy: they don’t care what you call yourself, because they can still divine useful information from your account activity.

For example, you can optionally set your location on your twitter profile. Should you choose not to, twitter can still just geolocate your IP. If you use an anonymiser or VPN, they could use the timing of your account activity to infer a timezone. This could then be refined to a city or town using the topics you tweet about and the locations of friends and services you mention most.

I chose to look at one small aspect of demographics: gender, and used a cheap heuristic based on stated first name to estimate the male:female ratios in a sample of followers from these very popular accounts.

Top 100 twitter accounts by followers

A top 100 list is made available by Twitter Counter. It’s not clear that they have made this list available through their API, but thanks to the markup, a quick hack is to scrape the usernames using RCurl and some regex:

library(RCurl)

top.100 <- getURL("")

# split into lines
top.100 <- unlist(strsplit(top.100, "\n"))
# Get only those lines with an @ (grepl is already vectorised)
top.100 <- top.100[grepl("@", top.100)]

# Grep out anchored usernames: <a ...>@username</a>
top.100 <- gsub(".*>@(.+)<.*", "\\1", top.100)[2:101]
# [1] "katyperry"  "justinbieber"  "BarackObama"  ...

R package twitteR

Getting data from the twitter API is made simple by the twitteR package. I made use of Dave Tang’s worked example for the initial OAuth setup; once that’s complete, the twitteR package is really easy to use.

The difficulty in getting data from the API, as ever, is rate limits. Twitter allows 15 requests for follower information per 15-minute window. (Number of followers can be queried at a much more generous 180 requests per window.) This means that getting a sample of followers for each of the top 100 twitter accounts takes at least 1 hour 40 minutes if you stay on the right side of the rate limit. I ended up sleeping for 90 seconds between requests to be safe, making a total query time of two and a half hours!
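
As a rough sketch of that arithmetic, plus the kind of throttled loop it implies: the `throttled` helper and its `fetch` argument are hypothetical stand-ins for whatever call retrieves a follower sample, not part of twitteR.

```r
# Rate-limit arithmetic for the follower queries
n.accounts  <- 100  # top 100 accounts, one follower request each
window.mins <- 15   # length of a rate-limit window
per.window  <- 15   # follower requests allowed per window
sleep.secs  <- 90   # pause used between requests, to be safe

# Tightest spacing within the limit is one request per minute
min.total.mins <- n.accounts * window.mins / per.window
# Total time with the more cautious 90 second pause
safe.total.hours <- n.accounts * sleep.secs / 3600

# Hypothetical throttled loop: `fetch` is any function taking a username
throttled <- function(users, fetch, pause = sleep.secs) {
  out <- vector("list", length(users))
  names(out) <- users
  for (i in seq_along(users)) {
    # Record NA on failure so one bad response doesn't abort the run
    out[[i]] <- tryCatch(fetch(users[i]), error = function(e) NA)
    if (i < length(users)) Sys.sleep(pause)
  }
  out
}
```

Wrapping each request in `tryCatch` means a malformed response costs one account rather than the whole two and a half hours.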

Another issue, possibly caused by strange characters being returned, is the JSON import breaking. This error crops up a lot and meant that I had to lower the sample size of followers to avoid including these problem accounts. After some highly unscientific tests, I settled on about 1000 followers per account, which seemed a good trade-off between maximising sample size and minimising failure rate.

# Try to sample 3000 followers for a user:
# Error in twFromJSON(out) :
#  Error: Malformed response from server, was not JSON.
# The most likely cause of this error is Twitter returning
# a character which can't be properly parsed by R. Generally
# the only remedy is to wait long enough for the offending
# character to disappear from searches.

Gender inference

Here I used a relatively new R package, rOpenSci’s gender (kudos for resisting gendR and the like). This uses U.S. social security data to probabilistically link first names with genders, e.g.:

#   name proportion_male proportion_female gender
# 1  ben          0.9962            0.0038   male

So chances are good that I’m male. But the package also returns proportional data based on the frequency of appearances in the SSA database. Naively these can be interpreted as the probability a given name is either male or female. So in terms of converting a list of 1000 first names to genders, there are a few options:

  1. Threshold: if >.98 male or female, assign gender, else ignore.
  2. Probabilistically: use random number generation to assign each case. If a name is .95 male and .05 female, assign that name to female 5% of the time on average.
  3. Bayesian-ish: threshold for almost certain genders (e.g. .99+) and use this as a prior belief of gender ratios when assigning gender to the other followers for a given user. This would probably lower bias when working with heavily skewed accounts.

I went with #2 here. Anecdotal evidence suggests it’s reasonably accurate anyway: twitter analytics (using bag of words, sentiment analysis and all sorts of clever tricks to unearth gender) estimates that my account has 83% male followers (!), while probabilistic first-name assignment estimates 79% (and that’s with a smaller sample). Method #3 may correct this further, but the implementation tripped me up.
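
For the curious, method #2 boils down to something like the base R sketch below. The `assign.gender` helper is a hypothetical name, and apart from "ben" the proportions are made up for illustration; they just mimic the shape of the gender package output shown above.

```r
# Name proportions in the same shape as the gender package output;
# only the "ben" value comes from the example above, the rest are made up
followers <- data.frame(
  name            = c("ben", "alex", "sam", "emma"),
  proportion_male = c(0.9962, 0.95, 0.80, 0.003),
  stringsAsFactors = FALSE
)

# Assign "male" with probability proportion_male, otherwise "female";
# names with no match in the database (NA) stay NA
assign.gender <- function(p.male) {
  draw <- runif(length(p.male))
  ifelse(is.na(p.male), NA_character_,
         ifelse(draw < p.male, "male", "female"))
}

set.seed(1)
followers$gender <- assign.gender(followers$proportion_male)

# Estimated share of male followers among those that could be assigned
mean(followers$gender == "male", na.rm = TRUE)
```

Any single name can land on the wrong side, but over a sample of around 1000 followers the estimated ratio should settle near the expected one.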


Celebrity twitter followers by gender

So boys prefer football (soccer) and girls prefer One Direction, who knew? Interestingly Barack Obama appears to have a more male following (59%), as does Bill Gates with 67%.

At the other end of the spectrum, below One Direction, Simon Cowell is a hit with predominantly female twitter users (70%), as is Kanye West (67%) and Khloe Kardashian (72%).

Another surprise is that Justin Bieber, famed as a teen-girl heartthrob, actually has broader gender appeal, with a 41 / 59 male-female split.

Interactive charts

Click for an interactive version.

Click for an interactive version.

Using the fantastic rCharts library, I’ve put together some interactive graphics that let you explore the above results further. These use the NVD3 graphing library, as opposed to my previous effort which used dimple.js.

The first of these is ordered by number of followers, and the second by gender split. The eagle-eyed among you will see that one account from the top 100 is missing from all these charts due to the JSON error I discuss above. Thankfully it’s a boring one (sorry @TwitPic).

Where would your account be on these graphs? Somehow I end up alongside Wayne Rooney in terms of gender diversity :s


  • A lot of the time genders can’t be called from an account’s first name. Maybe they haven’t given a first name, maybe it’s a business account or some pretty unicode symbols, maybe it’s a spammy egg account. This means my realised sample size is <<1000; sometimes the majority of usernames had no gender (e.g. @UberSoc, fake followers?).

    This (big) chart includes % for those that couldn't be assigned (NA)

  • The SSA data is heavily biased towards Western (esp. US) names, so non-English names are unlikely to be assigned a gender at all. This is a shame; if you know of a more international gender DB please let me know.
  • I’m sampling the most recent followers, so maybe accounts like Justin Bieber have a much higher female ratio among earlier followers than among those who have only just hit the follow button.
  • The sample size of 1000 followers per account is smaller than I’d like, especially for accounts with 50 million followers.

If you have other ideas of what to do with demographics data, or have noticed additional caveats of this study, please let me know in the comments!

Full code to reproduce this analysis is available on Github.


Filed under R, Unrelated

What are the most overrated films?

“Overrated” and “underrated” are slippery terms to try to quantify. An interesting way of looking at this, I thought, would be to compare the reviews of film critics with those of Joe Public, reasoning that a film which is roundly lauded by the Hollywood press but proves disappointing for the real audience would be “overrated”, and vice versa.

To get some data for this I turned to the most prominent review aggregator: Rotten Tomatoes. All this analysis was done in the R programming language, and full code to reproduce it will be attached at the end.

Rotten Tomatoes API

This API is nicely documented, easy to access and permissive with rate limits, but cripplingly restrictive in what data it presents. Want a list of all films in the database? Nope. Most reviewed? Top rated? Highest box-office takings? Nope.

The related forum is full of what seem like simple requests that should be available through the API but aren’t: top 100 lists? Search using multiple IDs at once? Get audience reviews? All are unanswered or not currently implemented.

So the starting point (a big list of films) is actually kinda hard to get at. The Rube Goldbergian method I eventually used was this:

  1. Get the “Top Rentals” list of movie details (max: 50)
  2. Search each one for “Similar films” (max: 5)
  3. Get the unique film IDs from step 2 and iterate

(N.B. This wasn’t my idea but one from a post in the API forums; unfortunately I didn’t save the link.)

In theory this grows your set of films at a reasonable pace (up to fivefold per round), but in reality the number of unique films being returned was significantly lower (shown below). I guess this was due to pulling “walled gardens” into my dataset, e.g. if a Harry Potter film was hit, each further round would pull in the 5 other films as most similar.
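
The walled-garden effect is easy to reproduce with a toy stand-in for the API. Everything in this sketch is made up for illustration: the film IDs, the `similar` lookup table and the `get.similar` / `expand` helpers are hypothetical, not Rotten Tomatoes data or calls.

```r
# Toy "similar films" lookup: a small, mostly closed neighbourhood
similar <- list(
  a = c("b", "c"), b = c("a", "c"), c = c("a", "b", "d"),
  d = c("c", "e"), e = c("d")
)
get.similar <- function(id) similar[[id]]

# One round: look up every ID in the current set, keep the unique results
expand <- function(ids) {
  unique(unlist(lapply(unique(ids), get.similar)))
}

# Iterate, tracking how many distinct films have been seen so far
seen <- character(0)
current <- "a"
for (round in 1:4) {
  current <- expand(current)
  seen <- union(seen, current)
  cat("round", round, ":", length(current), "returned,",
      length(seen), "unique so far\n")
}
```

Once every film in a cluster points back into the same cluster, further rounds return the same IDs and the unique count stops growing.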

Films returned


Here’s an overview of the critic and audience scores I collected through the Rotten Tomatoes API, with some outliers labelled.

Most over- and underrated films

On the whole it should be noted that critics and audience agree most of the time, as shown by the Pearson correlation coefficient between the two scores (0.71 across >1200 films).

Click for interactive version.


I’ve put together an interactive version of the same plot here using the rCharts R package. It’ll show film title and review scores when you hover over a point so you know what you’re looking at. Also I’ve more than doubled the size of the film dataset by repeating the above method for a couple more iterations — take a look!

Most underrated films

Using our earlier definition it’s easy to build a table of those films where the audience ended up really liking a film that was panned by critics.
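
In miniature, the ranking works like this. The titles and scores below are made up purely for illustration; the `crit`, `aud` and `diff` column names match those used in the full code at the end of the post.

```r
# Made-up scores for illustration only (not from the API)
films <- data.frame(
  title = c("Film A", "Film B", "Film C", "Film D"),
  crit  = c(20, 90, 55, 85),
  aud   = c(80, 50, 60, 88),
  stringsAsFactors = FALSE
)

# Positive difference: critics liked it more than audiences ("overrated")
films$diff <- films$crit - films$aud

underrated <- films[order(films$diff), ]                     # most negative first
overrated  <- films[order(films$diff, decreasing = TRUE), ]  # most positive first

head(underrated, 2)

# Agreement between the two sets of scores
cor(films$crit, films$aud)
```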

Scores are shown out of 100 for both aggregated critics and members of Rotten Tomatoes.

Somewhat surprisingly, the top of the table is Facing the Giants (2006), an evangelical Christian film. I guess non-Christians might have stayed away, and presumably it struck a chord within its target demographic — but after watching the trailer, I’d probably agree with the critics on this one.

This showed that some weighting of the difference might be needed, at the very least weighting by number of reviews, but the Rotten Tomatoes API doesn’t provide that data.

In addition, the Rotten Tomatoes page for the film shows a “want to see” percentage rather than an audience score. This came up a few times and I’ve seen no explanation for it. Presumably the “want to see” rating is for unreleased films, but the API also returns a separate (and undisclosed?) audience score for these films.

Above shows a “want to see” rating, different to the “liked it” rating returned by the API and shown below. Note: these screenshots are not CC licensed and are shown here under a claim of Fair Use, reproduced for comment/criticism.

Looking over the rest of the table, it seems the public is more fond of gross-out or slapstick comedies (such as Diary of a Mad Black Woman (2005), Grandma’s Boy (2006)) than the critics. Again, not films I’d jump to defend as underrated. Bad Boys II however…

Most overrated films

Here we’re looking at those films which the critics loved, but which left paying audiences less enthused.

As before, scores are out of 100 and they're ranked by difference between audience and critic scores.

Strangely, the top 15 (by difference) contains both the original 2001 Spy Kids and the sequel Spy Kids 2: The Island of Lost Dreams (2002). What did critics see in these films that the public didn’t? One possibility is bias in the audience reviews collected: the target audience for these films is young children, who are probably underrepresented amongst Rotten Tomatoes reviewers. Maybe there’s even an enrichment for disgruntled parent chaperones.

Thankfully, though, this table also contains the type of film we might more readily associate with being “overrated” by critics. Momma’s Man (2008) is an indie drama that debuted at the 26th Torino Film Festival. Essential Killing is a 2010 drama and political thriller from Polish director and screenwriter Jerzy Skolimowski.

There’s also a smattering of Rom-Coms (Friends with Money (2006), Splash (1984)) — if the API returned genre information it would be interesting to look for overall trends but, alas. Additional interesting variables to consider might be budget, the lead, reviews of producer’s previous films… There’s a lot of scope for interesting analysis here but it’s currently just not possible with the Rotten Tomatoes API.

Caveats / Extensions

The full code will be posted below, so if you want to do a better job with this analysis, please do so and send me a link! 🙂

  • Difference is too simple a metric. A better measure might be weighted by (e.g.) critic ranking. A film that critics give 95% but audiences 75% might be more interesting than a film rated 60/40, despite the same points difference.
  • There’s something akin to a “founder effect” of my initial chosen films that makes it hard to diversify the dataset, especially to films from previous decades and classics.
  • The Rotten Tomatoes API provides an IMDB id for cross-referencing, maybe that’s a path to getting more data and building a better film list.
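
To make the first caveat concrete, here is one hypothetical weighting (an assumption for illustration, not something I used in the analysis): scale the raw gap by the critic score, so that a gap on a highly-rated film counts for more.

```r
# Two films with the same 20-point gap, as in the first caveat
scores <- data.frame(crit = c(95, 60), aud = c(75, 40))
scores$diff <- scores$crit - scores$aud

# Hypothetical weighting: scale the gap by how highly critics rated the film
scores$weighted <- scores$diff * scores$crit / 100
scores
```

Under this weighting the 95/75 film scores higher than the 60/40 one even though the raw differences are identical.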

Full code to reproduce analysis

Note: If you’re viewing this on r-bloggers, you may need to visit the Benomics version to see the attached gist.

library(RCurl)        # getURI
library(jsonlite)     # fromJSON; assumed, given the rt$movies$title access below
library(ggplot2)
library(RColorBrewer) # brewer.pal
library(scales)       # muted
library(grid)         # grid.draw
library(gridExtra)    # tableGrob

api.key <- "yourAPIkey"
rt <- getURI(paste0("", api.key, "&limit=50"))
rt <- fromJSON(rt)

title <- rt$movies$title
critics <- rt$movies$ratings$critics_score
audience <- rt$movies$ratings$audience_score
df <- data.frame(title=title, critic.score=critics,
                 audience.score=audience)

# Top 50 rentals, max returnable
ggplot(df, aes(x=critic.score, y=audience.score)) +
  geom_text(aes(label=title, col=1/(critic.score - audience.score)))

# how can we get more? similar chaining
# STILL at most 5 per film (sigh)
getRatings <- function(id){
  sim.1 <- getURI(paste0("",
                         id, "/similar.json?apikey=",
                         api.key, "&limit=5"))
  sim <- fromJSON(sim.1)
  d <- data.frame(id = sim$movies$id,
                  title = sim$movies$title,
                  crit = sim$movies$ratings$critics_score,
                  aud = sim$movies$ratings$audience_score)
  return(d)
}

# Fetch similar films for each unique id and stack the results
rt.results <- function(idlist){
  r <- sapply(unique(as.character(idlist)), getRatings, simplify=F)
  r <- do.call(rbind, r)
  return(r)
}

# Maybe this could be done via a cool recursion using Recall
r1 <- rt.results(rt$movies$id)
r2 <- rt.results(r1$id)
r3 <- rt.results(r2$id)
r4 <- rt.results(r3$id)
r5 <- rt.results(r4$id)
r6 <- rt.results(r5$id)
r7 <- rt.results(r6$id)

# Theoretical maximum after x recursions (5^x, as in the legend below)
f <- function(x) 5^x

# Fig. 1: Number of films gathered via recursive descent
# of 'similar films' lists.
pdf(4, 4, file="rottenTomatoHits.pdf")
par(cex.axis=.7, pch=20, mar=c(4,3,1,1), mgp=c(1.5,.3,0), tck=.02)
plot(1:7, f(1:7), type="b", xlab="Recursions", ylab="Number of hits",
     log="y", col=muted("blue"), lwd=2, ylim=c(4, 1e5))
lines(1:7, c(nrow(r1), nrow(r2), nrow(r3), nrow(r4), nrow(r5),
             nrow(r6), nrow(r7)), type="b", col=muted("red"), lwd=2)
legend("bottomright", col=c(muted("blue"), muted("red")), pch=20, lwd=2,
       legend=c(expression(Max~(5^x)), "Realised"), bty="n", lty="47")
dev.off()

r <- rbind(r1, r2, r3, r4, r5, r6, r7)
# 1279 unique films
ru <- r[!duplicated(as.character(r$id)),]

# Films with insufficient critics reviews get -1 score
ru[which(ru$crit == -1),]
ru <- ru[ru$crit != -1,]

# Positive diff: critics scored higher than the audience ("overrated")
ru$diff <- ru$crit - ru$aud
pcc <- cor(ru$crit, ru$aud)

# Overview figure: Show all critics vs. audience scores
# and highlight the titles of outliers
svg(7, 6, file="overview.svg")
ggplot(ru, aes(x=crit, y=aud, col=diff)) +
  geom_point() +
  coord_cartesian(xlim=c(-10,110), ylim=c(-10,110)) +
  scale_color_gradientn(colours=brewer.pal(11, "RdYlBu"),
                        breaks=seq(-60, 40, length.out=11),
                        labels=c("Underrated", rep("", 4),
                                 "Agree", rep("", 4),
                                 "Overrated")) +
  geom_text(aes(label=ifelse(diff < quantile(diff, .005) |
                             diff > quantile(diff, .995), as.character(title), "")),
            hjust=0, vjust=0, angle=45) +
  scale_size_continuous(range=c(2,4), guide="none") +
  labs(x="Critic's score", y="Audience score", col="") +
  annotate("text", 3, 3,
           label=paste0("rho ==", format(pcc, digits=2)),
           parse=TRUE)
dev.off()

tab <- ru
colnames(tab) <- c("id", "Title", "Critics", "Audience", "Difference")

# Most underrated films (id column dropped; rows=NULL hides row names
# in gridExtra >= 2.0):
grid.draw(tableGrob(tab[order(tab$Difference),][1:15,-1], rows=NULL))

# Most overrated:
grid.draw(tableGrob(tab[order(tab$Difference, decreasing=T),][1:15,-1], rows=NULL))




Filed under R, Unrelated

My New Year’s Resolution(s)

New Year’s resolutions: meaningless, silly and forgotten by February. With that in mind, and in no particular order, here are mine for 2013:

1) Begin learning a functional programming language (Clojure, Haskell, ML, OCaml or other). I’m really interested in functional programming for several reasons, some more sensible than others. Nowadays I spend a lot of time using R, and its functional aspects are powerful and intuitive for mathematical programming, so a deeper understanding of the FP paradigm will likely improve my grasp of R. More importantly, I expect FP to grow in significance as parallelisation becomes ever more vital in ‘big data’ computational biology.


Clojure anyone?

Also, as a purely academic exercise, I think lambda calculus frames an aesthetically pleasing syntax and fosters interesting programming approaches. Lastly (and perhaps least importantly) I think it could be a quirky and interesting addition to my CV, as well as making me a more “well-rounded” programmer — for all that’s worth.

2) Develop my informal science writing — While regular blogging is on the backburner for now, I still think it’s important to practice writing about science for a general audience. One way of doing this is through competitions, such as that recently run by Europe PubMed Central, which I always seem to bookmark but not get around to entering. So note to self: follow up with these this year.

3) Work on web development. I’ve never made a true web app or used JavaScript or PHP in anger, but I’m increasingly aware that this is something I’ll need to get to grips with sooner or later. I’ve got a testing account for the new RStudio glimmer web server which serves Shiny apps, so there’s an easy way to get started.

In line with this resolution, I also plan to tie down some actual real estate: a nice domain name and some hosting, which would presumably encourage both my web development and blogging. I’m particularly interested in the new .bio generic top-level domain, which is due for release in 2013; presumably it’s designed for biographies but could also work well for a biologist such as myself 😉

4) There’s more to life than a PhD — keep this in mind.

Edinburgh castle – awesome picturesque scene a lot closer to the city centre than you might think.[1]


Filed under Musings, Unrelated