Celebrity twitter followers by gender

The most popular accounts on twitter have millions of followers, but what are their demographics like? Twitter doesn’t collect or release this kind of information, and even things like name and location are only voluntarily added to people’s profiles. Unlike Google+ and Facebook, twitter has no real-name policy: they don’t care what you call yourself, because they can still divine useful information from your account activity.

For example, you can optionally set your location on your twitter profile. Should you choose not to, twitter can still just geolocate your IP. If you use an anonymiser or VPN, they could use the timing of your account activity to infer a timezone. This could then be refined to a city or town using the topics you tweet about and the locations of friends and services you mention most.

I chose to look at one small aspect of demographics, gender, and used a cheap heuristic based on stated first name to estimate the male:female ratio in a sample of followers from each of these very popular accounts.

Top 100 twitter accounts by followers

A top 100 list is made available by Twitter Counter. It’s not clear that they have made this list available through their API, but thanks to the markup, a quick hack is to scrape the usernames using RCurl and some regex:

require("RCurl")
top.100 <- getURL("http://twittercounter.com/pages/100")

# split into lines
top.100 <- unlist(strsplit(top.100, "\n"))
# Keep only those lines containing an @ (grepl is already vectorised)
top.100 <- top.100[grepl("@", top.100, fixed = TRUE)]

# Grep out anchored usernames: <a ...>@username</a>
top.100 <- gsub(".*>@(.+)<.*", "\\1", top.100)[2:101]
head(top.100)
# [1] "katyperry"  "justinbieber"  "BarackObama"  ...
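
To see what the regex step does to a single line, here is a minimal sketch. The HTML lines are made up to match the <a ...>@username</a> shape described above, not copied from the real page, and the stricter pattern at the end is an alternative I'm suggesting rather than the one used in the analysis.

```r
# A made-up line in the shape <a ...>@username</a> (an assumption, not real page markup)
html_line <- '<a href="/katyperry">@katyperry</a>'

# Same pattern as above: keep only what sits between ">@" and the last "<"
username <- gsub(".*>@(.+)<.*", "\\1", html_line)
username
# [1] "katyperry"

# Because (.+) is greedy, extra closing tags on the same line would be captured too.
# A stricter capture group that stops at the first "<" avoids this:
nested_line <- '<li><a href="/katyperry">@katyperry</a></li>'
strict <- gsub(".*>@([^<]+)<.*", "\\1", nested_line)
strict
# [1] "katyperry"
```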

R package twitteR

Getting data from the twitter API is made simple by the twitteR package. I made use of Dave Tang’s worked example for the initial OAuth setup. Once that’s complete, the twitteR package is really easy to use.

The difficulty in getting data from the API, as ever, is to do with rate limits. Twitter allows 15 requests for follower information per 15-minute window. (Number of followers can be queried at a much more generous 180 requests per window.) This means that getting a sample of followers for each of the top 100 twitter accounts takes a minimum of 1 hour 40 mins if you want to stay on the right side of the rate limit. I ended up using 90-second sleep windows between requests to be safe, making a total query time of two and a half hours!
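
The arithmetic behind those timings, plus a polite request loop, can be sketched as below. This is a rough sketch rather than the code used for the analysis: min_query_minutes, fetch_all and fetch_fun are hypothetical names, and fetch_fun stands in for whatever function makes a single follower request.

```r
# Minimum minutes needed for n_requests when `limit` requests are allowed per window
min_query_minutes <- function(n_requests, limit = 15, window_mins = 15) {
  n_requests * window_mins / limit
}

min_query_minutes(100)  # 100 minutes, i.e. 1 hour 40 mins
100 * 90 / 60           # 150 minutes (2.5 hours) with a 90 second sleep per request

# Hypothetical loop: call fetch_fun once per username, sleeping between requests
fetch_all <- function(usernames, fetch_fun, pause_secs = 90) {
  results <- vector("list", length(usernames))
  names(results) <- usernames
  for (i in seq_along(usernames)) {
    # wrap in list() so a NULL result doesn't delete the element
    results[i] <- list(fetch_fun(usernames[i]))
    if (i < length(usernames)) Sys.sleep(pause_secs)
  }
  results
}
```

With pause_secs = 90 and 100 usernames this sleeps 99 times, so it lands just under the two and a half hours quoted above.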

Another issue, possibly caused by strange characters in the response, is that the JSON import breaks. This error crops up a lot, and it meant I had to lower the sample size of followers to avoid including these problem accounts. After some highly unscientific tests, I settled on about 1000 followers per account, which seemed a good trade-off between maximising sample size and minimising failure rate.

# Try to sample 3000 followers for a user
# (`username` here is a twitteR user object, e.g. as returned by getUser()):
username$getFollowers(n=3000)
# Error in twFromJSON(out) :
#  Error: Malformed response from server, was not JSON.
# The most likely cause of this error is Twitter returning
# a character which can't be properly parsed by R. Generally
# the only remedy is to wait long enough for the offending
# character to disappear from searches.
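
One way to cope with this error is to catch it and retry with a smaller sample. The sketch below is an assumption about how that could look, not what I actually ran: sample_followers and get_followers are hypothetical names, get_followers is a placeholder for something like function(n) username$getFollowers(n = n), and the fallback sizes are arbitrary.

```r
# Hypothetical wrapper: try progressively smaller samples until one request parses
sample_followers <- function(get_followers, sizes = c(3000, 2000, 1000)) {
  for (n in sizes) {
    result <- tryCatch(get_followers(n), error = function(e) NULL)
    if (!is.null(result)) return(result)
  }
  NULL  # every attempt failed, so the caller should skip this account
}

# Stand-in for the API call: fails on large samples, like the error above
mock_get <- function(n) {
  if (n > 1000) stop("Malformed response from server, was not JSON.")
  seq_len(n)
}
length(sample_followers(mock_get))
# [1] 1000
```

Each retry is another rate-limited request, so in practice you'd still want to sleep between attempts.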

Gender inference

Here I used a relatively new R package, rOpenSci’s gender (kudos for resisting gendR and the like). This uses U.S. social security data to probabilistically link first names with genders, e.g.:

devtools::install_github("ropensci/gender")
require("gender")
gender("ben")
#   name proportion_male proportion_female gender
# 1  ben          0.9962            0.0038   male

So chances are good that I’m male. But the package also returns proportional data based on the frequency of appearances in the SSA database. Naively these can be interpreted as the probability a given name is either male or female. So in terms of converting a list of 1000 first names to genders, there are a few options:

  1. Threshold: if >.98 male or female, assign that gender, else ignore.
  2. Probabilistically: use random number generation to assign each case. If a name is .95 male and .05 female, on average assign that name to females 5% of the time.
  3. Bayesian-ish: threshold for almost certain genders (e.g. .99+) and use this as a prior belief of gender ratios when assigning gender to the other followers for a given user. This would probably lower bias when working with heavily skewed accounts.

I went with #2 here. Anecdotal evidence suggests it’s reasonably accurate anyway: twitter analytics (using bag of words, sentiment analysis and all sorts of clever tricks to unearth gender) estimates that my account has 83% male followers (!), while probabilistic first name assignment estimates 79% (and that’s with a smaller sample). Method #3 may correct this further, but the implementation tripped me up.
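
Options #1 and #2 could be sketched roughly as follows. This is a minimal sketch, not the exact code from the analysis: assign_threshold and assign_probabilistic are hypothetical helpers that take a vector of proportion_male values (NA where the name isn't in the database), and apart from "ben" the example proportions are made up.

```r
# Option 1: only call a gender when the name is near-unambiguous
assign_threshold <- function(p_male, cutoff = 0.98) {
  out <- rep(NA_character_, length(p_male))
  out[!is.na(p_male) & p_male > cutoff] <- "male"
  out[!is.na(p_male) & (1 - p_male) > cutoff] <- "female"
  out
}

# Option 2: draw each gender at random, weighted by the name's proportions
assign_probabilistic <- function(p_male) {
  out <- rep(NA_character_, length(p_male))
  known <- !is.na(p_male)
  out[known] <- ifelse(runif(sum(known)) < p_male[known], "male", "female")
  out
}

# Illustrative proportions only: 0.9962 is "ben" from above, the rest are invented
p_male <- c(0.9962, 0.95, 0.02, NA)
assign_threshold(p_male)
# [1] "male"   NA       "female" NA

set.seed(1)  # the probabilistic version differs from run to run without a seed
table(assign_probabilistic(p_male), useNA = "ifany")
```

The threshold version throws away ambiguous names entirely, whereas the probabilistic version keeps them and lets the errors average out across the sample.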

Results

[Figure: Celebrity twitter followers by gender]

So boys prefer football (soccer) and girls prefer One Direction, who knew? Interestingly Barack Obama appears to have a more male following (59%), as does Bill Gates with 67%.

At the other end of the spectrum, below One Direction, Simon Cowell is a hit with predominantly female twitter users (70%), as are Kanye West (67%) and Khloe Kardashian (72%).

Another surprise is that Justin Bieber, famed as a teen-girl heartthrob, actually has broader gender appeal, with a 41 / 59 male-female split.

Interactive charts

Click for an interactive version.

Using the fantastic rCharts library, I’ve put together some interactive graphics that let you explore the above results further. These use the NVD3 graphing library, as opposed to my previous effort which used dimple.js.

The first of these is ordered by number of followers, and the second by gender split. The eagle-eyed among you will see that one account from the top 100 is missing from all these charts due to the JSON error I discuss above. Thankfully it’s a boring one (sorry @TwitPic).

Where would your account be on these graphs? Somehow I end up alongside Wayne Rooney in terms of gender diversity :s

Caveats

  • A lot of the time genders can’t be called from an account’s first name. Maybe they haven’t given a first name, maybe it’s a business account or some pretty unicode symbols, maybe it’s a spammy egg account. This means my realised sample size is <<1000, and sometimes the majority of usernames had no gender (e.g. @UberSoc, fake followers?).

    This (big) chart includes % for those that couldn't be assigned (NA)

  • The SSA data is heavily biased towards Western (esp. US) names, and non-English names are unlikely to be assigned a gender at all. This is a shame; if you know of a more international gender DB please let me know.
  • I’m sampling the most recent followers, so maybe accounts like Justin Bieber have a much higher female ratio among earlier followers than among those who have only just hit the follow button.
  • The sample size of 1000 followers per account is smaller than I’d like, especially for accounts with 50 million followers.
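
To put a rough number on the sample size caveat, the usual normal approximation for a proportion gives a ballpark margin of error. This is only a sketch: margin_of_error is a hypothetical helper, it assumes a simple random sample (which "most recent followers" is not), and the realised sample size of 300 is an invented example rather than a figure from the data.

```r
# Approximate 95% margin of error for a proportion p estimated from n followers
margin_of_error <- function(p, n, z = 1.96) {
  z * sqrt(p * (1 - p) / n)
}

margin_of_error(0.59, 1000)  # about 0.03, i.e. roughly +/- 3 percentage points
margin_of_error(0.59, 300)   # about 0.056 if only 300 names could be assigned a gender
```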

If you have other ideas of what to do with demographics data, or have noticed additional caveats of this study, please let me know in the comments!


Full code to reproduce this analysis is available on GitHub.

Filed under R, Unrelated

18 responses to “Celebrity twitter followers by gender”

  2. Eden Myers

    It would be interesting to see what degree of overlap there was between the followings of the top 100…

  3. Adit

    I am a beginner to R, so please bear with me. Can you explain what gsub(".*>@(.+)<.*", "\\1", top.100) does?

    • Sure, gsub is for global substitution (could also be done with grep or sub). It replaces whatever matches the first argument (the pattern) with the second argument (the replacement) in a given character vector (see ?gsub in R). Here I’m using a regular expression to get the @username from HTML that looks like <a blah blah>@username</a>.

      The brackets define a capture group (“this is the part I want”) and the \1 references that group. So in an overly-complicated way I’m pulling out the twitter username and blanking everything else. This probably looks really weird if you’ve not used regular expressions before.

  5. Michael

    Since the 50% line is only relevant if there’s a 50:50 split of overall active twitter users/followers (it probably is, I guess), it would be interesting to try something to estimate the male:female ratio of twitter users in general to get a better idea of what a “neutral” result would be.

    Maybe for this, checking the portion of each gender in your entire dataset would be enough (or better?).

    • Yeah good point, overall in this dataset there’s a slightly higher proportion of females, but does that mean the top 100 just happen to contain more One Direction and pop star accounts than sportsmen or that there’s more females on twitter in general? Maybe there’s just more female names in the SSA database, or less variation than in male names. Interesting to think about but it could get confusing fast if you try to account for all these.

  11. Povonte

    Wow! I love sharing a glimpse into your creative perspective!

  12. I have followed twitter for a long time. I have seen this happen a lot, I have male twitter accounts and also female accounts. I am quite surprised to see the difference in amount of followers even after doing the same things in both the account.

  13. Xirux Nefer

    Which criteria did you use to choose the accounts to be tracked? It’s easy to fall in some bias there. Also, it is strongly based on US culture (it makes me shiver to know that U.S. social security data is so easy to access) so the results can be biased. I can imagine the results in Japan would be quite different for example, or can you imagine what could we get if we did it with data from the middle east? Or from researchers of the experimental nuclear physics community at CERN? (some of them don’t even have a twitter account). Anyway, it’s an interesting application of R! Looks like you can prove whatever you want (sometimes it’s easy to lie with statistics, he he) I’ll try to see if I can prepare some results to prove that red haired people prefer rum & raisins ice cream ;-)

    • I used the top 100 accounts by number of followers. Yes the name list has a US bias. I didn’t set out to prove / “lie” about gender proportions, just a bit of analysis for fun.

      • Xirux Nefer

        Thanks for the reply.
        Nah, when I said that it’s easy to lie with stats, I was referring to one of the most famous books in statistics “How to lie with Statistics”:

        http://en.wikipedia.org/wiki/How_to_Lie_with_Statistics

        Sometimes even experienced statisticians are carried along or confused by the data. It’s difficult to make a good analysis. For example, if you took the top 100 accounts by country, the results will be different. Remember that the majority of twitter users are from the US. So the bias is not only in the name list =)
