Category Archives: Wikipedia

Meticulously recreating bitmap plots in R

There’s a hard-fought drive on Wikimedia Commons to convert those images that should be in vector format (i.e. graphs, diagrams) from their current bitmap form. At the time of writing, the relevant category, “Images that should use vector graphics”, holds over 7000 images.

The usual way people move between the two is by tracing over the raster, and great tools like Inkscape (free open-source software) can help a lot with this. But in the case of graphs I thought it’d be fun to try and rebuild a carbon copy from scratch in R.

The original

This is the original bitmap plot I wanted to recreate. (Courtesy of Wikimedia Commons)

The file that first caught my eye was this nice graph of US employment stats, currently used on the highly trafficked Obama article. I’m not sure what drew this originally. It doesn’t look like Excel because of the broken axis and annotations, but maybe it is. It’s currently a PNG at about 700 × 500, so it should be an easy target for improvement.

Figure 2.0

The two raw data files are available here and here as Excel spreadsheets. They have some weird unnecessary formatting so the various xls parsers for R won’t work; save the tables from Excel as csv. I won’t talk through the code as it wasn’t too taxing (or clean) but it’ll be at the end of the post. Here’s what I came up with:

I realise the irony in having to upload a bitmap version for wordpress, but click for the SVG.

I expanded my plot to include the 2013 data, so it inescapably has slightly different proportions to the original. I was also working on a single monitor at the time, so I didn’t have a constant side-by-side comparison. I can see now that a few things are still off (the fonts are differently sized, for one, and I ditched the broken axis), but overall I think it’s a decent likeness!
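For a flavour of the approach without the full script, here is a minimal base-graphics sketch of a two-axis plot written straight to SVG. It is not the code behind the figure above: the data are made up, the column names (`job_change`, `unemp_rate`) are hypothetical, and the choice of bars for one series and a line for the other is an assumption.

```r
# Made-up stand-in data; the real figures would come from the two csv exports.
jobs <- data.frame(
  month      = seq(as.Date("2008-01-01"), by = "month", length.out = 24),
  job_change = round(-400 * sin(seq(0, 3, length.out = 24))),  # thousands, illustrative
  unemp_rate = seq(5, 9.5, length.out = 24)                    # per cent, illustrative
)

out_file <- file.path(tempdir(), "employment.svg")
svg(out_file, width = 7, height = 5)   # vector device instead of a bitmap one
par(mar = c(4, 4, 2, 4) + 0.1)         # leave room for the right-hand axis

# Bars against the left-hand axis; barplot() returns the bar midpoints
mids <- barplot(jobs$job_change, col = "steelblue", border = NA,
                ylab = "Monthly job change (thousands)")

# Overlay the second series on its own y scale, reusing the bars' x range
par(new = TRUE)
plot(mids, jobs$unemp_rate, type = "l", col = "firebrick", lwd = 2,
     xlim = par("usr")[1:2], xaxs = "i",
     axes = FALSE, xlab = "", ylab = "")
axis(4)
mtext("Unemployment rate (%)", side = 4, line = 2.5)

# Label every sixth month on the shared x axis
idx <- seq(1, nrow(jobs), by = 6)
axis(1, at = mids[idx], labels = format(jobs$month[idx], "%b %Y"))
dev.off()
```

Taking the x range from `par("usr")` after the first plot is what keeps the line aligned with the bars; without it the second `plot()` call would pick its own limits.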

ggplot2 version

Two y-axes on the same graph is bad, bad, bad and unsurprisingly forbidden with ggplot2, but I did come across this method of dummy-facetting and then plotting separate layers per facet. An obvious problem is that the y-axes now represent different things and you only have one label. A hacky fix is to write your ylabs into the facet header (I’m 100% confident Hadley Wickham and Leland Wilkinson would not be impressed with this). Another alternative would be to map a colour aesthetic to your y-axis values and label it in the legend (again, pretty far from recommended practice).
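The dummy-facetting trick can be sketched roughly as follows. This is an illustration rather than the code used for the figure: the data are invented, and the names `panel`, `lab_jobs` and `lab_rate` are hypothetical. Each layer gets its own data frame tagged with the facet it belongs to, so it is drawn in that panel only, and the facet labels stand in for the y-axis titles.

```r
library(ggplot2)

# Invented data purely for illustration
df <- data.frame(
  month      = seq(as.Date("2008-01-01"), by = "month", length.out = 24),
  job_change = round(-400 * sin(seq(0, 3, length.out = 24))),
  unemp_rate = seq(5, 9.5, length.out = 24)
)

# Facet labels double as y-axis titles (the hacky fix)
lab_jobs <- "Monthly job change (thousands)"
lab_rate <- "Unemployment rate (%)"
panels   <- c(lab_jobs, lab_rate)

# One data frame per layer, each carrying the dummy facetting variable
jobs <- data.frame(month = df$month, value = df$job_change,
                   panel = factor(lab_jobs, levels = panels))
rate <- data.frame(month = df$month, value = df$unemp_rate,
                   panel = factor(lab_rate, levels = panels))

p <- ggplot(mapping = aes(x = month, y = value)) +
  geom_col(data = jobs, fill = "steelblue") +      # drawn in the first facet only
  geom_line(data = rate, colour = "firebrick") +   # drawn in the second facet only
  geom_smooth(data = rate, method = "loess", formula = y ~ x, se = FALSE) +
  facet_grid(panel ~ ., scales = "free_y") +       # independent y scale per panel
  labs(x = NULL, y = NULL)

print(p)
```

Because both panels share the x axis, the two series stay aligned in time, which is most of what the double y-axis was being used for in the first place.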

This is what I ended up with. I still think it’s a reasonable alternative to the above, and the loess fit nicely shows the unemployment rate trend without the seasonal effects:


Article version

While mimicking the original exactly was fun (for me at least), I tried to improve upon it for the actual final figure for use on Wikipedia. For instance, it now uses unambiguous month abbreviations, and I swapped the legend for colour-coded text labels. It still has some of the original’s charm though. Looks like after a bit of a rough patch, your employment statistics are starting to look pretty good Mr. President.


Next up, the other much less attractive figures on that page ([1], [2]).

Filed under R, Wikipedia

Wikipedia is dead, long live the ‘pedia

I was a bit surprised when looking at the Wikipedia pageviews for 2013 (nicely presented here). After 5 years of consistent and reasonably stable growth, monthly pageviews actually dropped over 2013, to the tune of 2 *billion* views (10%) from their peak early in the year.

This was surprising to me. The problem Wikipedia has attracting new editors has been well-publicised, but it’s never had trouble with PageRank or increasing its reach to casual viewers.

Well, it turns out one area seeing consistent and healthy growth is, as you would guess, mobile views, which are showing gains of about 150k pageviews a month on English Wikipedia. This makes up for almost half a billion of those lost over 2013 in the graph above, but still leaves some explaining to do.


Interestingly, another useful metric of web traffic, unique visitors per month, continues to grow considerably. Maybe this reflects how mobile visitors use the site differently, just looking something up (e.g. to settle an argument) and closing their browser, as opposed to spending a few hours going from topic to topic and ending up admiring a list of Eiffel Tower replicas.


A quick graph of mean monthly pageviews per visitor gives this theory some support, but the data seems pretty noisy and has varied a lot over the past few years.
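The metric itself is just total pageviews divided by unique visitors for each month. A minimal sketch, using placeholder numbers rather than the real Wikimedia statistics and hypothetical column names, with a three-month moving average as one possible way of damping the noise:

```r
# Placeholder figures, NOT the real Wikimedia statistics
traffic <- data.frame(
  month     = seq(as.Date("2013-01-01"), by = "month", length.out = 6),
  pageviews = c(20.0, 19.5, 19.8, 19.0, 18.6, 18.2) * 1e9,
  visitors  = c(480, 485, 492, 495, 503, 510) * 1e6
)

# Mean monthly pageviews per unique visitor
traffic$views_per_visitor <- traffic$pageviews / traffic$visitors

# Centred three-month moving average; the first and last months come out as NA
traffic$smoothed <- as.numeric(
  stats::filter(traffic$views_per_visitor, rep(1 / 3, 3), sides = 2)
)

plot(traffic$month, traffic$views_per_visitor, type = "b",
     xlab = NULL, ylab = "Pageviews per unique visitor")
lines(traffic$month, traffic$smoothed, col = "firebrick", lwd = 2)
```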

Another possibility is that this data is telling us what we already know: the unique visitors with the highest total page views must be the article writers and the Wikignomes that built the place — and they’ve been in precipitous decline for nearly 6 years now. I’m speculating of course, but maybe that’s starting to show through on the page views site-wide, emphasising how much work a small group of people have been putting in, and the dent they’re leaving in Wikipedia as they leave.

Have I missed something? Do you have a better idea of why pageviews fell over 2013? Let me know!

Full R code to reproduce the graphs shown in this post is here:


Filed under Wikipedia

How to tell if a scientist has written their own Wikipedia page

I remember being a young aspiring scientist and thinking it’d be great to have a Wikipedia page written about me one day… So I was somewhat surprised a few years later to hear notable scientists actively discouraging keen editors from creating one in their name. Wikipedia has an essay titled “An article about yourself is nothing to be proud of”, which lays out the arguments for and against, emphasising that you as the subject won’t have control over the article’s contents, the good or the bad. What might start out as a brilliant-looking CV could later sprout a “Criticisms” or “Controversy” section, and chances are that’ll be the page that tops Google’s search ranking.

Another entirely terrible idea is to write one about yourself.

In addition to the reasons listed in the aforementioned essay, there’s another which is, in my view, more important: it’s entirely obvious that you have done so, and the evidence is almost always permanently available.

I’ve come across a small number of scientists’ biographies (BLPs in Wikijargon) which are unambiguously self-written and a few more which are probably written by friends or colleagues.* It’s perhaps more common in my field of Computational Biology, as you have somewhat computer-savvy scientists looking to climb the academic career ladder; besides, if most casual readers don’t realise it’s self-written, the professor or department head you’re aiming to impress probably won’t either. There’s little actual harm done to the project through doing this. It’s strongly discouraged, but if you meet specific “notability guidelines” and write from a neutral point of view, it will likely go unnoticed. Having said that, there’ll always be a few tells:

1) Article History – if you’re just a casual Wikipedia reader you might not often notice the “View History” tab attached to each article. By clicking this you see the entire list of page contributors, with diffs that highlight the changes they made (along with the time and date). Yeah, it’s version control for Wikipedia articles. 


This is a permanent record of every change since the article’s creation—so an obvious first check is who created the article? Did a single user write the majority of it?


“User contributions”

2) User contributions – If the primary contributor or article creator is an IP address, this doesn’t necessarily add evidence to either side. James D. Watson’s somewhat modest article was started by an IP address back in 2001, and of course there’s no reason to suggest that was the man himself. Whether the edits were made by a logged-in Wikipedia editor (one with a username listed) or not, you can click their username and after reaching their userpage, the “Toolbox” can be used to inspect their “User contributions”. This lists all edits ever made by this editor or IP address.

At this point it’s easy to rule out non-autobiographical articles if the editor has made thousands of edits or written heaps of articles. If the article is self-written, however, it may be the editor’s only contribution. Even more telling is if they made only a couple of other related edits: say, adding their name to their institution’s article. Their work completed, the autobiographer will often then flee the scene, likely with no other substantial contributions to the encyclopaedia.

3) IPs aren’t anonymous – One of the reasons editors may want to create an account is to hide their IP address, which is otherwise recorded with every edit. Geolocation web apps can attempt to map an IP address to an approximate physical location, with varying success. Of course, dynamic IPs are used by most home ISPs so any geodata in such cases is likely useless, but most institutions will have a fixed line with a static external IP or two. In fact, the “talk pages” of large institutional IP addresses are usually tagged as such, due to previous high levels of vandalism (here’s a high school example). Again, if the subject and institution match up, it’s circumstantial but by no means irrefutable evidence.

4) Article Content – Wikipedia has a core policy of “verifiability” and this is demonstrated through “reliable sources”; but self-written articles may contain substantial unreferenced information. That’s not to say bad referencing means self-written, but it is a touch curious if an article with no known sources can give personal details about where a subject grew up, or how many kids he or she may have. Lengthy lists of publications are also not a sight commonly found in BLP articles, but often are on a self-written promotional page. Even things like birthdays and years are not commonly found on staff pages, so they raise some suspicion when added without references.

Of course, articles with promotional content may well be written by a colleague or a keen grad student rather than the subject themselves, but it takes a special effort to write an article as a non-regular contributor so any article written by a fly-by-night editor or IP address may well have been written by someone with a conflict of interest.

5) All of the above – Taken together, the “things to look out for” above represent quite a solid body of evidence that the article you’re looking at is self-written. To summarise: a short article with few (if any) references, but surprisingly personal knowledge, written by a throwaway account which made no other contributions, with the only other content edits coming from a corresponding institutional IP address, updating a publications list… is in all likelihood an autobiography written for promotional purposes.
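For the curious, the history-based tells can be roughed out in a few lines of R. This is only a sketch under stated assumptions: the revision history and edit counts below are invented (in practice they could be pulled from the MediaWiki API, e.g. `action=query` with `prop=revisions`, but no network call is made here), the usernames are hypothetical, and the thresholds (more than half the edits, fewer than 10 edits overall) are arbitrary choices of mine rather than any Wikipedia rule.

```r
# Invented revision history for one article, one row per edit
revisions <- data.frame(
  user = c("Throwaway123", "Throwaway123", "192.0.2.17",
           "Throwaway123", "SomeGnome"),
  timestamp = as.POSIXct(c("2012-03-01 10:00", "2012-03-01 10:20",
                           "2012-03-02 09:15", "2012-03-05 14:40",
                           "2013-01-10 08:00"), tz = "UTC"),
  stringsAsFactors = FALSE
)

# Invented site-wide edit counts for each contributor
edit_counts <- c(Throwaway123 = 4, "192.0.2.17" = 2, SomeGnome = 15000)

# Tell 1: who created the article, and did one user write most of it?
creator   <- revisions$user[which.min(revisions$timestamp)]
share     <- sort(table(revisions$user), decreasing = TRUE) / nrow(revisions)
top_user  <- names(share)[1]
top_share <- as.numeric(share[1])

# Tell 3: crude IPv4 check to spot edits made while logged out
is_ip <- function(x) grepl("^([0-9]{1,3}\\.){3}[0-9]{1,3}$", x)

tells <- c(
  creator_wrote_most = creator == top_user && top_share > 0.5,  # arbitrary cut-off
  creator_few_edits  = edit_counts[[creator]] < 10,             # arbitrary cut-off
  ip_edits_present   = any(is_ip(revisions$user))
)
print(tells)
```

None of these flags means much alone, and they say nothing about article content (tell 4), which still needs a human read.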

What’s more, the evidence is there forever for anyone to see, given they have a few minutes to waste. My advice: tempting though it may be to write yourself an article, don’t do it! (…Drop a hint to your grad student.)

*Note: I debated using the real examples I’d found of self-written pages but in the end decided against it, as it’s not my intention to discredit decent scientists for a bit of harmless self-promotion.

Filed under Wikipedia