There seems to be a general consensus that author lists in academic articles are growing. Wikipedia says so, and I’ve also come across a published letter and a short Nature article which accept that this is the case and discuss ways of mitigating the issue. Recently there was an interesting discussion on academia.stackexchange on the subject, but again without much quantification. Luckily, given the array of literature database APIs and language bindings available, it should be pretty easy to investigate with some statistical analysis in R.

## rplos

ROpenSci offers nice R language bindings for the PLOS (I’m more used to PLoS but I’ll go with it) API, called rplos. There’s no particular reason to limit the search to PLOS journals, but rplos seems significantly more straightforward to work with than the PubMed API packages I’ve used in the past, like RISmed.

Additionally the PLOS group contains two journals of particular interest to me:

- PLOS Computational Biology: a respectable specialist journal in my field; have bioinformatics articles been particularly susceptible to author inflation?
- PLOS ONE: the original mega-journal. I wonder if the huge number of articles published here shows different trends in authorship over time.

The only strange part of the search was at the PLOS API end. To search by the `publication_date` field you need to supply both the start and the end of a date range, in the form:

`publication_date:[2001-01-01T00:00:00Z TO 2013-12-31T23:59:59Z]`

A tad verbose, right? I can’t imagine wanting to search for things published at a particular time of day. A full PLOS API query using the rplos package looks something like this:

```r
searchplos(
  # Query: publication date in 2012
  q = 'publication_date:[2012-01-01T00:00:00Z TO 2012-12-31T23:59:59Z]',
  # Fields to return: id (doi) and author list
  fl = "id,author",
  # Filter: only actual articles in journal PLOS ONE
  fq = list("doc_type:full", "cross_published_journal_key:PLoSONE"),
  # 500 results (max 1000 per query)
  start=0, limit=500, sleep=6)
```
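Since the date range is the fiddly part of the query, it can be generated from a year rather than typed out. This is only a minimal sketch: `year_range_query` is a hypothetical helper name, not part of rplos, and all it does is assemble the string format shown above.

```r
# Hypothetical helper (not part of rplos): build the date-range query
# string for one or more calendar years.
year_range_query <- function(year) {
  year <- as.integer(year)
  sprintf("publication_date:[%d-01-01T00:00:00Z TO %d-12-31T23:59:59Z]",
          year, year)
}

# One query string per year, e.g. for the 2006 to 2013 window
queries <- year_range_query(2006:2013)
queries[1]
```

The resulting strings could then be passed as the `q` argument of a call like the one above.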

A downside of using the PLOS API is that its journals are all quite recent: PLOS ONE started in 2006 and PLOS Biology only a few years earlier in 2003, so it’ll only give us a limited window into any long-term trends.

## Distribution of author counts

Before looking at inflation we can compare the distribution of author counts per paper between the journals:

Possibly more usefully, if less prettily, the same data can be plotted as empirical cumulative distribution functions (ECDFs). From these we can see that PLOS Biology had the highest proportion of single-author papers in my sample (n = ~22,500 articles overall), followed by PLOS Medicine, with PLOS Genetics showing more high-author papers in the long tail of the distribution, including the paper with the most authors in the sample (Couch et al., 2013, with 270 authors).
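To make the ECDF reading concrete, here is a small sketch using base R’s `ecdf()`. The author counts are made-up toy values for illustration, not the PLOS sample, and `author_ecdf` is just a name chosen here.

```r
# Toy author counts (made up for illustration, not the PLOS sample)
counts <- c(1, 1, 2, 3, 3, 4, 5, 8, 13, 270)

# ecdf() returns a step function giving the proportion of values <= x
author_ecdf <- ecdf(counts)

# Proportion of single-author papers: the height of the curve at x = 1
single_author <- author_ecdf(1)

# Proportion of papers in the long tail, with more than 10 authors
long_tail <- 1 - author_ecdf(10)

c(single_author = single_author, long_tail = long_tail)
```

A journal whose curve starts higher at x = 1 has more single-author papers; one whose curve approaches 1 more slowly has a heavier tail.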

## Author inflation

So, in these 6 different journals published by PLOS, how has the mean number of authors per paper varied across the past 7 years?

Above I’ve shown yearly means plus their 95% confidence intervals, as estimated by a non-parametric bootstrap method implemented in ggplot2. Generally from this graph it does look like there’s a slight upward trend on average, though arguably the mean is not the best summary statistic to use for this data, which I’ve shown is not normally distributed, and may better fit an extreme value distribution.
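For anyone unfamiliar with bootstrapped confidence intervals, the idea can be sketched in a few lines of base R. This is in the same spirit as the `mean_cl_boot` summary used for the plot, not a reimplementation of it, and the author counts are simulated (an assumption made purely for illustration).

```r
set.seed(1)

# Simulated right-skewed author counts (illustration only, not real data)
counts <- rgeom(500, prob = 0.2) + 1

# Resample the papers with replacement and record the mean each time
boot_means <- replicate(1000, mean(sample(counts, replace = TRUE)))

# Percentile 95% confidence interval for the mean
ci <- quantile(boot_means, c(0.025, 0.975))
ci
```

No normality assumption is needed here, which suits skewed count data like authors per paper.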

The relationship between publication date and number of authors per paper becomes clearer if it’s broken down by journal:

Here linear regression reveals a significant positive coefficient for year against mean author count per paper, as high as .52 extra authors per year on average, down to just .06 a year for PLOS ONE. Surprisingly the mega-journal which published around 80,000 papers over this time period seems least susceptible to author inflation.

The explained variance in mean number of authors per paper (per year) ranges from .28 (PLOS ONE) up to an impressive .87 for PLOS Medicine, with PLOS Computational Biology not far behind on .83. However, PLOS Computational Biology had the second lowest regression coefficient, and the lowest average number of authors of the six journals — maybe us introverted computer types should be collaborating more widely!
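As a sketch of where these two numbers come from, the slope and R² can be pulled out of a fitted `lm` object as below. The yearly means here are invented for illustration and do not correspond to any of the six journals.

```r
# Made-up yearly mean author counts for one journal (illustration only)
means <- data.frame(year   = 2006:2013,
                    counts = c(5.0, 5.3, 5.2, 5.6, 5.9, 5.8, 6.2, 6.4))

fit <- lm(counts ~ year, data = means)

# Regression coefficient: extra authors per paper per year
slope <- coef(fit)[["year"]]

# Explained variance in the yearly means
r2 <- summary(fit)$r.squared

c(slope = slope, r.squared = r2)
```

Note that a journal can have a high R² (a very consistent trend) while still having a small slope (a slow one), which is the pattern described for PLOS Computational Biology.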

## Journal effects

It’s interesting to speculate on what drives the observed differences in author inflation between journals. A possible covariate is the roundly-condemned “Impact Factor” journal-level metric — are “high impact” journals seeing more author creep than lesser publications?

If the estimate of author inflation is plotted against the respective journals’ recent impact factors, there does indeed appear to be a positive correlation. However, this comparison only has 6 data points, so there’s not enough evidence to reject the null hypothesis that there is no relationship between these two variables (p = 0.18).
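The test itself is a one-liner. The sketch below uses the six per-journal trend estimates and 2012 impact factors that are hard-coded in the full code listing at the end of this post; the variable names are chosen here for readability.

```r
# Per-journal author inflation estimates (authors per paper per year)
trend <- c(0.2490979, 0.1211823, 0.5201688,
           0.4088536, 0.05894102, 0.1828939)

# Corresponding 2012 journal impact factors
impact <- c(12.690, 4.867, 8.517, 15.253, 3.730, 8.136)

# Pearson correlation test (the default method for cor.test)
ct <- cor.test(trend, impact)

ct$estimate  # correlation coefficient
ct$p.value   # not significant at the 0.05 level
```

With only four degrees of freedom, even a fairly strong correlation coefficient falls short of significance.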

## Conclusion

#### Is author inflation occurring?

**Yes**, it certainly appears to be on average.

#### Is it a problem?

I don’t know, but I’d lean towards probably not.

The average trends could be reflecting the proliferation of “Big Science” with huge collaborative consortiums like ENCODE and FANTOM (though the main papers of those examples were targeted to *Nature*), and don’t necessarily support a conclusion that Publish or Perish culture is forcing lots of token authorships and backhanders between scientists.

Maybe instead (as the original discussion hypothesised), people who traditionally may not have been credited with authorship (bioinformaticians doing end-point analysis and lab technicians) are now getting recognised for their input more often, or conceivably advances in cloud computing, distributed data storage and video conferencing have better enabled larger collaborations between scientists across the globe!

Either way, evidence for author inflation is not evidence of a problem *per se* 🙂

### Caveats

- Means used for regression: while we get a surprisingly high R² for regression on the mean number of authors per paper per year, the predictive power for individual papers unsurprisingly vanishes (R² plummets to between .02 and 4.6 × 10⁻⁴ per journal, though significant non-zero coefficients remain). Author inflation wouldn’t be expected to exhibit consistent and pervasive effects in all papers; for example, reviews, letters and opinion pieces presumably have consistently lower author counts than research articles, and not all science can work in a collaborative, multi-author framework.
- Search limits: rplos returns a maximum of 1000 results at a time (but they can be returned sequentially using the `start` and `limit` parameters). They seem to be drawn in reverse order of time, so the results here probably aren’t fully representative of the year in some cases. This has also meant my sample is unevenly split between journals: PLoSBiology: 2487; PLoSCompBiol: 3403; PLoSGenetics: 4013; PLoSMedicine: 2094; PLoSONE: 7176; PLoSPathogens: 3647; **Total:** 22,460.
- Resolution: this could be done in a more fine-grained way, say with monthly bins. As mentioned above, for high-volume journals like PLOS ONE, the sample is likely coming from the end of the years from ~2010 onwards.
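The search-limit caveat could be worked around by paging through the results with `start` and `limit`. The sketch below only computes the offsets; `page_starts` is a hypothetical helper name, and the total number of hits is assumed to be known from a first query.

```r
# Hypothetical helper: offsets needed to page through `total` hits,
# `limit` records at a time.
page_starts <- function(total, limit = 1000) {
  if (total < 1) return(numeric(0))
  seq(0, total - 1, by = limit)
}

# A year with 2500 matching papers needs three requests
offsets <- page_starts(2500)
offsets
```

Each offset would then be supplied as the `start` argument of a separate query, with `limit` kept at 1000, and the pieces bound together afterwards.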

#### Full code to reproduce analysis

```r
options(PlosApiKey = "<insert your API key here!>")
#install_github("rplos", "ropensci")
library("rplos")
library("ggplot2")
library("dplyr")

# Convert author strings to counts
countAuths <- function(cell)
  length(unlist(strsplit(cell, ";")))
countAuths <- Vectorize(countAuths)

# Query PLoS API for 1k papers per journal per year,
# count the number of authors and return a data.frame
getAuths <- function(j, lim=1000, start.year=2006){
  cat("Getting results for journal: ", j, "\n")
  # seem to be in reverse order by year?
  results <- sapply(start.year:2013, function(i) data.frame(year = i,
    auths = searchplos(
      q = paste0('publication_date:[', i,
                 '-01-01T00:00:00Z TO ', i,
                 '-12-31T23:59:59Z]'),
      fl = "author",
      fq = list("doc_type:full",
                paste0("cross_published_journal_key:", j)),
      start=0, limit=lim, sleep=6)),
    simplify=FALSE)
  results <- do.call(rbind, results)
  results$counts <- countAuths(results$author)
  results$journal <- j
  results
}

journals <- journalnamekey()
plos.all <- sapply(journals[c(1:5, 7)], getAuths, simplify=FALSE)
plos <- do.call(rbind, plos.all)

# Fig. 1: Bean plot showing distribution of author counts
# per journal overall
ggplot(plos, aes(x=journal, y=counts, fill=journal)) +
  geom_violin(scale="width") +
  geom_boxplot(width=.12, fill=I("black"), notch=TRUE,
               outlier.size=NA, col="grey40") +
  stat_summary(fun.y="median", geom="point", shape=20, col="white") +
  scale_y_log10(breaks=c(1:5, seq(10, 50, by=10), 100, 200, 300)) +
  coord_flip() + labs(x="", y="Number of authors per paper") +
  theme_classic() + theme(legend.position="none") +
  scale_fill_brewer()

# Fig 2. ECDFs of the author count distributions
# 5in x 5in
ggplot(plos, aes(x=counts, col=journal)) +
  stat_ecdf(geom="smooth", se=FALSE, size=1.2) + theme_bw() +
  scale_x_log10(breaks=c(1:5, seq(10, 50, by=10), 100, 200, 300)) +
  theme(legend.position=c(.75,.33)) +
  labs(x="Number of authors per paper", y="ECDF",
       col="") + coord_cartesian(xlim=c(1,300)) +
  scale_color_brewer(type="qual", palette=6)

# Fig 3. Trends in author counts over time with
# confidence limits on the means
# 7 x 7
ggplot(plos, aes(x=year, y=counts, col=journal, fill=journal)) +
  stat_summary(fun.data="mean_cl_boot", geom="ribbon",
               width=.2, alpha=I(.5)) +
  stat_summary(fun.y="mean", geom="line") +
  labs(x="Year", y="Mean number of authors per paper") +
  theme_bw() + theme(legend.position=c(.2,.85)) +
  scale_fill_brewer(type="qual", palette=2,
                    guide=guide_legend(direction="vertical",
                                       label.position="bottom",
                                       title=NULL, ncol=2,
                                       label.hjust=0.5)) +
  scale_color_brewer(type="qual", palette=2, guide="none")

# from http://stackoverflow.com/a/17024184/1274516
# show regression equation on each graph facet
lm_eqn <- function(df){
  m <- summary(lm(counts ~ year, df))
  eq <- substitute(~~y~"="~beta*x+i~(R^2==r2),
                   list(beta = format(m$coefficients[2,"Estimate"],
                                      digits = 3),
                        i = format(m$coefficients[1,"Estimate"], digits=3),
                        r2 = format(m$r.squared, digits=2)))
  as.character(as.expression(eq))
}

# Yearly mean author count per journal
means <- group_by(plos, journal, year) %>% summarise(counts=mean(counts))
b <- by(means, means$journal, lm_eqn)
df <- data.frame(beta=unclass(b), journal=names(b))
summary(lm(counts ~ year + journal, data=means))

# Highest yearly mean per journal, used to position the equation labels
means <- group_by(means, journal) %>% summarise(m=max(counts))
df$top <- means$m * 1.2

# Fig 4. Facetted linear regression of author inflation per journal
# 6 x 8.5
ggplot(plos, aes(x=year, y=counts, col=journal, fill=journal)) +
  stat_summary(fun.data="mean_cl_boot", geom="errorbar",
               width=.2, alpha=I(.5)) +
  stat_summary(fun.y="mean", geom="point") +
  #stat_summary(fun.y="median", geom="point", shape=4) +
  facet_wrap(~journal, scales="free_y") +
  geom_smooth(method="lm") +
  scale_x_continuous(breaks=2006:2013) +
  labs(x="", y="Mean number of authors per paper") +
  theme_bw() + theme(axis.text.x=element_text(angle=45, hjust=1)) +
  scale_fill_brewer(type="qual", palette=2, guide="none") +
  scale_color_brewer(type="qual", palette=2, guide="none") +
  geom_text(data=df, aes(x=2009.5, y=top, label=beta), size=3, parse=TRUE)

# Overall estimate of author inflation?
# .21 extra authors per paper per year, on average
s <- summary(lm(counts ~ year + journal, data=plos))

# Summary barchart data:
bc <- data.frame(journal = unique(means$journal),
                 trend = c(0.2490979,
                           0.1211823,
                           0.5201688,
                           0.4088536,
                           0.05894102,
                           0.1828939),
                 std.err = c(0.08224567,
                             0.02213142,
                             0.1493662,
                             0.06361849,
                             0.03891493,
                             0.03798822),
                 IF = c(12.690,
                        4.867,
                        8.517,
                        15.253,
                        3.730,
                        8.136))
bc$journal <- factor(bc$journal, levels=bc$journal[order(bc$trend)])

# Fig 5. Barchart of author inflation estimate per journal.
# 7 x 5
ggplot(bc, aes(x=journal, y=trend, fill=journal, ymin=trend-std.err,
               ymax=trend+std.err)) +
  geom_bar(stat="identity") +
  geom_errorbar(width=.2) +
  scale_y_continuous(expand=c(0,0)) +
  theme_classic() +
  labs(x="",
       y="Estimate of annual author inflation (additional mean authors per paper)") +
  theme(axis.text.x=element_text(angle=45, hjust=1)) +
  scale_fill_brewer(palette="Blues", guide="none")

pcc <- cor(bc$trend, bc$IF)

# Fig 6. Correlation of author inflation and journal impact factors.
# 5 x 5
ggplot(bc, aes(x=trend, y=IF, col=journal)) +
  geom_text(aes(label=journal)) + xlim(0,.6) +
  labs(x="Author inflation estimate",
       y="Journal impact factor (2012)") +
  scale_color_brewer(type="qual", palette=2, guide="none") +
  annotate("text", x=.05, y=15,
           label=paste0("rho == ", format(pcc, digits=2)), parse=TRUE)

# N.S. (p = 0.18)
cor.test(bc$trend, bc$IF)
```

Nice post. If you have any feedback on rplos, I’d love to hear it at https://github.com/ropensci/rplos/issues?state=open

I’ll have a think but on the whole it was really useful and easy to use!

Cool, thanks 🙂

I wrote a short something on the topic for the Journal of the American Chemical Society, over a longer time span (1961–2011): see http://blog.coudert.name/post/2014/02/21/Evolution-of-chemistry-writing-over-5-decades and http://blog.coudert.name/post/2014/02/26/Evolution-of-chemists-world-over-5-decades-%28part-2%29

Thanks for those links, very interesting to look at this over a longer time period.

There’s a bug in line 94 of the code:

`means <- group_by(means, journal) %.% summarise(m=max(top))`

I think this should be:

`means <- group_by(means, journal) %.% summarise(m=max(counts))`

Cheers,

Marco

Great work by the way. Enjoyed the post very much.

You’re right, now fixed — thanks!

Very nice job!!! An off-topic question: what options did you use to save the plots? I always have problems getting nice plots for publications.

Thanks for your comment!

I generally draw on the RStudio graphics device until it looks good and then save to a vector format by calling pdf or svg (ggsave is another option). The actual svg calls with figure sizes are in here: https://github.com/blmoore/blogR/blob/master/R/plos_authInflation.R

When saving a figure I try to keep the dimensions small so that the labels and legends remain legible when printed or squeezed into a blog post.

Thank you for your comment. I also like vector formats, but some journals require TIFF, and sometimes this is a pain. The suggestion to keep the dimensions small is quite useful.

I like the analysis, but I think that segmentation is required to make sense of the question.

Author inflation is well known across the sciences (though some like to think it indicates more effort):

http://www.ncbi.nlm.nih.gov/pubmed/24107079

Thanks for this, a very narrow/specific example but interesting nonetheless.
