Satoshi Village the blog of Daniel Himmelstein

The licensing of bioRxiv preprints

Jordan Anaya of Omnes Res — creator of the PrePubMed search engine for biomedical preprints — recently compared bioRxiv to PeerJ Preprints. We agree that PeerJ offers the better technology and user experience. However, bioRxiv has greater adoption in the biodata sciences.

In fact, since my last blog post on preprints at the beginning of 2016, bioRxiv has grown by 149% from 2,785 to 6,933 preprints. The growth has been fueled largely by the efforts of ASAPbio and the growing recognition that publishing delays are interfering with science.

While PeerJ requires that all preprints are published under a Creative Commons Attribution (CC BY) License, bioRxiv allows authors to choose from five options: CC BY, CC BY-ND (Attribution-NoDerivatives), CC BY-NC (Attribution-NonCommercial), CC BY-NC-ND (Attribution-NonCommercial-NoDerivatives), and no license (all rights reserved).

The purpose of Creative Commons licenses is to allow the reuse of content that would otherwise be prohibited by copyright. As a result, the open licensing of preprints is crucial for the growth of open access, the movement to make publicly-funded research articles available to and reusable by the public.

License breakdown

Unfortunately, of the five options offered by bioRxiv, CC BY is the only open license, meaning one under which content “can be freely used, modified, and shared by anyone for any purpose.” Therefore, with the help of Jordan Anaya, I looked into which licensing options authors were choosing for their bioRxiv preprints (through the end of November 2016). The breakdown shows that there’s major room for improvement (see why license choices matter below).

License       Count   Percent   Score
CC BY         1,237     17.8%       5
CC BY-ND        496      7.2%       3
CC BY-NC        586      8.5%       3
CC BY-NC-ND   2,553     36.8%       2
None          2,061     29.7%       1
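
The Percent column follows directly from the counts. Here is a minimal sketch of that arithmetic, using only the numbers in the table; the variable and function names are illustrative and are not taken from the analysis code for this post.

```python
# Counts copied from the license table above.
counts = {
    "CC BY": 1237,
    "CC BY-ND": 496,
    "CC BY-NC": 586,
    "CC BY-NC-ND": 2553,
    "None": 2061,
}

# Total number of preprints in the breakdown.
total = sum(counts.values())


def percent(license_name):
    """Share of all preprints under the given option, rounded to one decimal."""
    return round(100 * counts[license_name] / total, 1)


for name, count in counts.items():
    print(f"{name:<12} {count:>6,} {percent(name):>5.1f}%")
```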

Licenses over time

Next, I looked to see whether bioRxiv preprints were becoming more open over time. They weren’t. In fact, the proportion of CC BY licenses has been in decline since mid 2014.

Licenses by subject

What about licensing by discipline? The figure below shows all subjects with at least 100 preprints. Bioinformatics appears to be the most open, whereas Cell Biology is the least open.

The bioRxiv leaderboard

Using the scoring system above, I assigned each preprint a score based on its license. Each author received a score equaling the sum of the scores of their preprints.

Middle initials were removed to consolidate duplicate names for the same person. If you’re worried that this analysis conflates multiple authors with the same name, start using ORCID. Luckily, with only 29,436 distinct author names, name collisions are yet to be a major problem.
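
The scoring and name consolidation described above could look roughly like the sketch below. It is not the code from the GitHub repository for this post: the input format (a list of author-list and license pairs), the rule for spotting middle initials, and the example names are all assumptions made for illustration.

```python
from collections import defaultdict

# Scores from the license table above.
LICENSE_SCORES = {
    "CC BY": 5,
    "CC BY-ND": 3,
    "CC BY-NC": 3,
    "CC BY-NC-ND": 2,
    "None": 1,
}


def strip_middle_initials(name):
    """Drop single-letter middle tokens such as "B." or "B".

    Assumed rule: the first and last tokens are always kept.
    """
    tokens = name.split()
    if len(tokens) <= 2:
        return " ".join(tokens)
    middle = [t for t in tokens[1:-1] if len(t.rstrip(".")) > 1]
    return " ".join([tokens[0], *middle, tokens[-1]])


def leaderboard(preprints):
    """Sum license scores per consolidated author name, highest score first."""
    totals = defaultdict(lambda: {"preprints": 0, "score": 0})
    for authors, license_name in preprints:
        # A set avoids counting an author twice on the same preprint.
        for author in {strip_middle_initials(a) for a in authors}:
            totals[author]["preprints"] += 1
            totals[author]["score"] += LICENSE_SCORES[license_name]
    return sorted(totals.items(), key=lambda item: (-item[1]["score"], item[0]))


# Hypothetical records, not real bioRxiv data.
example = [
    (["Ada B. Example", "Cy Sample"], "CC BY"),
    (["Ada Example"], "None"),
]
for author, stats in leaderboard(example):
    print(author, stats["preprints"], stats["score"])
```

Note that this kind of consolidation merges "Ada B. Example" with "Ada Example" but cannot tell apart two different people who share a name, which is the collision risk mentioned above.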

Congratulations to Mark Daly of the Broad Institute for leading both in terms of number of preprints and score. Also notable was Jesse Bloom of the Fred Hutchinson Cancer Research Center, who’s posted 10 preprints, all under CC BY. Condolences to my advisor, Casey Greene, whose unfortunate decision to post one of his 11 bioRxiv preprints under all rights reserved kept him out of the top 10.

License implications

Why are only a paltry 17.8% of bioRxiv preprints openly licensed? Perhaps many authors are unaware of the implications of their choice.

First, it’s important to note that these licenses affect copyright, which only covers the original works of authorship in the preprint, such as writing and figures. The licenses do not affect whether others can use the knowledge or inventions presented in the preprint. If you want to prevent others from using your discoveries, you’ll need a patent.

Second, many preprints go on to be published in subscription journals where authors transfer copyright or grant the journal an exclusive license to publish. In essence, they forfeit their copyrights. As a consequence, openly licensing your preprint is the best way to ensure that you yourself retain the right to reuse it.

Ultimately, most researchers care about getting citations and preventing plagiarism. For both, the Attribution clause (BY) of Creative Commons licenses is sufficient. It is summarized as:

You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.

NC or ND stipulations provide no additional protection against plagiarism. And anyway, scholarly norms, rather than threats of being sued, are what actually encourage attribution. If you want to maximize citations, choose the license that allows for maximum dissemination.

All rights reserved

29.7% of bioRxiv preprints are all rights reserved, placing a large portion of the bioRxiv corpus in the same troublesome legal situation as traditional academic publishing.

Copying or distributing these preprints is copyright infringement, exposing well-intentioned scientists to legal peril. Diego Gomez, who was a master’s student in Colombia 🇨🇴, learned this the hard way when in 2011 he posted a thesis on Scribd to share with colleagues. Now Diego’s legal nightmare has lasted for more than two years, and he’s still facing the potential of a 4–8 year jail sentence.

Furthermore, in almost all cases, it’s in the researcher’s best interest to license their preprints. The benefits for the preprint are many, including decreasing the friction for: other scientists to share, disseminate, or present it; search engines to index and display it; and text mining corpuses to include it.

Should bioRxiv cease to exist, preserving these unlicensed preprints would be fraught with risk. Who wants to look back in their elder years to see their past research disappeared? And in the case of publicly-funded research, there’s an ethical imperative to make it freely available and reusable.

Non-commercial

75.0% of bioRxiv preprints forbid commercial use. Unfortunately, the ambiguity of what qualifies as commercial dissuades potential users. But commercial use is awesome — it means that someone has found a way to add enough value to your work that others will pay for it. And unless you plan on selling your preprint, there’s zero opportunity cost.

We’re quickly entering an era where scientific literature is read first by a computer, and then second, if at all, by a human. As Chris Hartgerink, a text miner in the Netherlands 🇳🇱 who purchased legal insurance during his PhD, explains, “NC restricts the value of text and data mining outputs by limiting the potential applications of the method.”

No derivatives

73.7% of bioRxiv preprints forbid derivatives. A derivative could be that someone likes your figure and wants to modify it. Or someone wants to translate your preprint into a different language. Even more so than NC, ND stipulations are incompatible with other licenses, making it difficult to remix or combine content.
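
The 75.0% and 73.7% figures can be reproduced from the counts in the license breakdown if all rights reserved preprints are counted as forbidding both commercial use and derivatives. That grouping is an assumption inferred from the numbers, and the names in this sketch are illustrative.

```python
# Counts copied from the license breakdown.
counts = {
    "CC BY": 1237,
    "CC BY-ND": 496,
    "CC BY-NC": 586,
    "CC BY-NC-ND": 2553,
    "None": 2061,
}

# Assumed grouping: "None" (all rights reserved) permits neither kind of reuse.
forbids_commercial = {"CC BY-NC", "CC BY-NC-ND", "None"}
forbids_derivatives = {"CC BY-ND", "CC BY-NC-ND", "None"}


def share(licenses):
    """Percent of all preprints that fall under any of the given options."""
    total = sum(counts.values())
    restricted = sum(counts[name] for name in licenses)
    return round(100 * restricted / total, 1)


print("forbid commercial use:", share(forbids_commercial))
print("forbid derivatives:", share(forbids_derivatives))
```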

What can I do?

  • Authors: Choose CC BY in the future. If you’ve posted any preprints that aren’t CC BY, you can submit a revision on bioRxiv to update the license (assuming that you still own the rights and the work hasn’t been published by a journal).

    If you’re worried about some nebulous risk of open licensing, just remember that there are already over a quarter billion CC BY licensed works. Note that if your preprint includes content that you don’t own the rights to and that isn’t compatibly licensed, you’ll have to seek permission or take a wager on fair use.

  • Funders: Require open! See the Gates Foundation’s policy as an example.

  • Preprint Servers: Remove the all rights reserved option. Consider replacing it with CC0 (Public Domain Dedication), which is ideal for preprints by US Government employees 🇺🇸, whose work is not subject to copyright under US law but would still benefit from CC0 internationally.

  • Researchers: Check out the GitHub repository for this post or play with the binder.

Preprints are an exciting development in scholarly communication. Now let’s start off down the right track.


Update on March 29, 2017: Jessica Polka, Director of ASAPbio, noticed that bioRxiv reordered their license options so the more open licenses now appear first (as I suggested in the comments below). Accordingly, I compared the distribution of licenses for the 100 days preceding this blog post to the 100 days following this blog post. The proportion of CC BY licenses increased from 13.7% to 21.1% (p = 1.67 × 10⁻⁸). Based on this effect, I estimate this blog post led to 153 additional CC BY preprints in the 100 days following its publication!
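
A before-and-after comparison of this kind can be sketched as a two-proportion z-test. The update does not say which test produced the p-value, nor how many preprints fell in each 100-day window, so both the choice of test and the counts below are assumptions; the counts are made up purely to show the call.

```python
import math


def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Two-sided z-test for a difference between two proportions (pooled variance)."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / standard_error
    # Two-sided tail probability of a standard normal beyond |z|.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts, NOT the real bioRxiv numbers:
# 100 of 1,000 preprints were CC BY before, 150 of 1,000 after.
z, p_value = two_proportion_z_test(100, 1000, 150, 1000)
print(f"z = {z:.2f}, p = {p_value:.2g}")
```

With the real counts for the two windows in place of the hypothetical ones, the same function would give a comparable test of the shift in CC BY share.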
