On author versus numeric citation styles

2018-03-12

Should citations in scholarly writing appear as author-year snippets, like (Pantcheva, 2018; Zelle, 2015), or numbers, like [1,2]? Let’s refer to these two methods as author-style and numeric-style. You may have also heard them referred to as the Harvard and Vancouver referencing systems.

Author-style

Here’s an example of author-style from our recent Sci-Hub Coverage Study published in eLife. First, see how citations appear in the main text:

Notice how studies with 3 or more authors use “et al.” rather than listing every single author. Also note how the letter a was appended to Van Noorden, 2013 to denote that this is the first of the two Van Noorden articles from 2013 that we cited.

With author-style, references (items in the bibliography) are sorted alphabetically by first-author’s surname:
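The labelling and sorting rules described above can be sketched in a few lines of Python. This is only an illustration of the general scheme, not eLife's actual typesetting logic: the function names, the dictionary layout, and the choice to hand out suffix letters in list order are all assumptions, and the placeholder co-author names are made up.

```python
from collections import Counter, defaultdict
from string import ascii_lowercase


def base_label(surnames, year):
    """Author-year label before any disambiguating suffix."""
    if len(surnames) >= 3:
        names = f"{surnames[0]} et al."
    elif len(surnames) == 2:
        names = f"{surnames[0]} & {surnames[1]}"
    else:
        names = surnames[0]
    return f"{names}, {year}"


def author_year_labels(references):
    """Return one in-text label per reference, appending a, b, c, ...
    only when several references share the same base label.
    Assumption: letters follow list order; at most 26 collisions are handled."""
    bases = [base_label(r["surnames"], r["year"]) for r in references]
    totals = Counter(bases)
    seen = defaultdict(int)
    labels = []
    for base in bases:
        if totals[base] > 1:
            labels.append(base + ascii_lowercase[seen[base]])
            seen[base] += 1
        else:
            labels.append(base)
    return labels


def sort_bibliography(references):
    """Sort by first-author surname (case-insensitive), then by year."""
    return sorted(references, key=lambda r: (r["surnames"][0].lower(), r["year"]))


# Hypothetical reference list; only the surnames and years matter here.
refs = [
    {"surnames": ["Van Noorden"], "year": 2013},
    {"surnames": ["Watson", "Crick"], "year": 1953},
    {"surnames": ["Van Noorden"], "year": 2013},
    {"surnames": ["Li", "Placeholder B", "Placeholder C"], "year": 2017},
]
labels = author_year_labels(refs)
first_authors = [r["surnames"][0] for r in sort_bibliography(refs)]
```

Note that the suffix letter depends on every other reference in the list, so adding one more 2013 Van Noorden article changes existing labels.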

Numeric-style

Here’s the same paragraph as above, but using numeric-style, which is the default for Manubot – the tool we used to write the manuscript:

When using numeric-style citations, references are numbered according to the order they were cited. As such, the references section (bibliography) is a numbered list:
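As a rough sketch of that numbering rule (not Manubot's real implementation; the function name and the reference keys below are hypothetical), each reference gets the next free number the first time it is cited and reuses that number afterwards:

```python
def number_citations(cited_keys):
    """Map each reference key to a number by order of first citation."""
    numbers = {}
    for key in cited_keys:
        if key not in numbers:
            numbers[key] = len(numbers) + 1
    return numbers


# Hypothetical citation sequence as it appears in the main text.
cited = ["ref-b", "ref-a", "ref-b", "ref-c"]
numbers = number_citations(cited)

# In-text rendering and the numbered bibliography order.
in_text = [f"[{numbers[key]}]" for key in cited]
bibliography = sorted(numbers, key=numbers.get)
```

Because the number depends only on first appearance, inserting a new citation early in the document shifts every later number, which is why this style is best left to typesetting software.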

Usage in PubMed Central

In general, each journal (or even publisher) has a preferred citation style that’s applied to all of their articles. However, I couldn’t find much information on the overall prevalence of the two styles. Hence, I turned to the PubMed Central (PMC) Open Access (OA) Subset, which as of March 4, 2018 contained full texts for 1,875,131 articles in a standardized machine-readable format (JATS XML). This corpus is fantastic for text & data mining. It could be even better, though: due to licensing issues, 61% of the 4.8 million articles in PMC are excluded from the OA Subset. Please help by publishing only in libre OA journals!

Anyways, 1,602,392 articles included citations and references (source code). I crafted a heuristic (a bunch of handmade rules) to classify the citation style of an article as numeric, author, or unknown. While there’s no guarantee the algorithm correctly classifies every citation/article, I fed it a collection of test cases (enforced via continuous integration) to ensure it’s not too misbehaved.
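To give a flavor of what such a heuristic might look like, here is a toy version. It is not the rule set used for the analysis: the two regular expressions, the majority-vote threshold, and the function names are assumptions chosen for illustration, and the input is assumed to be the text of each in-text citation already extracted from the XML.

```python
import re
from collections import Counter

# Assumed patterns: a bare or bracketed number versus a name followed by a year.
NUMERIC = re.compile(r"[\[\(]?\d+[\]\)]?")
AUTHOR = re.compile(r"[A-Za-z].*\b(?:1[5-9]|20)\d{2}[a-z]?\b")


def classify_citation(text):
    """Classify one in-text citation string."""
    text = text.strip()
    if NUMERIC.fullmatch(text):
        return "numeric"
    if AUTHOR.search(text):
        return "author"
    return "unknown"


def classify_article(citation_texts, threshold=0.5):
    """Label an article by majority vote over its citations.
    Falls back to 'unknown' when no label exceeds the threshold."""
    votes = Counter(classify_citation(t) for t in citation_texts)
    if not votes:
        return "unknown"
    label, count = votes.most_common(1)[0]
    return label if count / sum(votes.values()) > threshold else "unknown"


# A few test cases in the spirit of the ones run under continuous integration.
assert classify_article(["1", "2", "[3]"]) == "numeric"
assert classify_article(["Van Noorden, 2013a", "Li et al., 2017"]) == "author"
assert classify_article(["▶", "▶"]) == "unknown"
```

A citation rendered as a symbol rather than a number or a name and year falls through both patterns, which is the kind of case that ends up as unknown.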

Overall, 86.0% of articles used numeric-style and 12.2% used author-style (the algorithm could not resolve 1.8% of articles, classifying them as unknown). Here’s the popularity of each citation style by year of publication (absolute counts on the left, normalized proportions on the right):

Popularity of citation styles by year

We see that each year more articles were added to the PMC OA Subset than the year before, with 255,736 articles from 2017. However, the relative popularity of author- versus numeric-style citations has remained fairly constant.

What about the proliferation of unknown-style articles from 2008–2012? These are almost all from the Acta Crystallographica series of journals, which hyperlink citations using the ▶ symbol. See PMC3793688 for example. Manual inspection reveals that Acta Crystallographica references are actually author-style.

Here are some easter eggs:

In 2003 and 2004, PLOS Biology published 283 articles using author-style citations before switching to numeric-style, which remains the PLOS style to this day.
The article with the most references at 3,112 is World checklist of hornworts and liverworts in the journal PhytoKeys (PMC4758082), which uses author-style for its 12,274 in-text citations.
The article with the second-most references at 2,857 is QCD and strongly coupled gauge theories from The European Physical Journal C (PMC4413533), which uses numeric-style to render its 3,679 in-text citations.

The worst case

Below, I’ll go over the pros and cons of author- versus numeric-style. But first, I want to introduce the Project Rephetio manuscript that will help exhibit the worst-case performance of the two styles.

Project Rephetio was the final act of my PhD, and we took a radically open approach for this study. First, we used the (now defunct) website Thinklab to post our proposal and publicly discuss the project while it was underway. All code was posted immediately to GitHub under an open license. Data was also uploaded to GitHub, or if too large, to Figshare. In the end, the project encompassed 86 Thinklab discussions, 41 GitHub repositories (23 of which we archived on Zenodo), 9 Figshare records, and several prepublication manuscripts (i.e. via Manubot, Thinklab, and bioRxiv).

The Project Rephetio manuscript was eventually published in the journal eLife with a total of 394 citations to 241 references. I was the first author of 93 of the references, which took a full three pages of bibliography in eLife’s PDF (eLife uses author-style):

Project Rephetio eLife PDF: all references to Himmelstein first-author works

Only one of these “self-citations” was to a traditional scholarly output — the predecessor study to Project Rephetio. The remaining 92 references were to non-traditional outputs generated during the course of the study: 62 Thinklab discussions/documents, 24 Zenodo records, 5 Figshare records, and 1 preprint.

As scientific communication grows beyond immutable journal articles in the coming decades, we can expect to see more and more studies like Project Rephetio, which will have a large number of references to non-journal outputs, such as code, data, and discussion forums. Today’s worst-case bibliography will be tomorrow’s average case.

Which style is best?

Which style do I prefer? Numeric-style! The worst-case performance of author-style is unacceptable. The benefits of author-style are becoming less and less relevant to modern scholarship and publication media.

To demonstrate my point, let’s examine the pros of each style. For this task, we’ll refer to a paragraph from Project Rephetio. Here’s the paragraph in author-style, via eLife Lens:

And here’s the same paragraph in numeric-style via Thinklab:

Advantages of author-style

Author-style has several benefits:

You can recognize the referenced work from just its in-text citation. For example, you don’t need a reference section to know which study this sentence cites (Watson & Crick, 1953). However, science has grown. No one has the mental capacity to remember every study in their field. Nowadays, imagine you’re reading a genomics study and encounter a citation to Li et al., 2017. Even if you’d memorized every genomics study in PubMed, you’d still be choosing between 84 papers. Furthermore, the advantages of immediate recognizability are scant when hovering over a citation pops up a tooltip with full reference information. Also, there’s less room for misrecognition when readers are always shown the full author list, title, and journal information when investigating a citation.
The first author is more visibly credited. Credit is important in science. As a first author, it can certainly be nice to find your surname in the main text of articles that you’ve influenced. Unfortunately, science is rarely a one-(wo)man job these days, with biomedical publications now averaging over five authors. Precisely because credit is so important for propelling the academic enterprise forward, we should avoid systems that tend to improperly credit individuals over communities.
Dates help readers establish the chronology of prior work and quickly identify outdated citations. For example, the average R&D cost of a new drug in the U.S. is $93 million (DiMasi et al., 1995). However, numeric-style citations still allow readers to access dates, albeit with an extra step. And when trying to reconstruct the chronology of a certain topic, I’ll often need to see the precise date in ISO 8601 format (e.g. 2018-03-12), which is too verbose for in-text citations anyways.
Citations and references can be prepared independently, without the assistance of a typesetting program. Not having to renumber every reference when adding a citation to the beginning of a document is a big advantage if you’re typesetting manually. However, there’s always the risk of author-year collisions, which either breaks the independence of citations and references or results in ambiguous citations.

Also, the future of scholarly writing is cite-by-identifier. For example, I’d cite Project Rephetio by its Digital Object Identifier like [@doi:10.7554/eLife.26726]. Then, typesetting software, such as Manubot, retrieves the corresponding bibliographic metadata and automatically formats both the citations and references according to whatever style you specify.
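The first half of that workflow, finding the identifiers and replacing them with numbers, can be sketched as follows. This is a simplified illustration rather than Manubot's actual parser: the regular expression only handles the single-citation bracket form shown above, the function name is hypothetical, and the metadata lookup is left out because it needs network access.

```python
import re

# Assumed syntax: [@source:identifier], one citation per pair of brackets.
CITATION = re.compile(r"\[@(?P<source>[a-z]+):(?P<identifier>[^\]\s]+)\]")


def render_numeric(text):
    """Replace cite-by-identifier markers with numbers assigned in order of
    first appearance, and return the matching list of reference stubs."""
    numbers = {}

    def replace(match):
        key = (match["source"], match["identifier"])
        numbers.setdefault(key, len(numbers) + 1)
        return f"[{numbers[key]}]"

    rendered = CITATION.sub(replace, text)
    references = [
        f"{number}. {source}:{identifier}"
        for (source, identifier), number in numbers.items()
    ]
    return rendered, references


text = (
    "Project Rephetio [@doi:10.7554/eLife.26726] is cited twice "
    "[@doi:10.7554/eLife.26726]."
)
rendered, references = render_numeric(text)
```

Since the source text stores only identifiers, switching the output to author-style would mean changing the rendering function, not the manuscript.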

Advantages of numeric-style

Now let’s look at the advantages of numeric-style:

Numeric-style is more space efficient. The 394 author-style citations from Project Rephetio consumed a total of 8,542 characters (excluding the surrounding parentheses and multi-citation separators; see the notebook). That’s an average of 21.7 characters per citation versus 2.6 for numeric-style. Author-style citations required 8 times as many characters as numeric-style!
Numeric-style is less distracting. Author-style citations can be visually distracting because they introduce large gaps into the flow of the prose. To avoid this disruption, author-style encourages grouping citations rather than interspersing them throughout a sentence. It also discourages citing a large number of works. The number and position of citations in scholarly writing should not be constrained by what is essentially a user-interface limitation!
Numeric lookup of references is easiest. For both humans and machines, it’s easier to navigate a numbered list compared to an alphabetical one. This is especially true when the alphabetical list has multiple references from the same first author. If you disagree, I challenge you to pick a Himmelstein citation from the snippet above and try to find it in the alphabetical reference compilation. Also notice the Himmelstein et al. references get all the way to 2015z… I wonder if 2015aa would have been next?
Numeric-style citations do not degrade when citing works without first authors. Sometimes references don’t have authors. Or the author is an organization or consortium. Author-style references can get quite unwieldy in these circumstances. For example, check out this dataset citation as per PeerJ’s author-style, which consumes 71 characters:

Pounds of meat purchased per household during 2006 was extracted from the 2011 Food Environment Atlas (United States Department of Agriculture Economic Research Service, 2014)
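The character counts quoted in this section are easy to recheck. The snippet below is not the original notebook, only a quick sanity check that reuses the totals stated above and counts the PeerJ dataset citation without its parentheses.

```python
# Totals quoted in the text for Project Rephetio.
total_author_chars = 8542
n_citations = 394
mean_numeric_chars = 2.6

mean_author_chars = total_author_chars / n_citations  # about 21.7
ratio = mean_author_chars / mean_numeric_chars  # a bit over 8

# The dataset citation from the PeerJ example, excluding parentheses.
dataset_citation = (
    "United States Department of Agriculture Economic Research Service, 2014"
)
dataset_citation_chars = len(dataset_citation)

print(f"{mean_author_chars:.1f} characters per author-style citation")
print(f"{ratio:.1f} times the numeric-style average")
print(f"{dataset_citation_chars} characters in the dataset citation")
```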

Treatise

In certain types of writing, author-style citations do make sense. Particularly, author-style is good for short, informal documents without a separate references section. This includes emails, Steem posts, GitHub issues, and even blog posts. Of course, you should hyperlink author-style citations to the referenced work, so there’s no ambiguity.

But, in today’s scholarly environment, numeric-style is preferable for substantial manuscripts, where several failure modes of author-style citations are beginning to appear. For the publishers out there, here are some takeaways:

If you’re using author-style citations, consider switching to numeric-style. PeerJ and eLife designed their interfaces with user experience of readers in mind. Interestingly, they both chose author-style. It will be informative to see if they decide to switch to numeric-style.
If you’re using numeric-style citations, make sure that HTML manuscript views provide tooltips with reference metadata. See for example, the Thinklab style above, which not only provides tooltips, but also highlights numeric citations by their type (external, project discussion, code or data).
Consider reformatting the style of references. I think the title should go first, before authors, as seen in the Manubot-generated references above. Always provide a standard identifier or URL in references. It’s the single most important element of a textual reference. In other words, PMC4805733 or https://doi.org/cmbr is a better reference than A. Ofosu, Aog 29 (2016) 1-8 (the style currently used by uBiome SmartGut reports).

Update: D. J. Bernstein explores additional arguments in favor of numeric citations in a 2024 blog post. Yes, this is the cryptographer from Bernstein v. United States, the series of court cases that established source code as First Amendment-protected speech!

Satoshi Village the blog of Daniel Himmelstein