Thursday, August 03, 2017

The State of Open Access: Some New Data

A preprint posted on PeerJ yesterday offers some new insight into the number of articles now available on an open-access basis. 

The new study is different to previous ones in a number of ways, not least because it includes data from users of Unpaywall, a browser plug-in that identifies papers that researchers are looking for, and then checks to see whether the papers are available for free anywhere on the Web. 

Unpaywall is based on oaDOI, a tool that scours the web for open-access full-text versions of journal articles.

Both tools were developed by Impactstory, a non-profit focused on open-access issues in science. Two of the authors of the PeerJ preprint – Heather Piwowar and Jason Priem – founded Impactstory. They also wrote the Unpaywall and oaDOI software.

The paper – which is called The State of OA: A large-scale analysis of the prevalence and impact of Open Access articles – reports that 28% of the scholarly literature (19 million articles) is now OA, and growing, and that for recent articles the percentage available as OA rises to 45%.

The study authors say they also found that OA articles receive 18% more citations than average. 

In addition, the authors report on what they describe as a previously under-discussed phenomenon of open access: Bronze OA. This refers to articles that are made free-to-read on the publisher’s website without an explicit open licence.

Below I publish a Q&A with Heather Piwowar about the study. 

Note: my questions were based on an earlier version of the article I saw, and a couple of the quotes I cite were changed in the final version of the paper. Nevertheless, all the questions and the answers remain relevant and useful so I have not changed any of the questions.

The interview

RP: What is new and different about your study? Do you feel it is more accurate than previous studies that have sought to estimate how much of the literature is OA, or is it just another shot at trying to do that?

HP: Our study has a few important differences:

·       We look at a broader range of the literature than previous studies and go further back (to pre-1950 articles), we look at more articles (all of Crossref, not just all of Scopus or Web of Science – Crossref has twice the number of articles that Scopus has), and we take a larger sample than most other studies. That’s because we classify OA status algorithmically, rather than relying on manual classification. This allowed us to sample 300k articles, rather than a few hundred as many OA studies have done. So, our sample is more accurate than most; and more generalizable as well.

·       We undertook a more detailed categorization of OA. We looked not just at Green and Gold OA, but also Hybrid, and a new category we call Bronze OA. Many other studies (including the most comparable to ours, the European Commission report you mention below) do not bring out all these categories specifically. (I will say more on that below). Furthermore, we didn’t include Academic Social Networks. Mixing those with publisher-hosted free-to-read content makes the results less useful to policy makers.

·       Our data and our methods are open, for anyone to use and build upon. Again, this is a big difference from the Archambault et al. study (that is, the one commissioned by the European Commission) and we think it is an important difference.

·       We include data from Unpaywall users, which allows us to get a sense of how much of the literature is OA from the perspective of actual readers. Readers massively favour newer articles, for instance, which is good news because such articles are more likely to be OA. By sampling actual reader data, from people using an OA tool that anyone can install, we can report OA percentages that are more realistic and useful for many real-world policy issues.

RP: You estimate that at least 28% of the scholarly literature is open access today. OA advocates tend nowadays to cite the earlier European Commission report which, the EU claims, indicates that back in 2011 nearly 50% of papers were OA. Was the EU study an overestimate in your view, or has there been a step backwards?

HP: Their 50% estimate was of recent papers, and included papers posted to ResearchGate (RG) and Academia.edu as open access. Our 28% estimate is for all journal articles, going back to 1900 – everything with a DOI. We found 45% OA for recent articles, and that’s excluding RG and Academia. So, they are pretty similar estimates.

RP: In fact, you came up with a number of different percentages. Can you explain the differences between these figures, why it is important to make these distinctions, and what the implications of the different figures are?

HP: There are two summary percentages: 28% OA for all journal articles, and 47% OA for journal articles that people read. As I noted, people read more recent articles, and more recent articles are more likely to be OA, so it turns out that almost half of the papers people are interested in reading right now are actually OA. Which is really cool!

Actually, when you consider that we used automated methods that missed a bit of OA, it is more than half, so the 47% is a lower bound.

RP: You coin a new definition of open access in your paper, what you call Bronze OA. Can you say something about Bronze OA and its implications? It seems to me, for instance, that a lot of papers (over half?) currently available as open access are vulnerable to losing their OA status. Is that right? If so, what can be done to mitigate the problem?

HP: Yes, we did think we were coining a new term. But this morning I learned we weren’t the first to use the term Bronze OA – that honour goes to Ged Ridgway, who used it in a tweet in 2014.

I guess it’s a case of Great Minds Think Alike!

Our definition of Bronze OA is the same as Ged’s: articles made free-to-read on the publisher’s website, without an explicit open license. This includes Delayed OA and promotional material like newsworthy articles that the publishers have chosen to make free but not open.

It also includes a surprising number of articles (perhaps as much as half of the Bronze total, based on a very preliminary sample) from entirely free-to-read journals that are not listed in DOAJ and do not publish content under an open license. Opinions will differ on whether these are properly called “Gold OA” journals/articles; in the paper, we suggest they might be called “Dark Gold” (because they are hard to find in OA indexes) or “Hidden Gold.” We are keen to see more research on this. 

More research is also needed to understand the other characteristics of Bronze OA. Is it disproportionately non-peer-reviewed content (e.g. front-matter), as seems likely? How much of Bronze OA is also Delayed OA? How much Bronze is Promotional, and how transient is the free-to-read status of this content? How many Bronze articles are published in “hidden gold” journals that are not listed in the DOAJ? Why are these journals not defining an explicit license for their content, and are there effective ways to encourage them to do so?

This kind of follow-up research is needed before we can understand the risks associated with Bronze and what kind of mitigation would be helpful.

RP: You say in your paper, “About 7% of the literature (and 17% of the OA literature) is Green, and this number does not seem to be growing at the rate of Gold and Hybrid OA.” You also suspect that much of this green OA is “backfilling” repositories with older articles, which are generally viewed as being of less value. What happened to the OA dream articulated by Stevan Harnad in 1994, and what future do you predict for green OA going forward?

HP: First, I should clarify: our definition of Green OA for the purposes of the study is that a paper is in a repository and is not available for free on the publisher site. This is so we don’t double count articles as both Green and Gold (or Hybrid or Bronze) for our analysis.

We gave publisher-hosted locations the priority in our classifications because we suspect most people would rather read papers there. So, in our article when we say green OA isn’t growing, what we mean is that more recent papers that are only available in repositories are available as Green OA at roughly the same rate as older papers.
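The classification rules described here can be sketched in a few lines of Python. This is only an illustration and not the study's actual code: the `Article` fields and the `classify` and `is_shadowed_green` helpers are hypothetical names. The split between Gold and Hybrid (Gold if the journal is listed in DOAJ, Hybrid if an openly licensed article sits in a subscription journal) is an assumption, since the interview does not spell that rule out.

```python
from dataclasses import dataclass


@dataclass
class Article:
    # All field names are hypothetical, chosen for this illustration only.
    free_on_publisher_site: bool  # free to read on the publisher's website
    has_open_license: bool        # carries an explicit open licence
    journal_in_doaj: bool         # journal is listed in DOAJ (assumed Gold test)
    in_repository: bool           # a copy is deposited in a repository


def classify(article: Article) -> str:
    """Assign exactly one category, giving publisher-hosted copies priority."""
    if article.free_on_publisher_site:
        if article.journal_in_doaj:
            return "gold"      # assumption: Gold means a DOAJ-listed journal
        if article.has_open_license:
            return "hybrid"    # assumption: open licence in a subscription journal
        return "bronze"        # free to read, but no explicit open licence
    if article.in_repository:
        return "green"         # only the repository copy is free
    return "closed"


def is_shadowed_green(article: Article) -> bool:
    """A repository copy that is 'shadowed' by a free publisher-hosted copy."""
    return article.in_repository and article.free_on_publisher_site
```

Because the publisher-hosted check comes first, a paper that is free on the publisher site and also sits in a repository is counted as Gold, Hybrid or Bronze, never as Green. That is why no article is counted twice.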

It is worth future study to understand this better. I have a suspicion: perhaps much of what would have been Green OA became Bronze and what we call “shadowed green” – where there is a copy in a repository and a freely available copy on the publisher’s site as well. I suspect publishers responded to funder mandates that require self-archiving by making the paper free on the publisher sites as well, in synchronized timing.

Specifically, Biomed doesn’t look like it has as much Green as I’d expect, given the success of the NIH mandate and the number of articles in PMC. We do know many biomed journals have Delayed OA policies, which we categorized as Bronze in our analysis. Did they implement these Delayed OA policies in response to the PMC mandates? Perhaps others already know this to be true... I haven’t had a chance to look it up. Anyway. I think the interplay between Green and Bronze is especially worth more exploration.

We do also report on all the articles that are deposited in repositories, Green plus shadowed green, in the article’s Appendices. We found the proportion of the literature that is deposited in repositories to be higher for recent publication years.

One final note: We actually changed the sentence that you quoted in the final version of our paper, because we were wrong to talk about “growing” as we did. Our study didn’t measure when articles were deposited in repositories, but just looked at their publication year. Other studies have demonstrated that people often upload papers from earlier years, a practice called backfilling.

I suppose in some ways these have less value, because they are read less often. That said, anyone who really needs a particular paper and doesn’t otherwise have access to it is surely happy to find it.

RP: You also looked at the so-called citation advantage and estimate that an OA article is likely to attract 18% more citations than average. The citation advantage is a controversial topic. I don’t want to appear too cynical, but is not the idea of trying to demonstrate a citation advantage more an advocacy tool than a meaningful notion? I note, for instance, that Academia.edu has claimed that posting papers to its network provides a 73% citation advantage. Surely the real point here is that if all papers were open access there would be no advantage to open access from a citation point of view?

HP: That’s true! And that’s the world I’d love to see – one where the citation playing field is flat, because everyone can read everything.

RP: What would you say were the implications of your study for the research community, for librarians, for publishers and for open access policies?

HP: For the research community: Install Unpaywall! You’ll be able to read half the literature for free. Self-archive your papers, or publish OA.

For OA/bibliometrics researchers: Build on our open data and code, let’s learn more about OA and where it’s going.

For librarians: Use this data to negotiate with publishers: Half the literature is free. Don’t pay full price for it.

For publishers: Half the literature is now free to read. That percentage is growing. You don’t need a weathervane to know which way the wind blows: long term, there’s no money in selling things that people can get for free. Flip your journals. Sell services to authors, not access to content – it’s an increasingly smart business decision, as well as the Right Thing To Do.

For open access policy makers: We need to understand more about Bronze. Bronze OA doesn’t safeguard a paper’s free-to-read status, and it isn’t licensed for reuse. This isn’t good enough for the noble and useful content that is Scholarly Research. Also: let’s accelerate the growth.

You didn’t ask about tool developers. An increasing number of people are making tools that they can integrate OA into. They should use the oaDOI service. Now that such a large chunk of the literature is free, there are a lot of really transformative things we can build and do – in terms of knowledge extraction, indexing, search, recommendation, machine learning etc.

RP: OA was at the beginning as much (in fact more) about affordability as about access (certainly from the perspective of librarians). I note the recently published analysis of the RCUK open access policy reports that the average APC paid by RCUK rose by 14% between 2014 and 2016, and that the increase was greater for those publishers below the top 10 (who are presumably focused on catching up with their larger competitors). Likewise, the various flipping deals we are seeing emerge are focused on no more than transferring costs from subscriptions to APCs, with no realistic expectation of prices falling in the future. If the research community could not afford the subscription system (which OA advocates have always maintained) how can it afford open access in the long-term?

HP: If the rising APCs are because small publishers are catching up with the leaders by raising prices, that won’t continue forever – they’ll catch up. Then it’ll work like other competitive marketplaces.

The main issue is freeing up the money that is currently spent on subscriptions. We think studies like this, and tools like Unpaywall, can be helpful in lowering subscription rates, and forgoing Big Deals, as libraries are increasingly doing.

RP: As you say, in your study you ignored social networking sites like Academia.edu and ResearchGate “in accordance with an emerging consensus from the OA community, and based largely on concerns about long-term persistence and copyright compliance.” And you also say, “The growing proportion of OA, along with its increased availability using tools like oaDOI and Unpaywall, may make toll-access publishing increasingly unprofitable, and encourage publishers to flip to Gold OA models.” I am wondering, however, if it is not more likely that sites like Academia.edu (which researchers much prefer to use than paying to publish or depositing in their repository) and Sci-Hub (which is said to contain most of the scientific literature now) will be the trigger that will finally force legacy publishers to flip their journals to open access, whatever one’s views on the copyright issues. Would you agree?

HP: It won’t be any one trigger, but rather an increasingly inhospitable environment. Sci-Hub is a huge contributor to that, and Academic Social Networks are too. Unpaywall opens up another front: a best-practice, legal approach to bypassing paywalls that librarians and others can unabashedly recommend. It all combines to make it easier and more profitable for publishers to flip, and for the future to be OA.

RP: Thank you for answering my questions.

Monday, July 17, 2017

On sponsorship, transparency, scholarly publishing, and open access

Sponsorship in the research and library communities is pervasive today, and scholarly publishers are some of the most generous providers of it. This generosity comes at a time when scholarly communication is in sore need of root-and-branch reform. However, since publishers’ interests are no longer aligned with the needs of the research community, and they have a vested interest in the legacy system, the research community might do best to avoid publisher sponsorship. Yet researchers and librarians seek it out on a daily basis.

While the benefits of this sponsorship to the research community at large are debatable, publishers gain a great deal of soft power from dispensing money in this way. And they use this soft power to help them contain, control and shape the changes scholarly communication is undergoing, often in ways that meet their needs more than the needs of science and of scientists. This sponsorship also often takes place without adequate transparency. 

Sponsorship and lobbying (which often amount to the same thing), for instance, have assisted legacy publishers to co-opt open access. This has seen the triumph of the pay-to-publish model, which has been introduced in a way that has enabled publishers to adapt OA to their needs, and to ringfence and port their excessive profits to the new OA environment. Those researchers who do not have the wherewithal to pay article-processing charges (APCs), however, are finding themselves increasingly disenfranchised.

Sponsorship has also to be seen in a larger context. With paywalls now viewed askance, and pay-to-read giving way to free-to-read, more and more content is being funded by the producers rather than the readers. This has a number of consequences. Above all, it has made it increasingly difficult to distinguish neutral information and reporting from partisan content created solely to serve the interests of the creator/sponsor. Now commonly referred to as “fake news”, this is normally associated with biased and/or false information about, say, politicians, elections, and celebrity deaths, and its origin and purpose are often unknown.

But open access has presented science with the same kind of problem. With many authors now choosing (or having) to pay for the publication of their papers, and publishers’ revenues directly related to the number of articles they publish, unscrupulous authors are now able to find an outlet for any paper regardless of its quality. 

It is therefore becoming increasingly difficult to distinguish legitimate science from pseudoscience. This is in part a consequence of publishers’ use of sponsorship (and lobbying) to foist a flawed business model on the science community. And by continuing to dispense sponsorship, publishers are able to perpetuate and promote this model, and maintain their grip on scholarly communication.

These are the kinds of issues explored in the attached essay (pdf file). It includes some examples of publisher sponsorship, and the associated problems of non-transparency that often go with it. In particular, there is a detailed case study of a series of interviews conducted by Library Journal (LJ) with leading OA advocates that was sponsored by Dove Medical Press.

Amongst those interviewed was the de facto leader of the OA movement Peter Suber. Suber gave three separate interviews to LJ, but not once was he informed when invited that the interviews were sponsored, or that they would be flanked with ads for Dove – even though he made it clear after the first interview that he was not happy to be associated with the publisher in this way.

The essay can be accessed as a pdf file here.

Tuesday, May 09, 2017

The Open Access Interviews: Jutta Haider

Many of us join causes and movements at different times in our lives, if only because we like to feel part of something bigger than ourselves, and because most of us have a healthy desire to improve the world. Unfortunately, movements often fail to achieve their objectives, or their objectives are significantly watered down – or lost sight of – along the way. Sometimes they fail completely.

When their movement hits a roadblock, advocates will respond in a variety of ways: “True believers” tend to carry on regardless, continuing to repeat their favoured mantras ad nauseam. Some will give up and move on to the next worthy cause. Others will take stock, seek to understand the problem, and try to find another way forward.

Jutta Haider, an associate professor in Information Studies at Lund University, would appear to be in the third category. Initially a proponent of open access, Haider subsequently “turned into a sceptic”. This was not, she says, because she no longer sees merit in making the scientific literature freely available, but because the term open access “has gained meanings and tied itself to areas in science, science policy-making, and the societal and economic development of society that I find deeply problematic.”

Above all, she says, she worries that open access has become “a business model, an indicator for performance measurement, tied to notions of development purely imagined as economic growth and so on.”

This is not how open access was envisaged when the movement began.

Monday, March 13, 2017

The OA interviews: Philip Cohen, founder of SocArXiv

(A print version of this interview is available here)

Fifteen years after the launch of the Budapest Open Access Initiative (BOAI) the OA revolution has yet to achieve its objectives. It does not help that legacy publishers are busy appropriating open access, and diluting it in ways that benefit them more than the research community. As things stand we could end up with a half revolution.

But could a new development help recover the situation? More specifically, can the newly reinvigorated preprint movement gain sufficient traction, impetus, and focus to push the revolution the OA movement began in a more desirable direction?

This was the dominant question in my mind after doing the Q&A below with Philip Cohen, founder of the new social sciences preprint server SocArXiv.

Preprint servers are by no means a new phenomenon. The highly-successful physics preprint server arXiv (formally referred to as an e-print service) was founded way back in 1991, and today it hosts 1.2 million e-prints in physics, mathematics, computer science, quantitative biology, quantitative finance and statistics. Currently around 9,000-10,000 new papers each month are submitted to arXiv.

Yet arXiv has tended to complement – rather than compete with – the legacy publishing system, with the vast majority of deposited papers subsequently being published in legacy journals. As such, it has not disrupted the status quo in ways that are necessary if the OA movement is to achieve its objectives – a point that has (somewhat bizarrely) at times been celebrated by open access advocates.

In any case, subsequent attempts to propagate the arXiv model have generally proved unsuccessful. In 2000, for instance, Elsevier launched a chemistry preprint server called ChemWeb, but closed it in 2003. In 2007, Nature launched Nature Precedings, but closed it in 2012.

Hope springs eternal

Fortunately, hope springs eternal in academia, and new attempts to build on the success of arXiv are regularly made. Notably, in 2013 Cold Spring Harbor Laboratory (CSHL) launched a preprint server for the biological sciences called bioRxiv. To the joy of preprint enthusiasts, it looks as if this may prove a long-term success. As of March 8th 2017, some 8,850 papers had been posted, and the number of monthly submissions has grown to around 620.

Buoyed up by bioRxiv’s success, and convinced that the widespread posting of preprints on the open Web has great potential for improving scholarly communication, last year life scientists launched the ASAPbio initiative. The initial meeting was deemed so successful that the normally acerbic PLOS co-founder Michael Eisen penned an uncharacteristically upbeat blog post about it (here).  

Has something significant changed since Elsevier and Nature unsuccessfully sought to monetise the arXiv model? If so, what? Perhaps the key word here is “monetise”. We can see rising anger at the way in which legacy publishers have come to dominate and control open access (see here, here, and here for instance), anger that has been amplified by a dawning realisation that the entire scholarly communication infrastructure is now in danger of being – in the words of Geoffrey Bilder – enclosed by private interests, both by commercial publishers like Elsevier, and by for-profit upstarts like ResearchGate and Academia.edu (see here, here and here for instance).

CSHL/bioRxiv and arXiv are, by contrast, non-profit initiatives whose primary focus is on research, and facilitating research, not the pursuit of profit. Many feel that this is a more worthy and appropriate mission, and so should be supported. Perhaps, therefore, what has changed is that there is a new awareness that while legacy publishers contribute very little to the scholarly communication process, they nevertheless profit from it, and excessively at that. And for this reason they are a barrier to achieving the objectives of the OA movement.

Reproducibility crisis

But what is the case for making preprints freely available online? After all, the research community has always insisted that it is far preferable (and safer) for scholars to rely on papers that have been through the peer-review process, and published in respectable scholarly journals, in order to stay up to date in their field, not on self-deposited early versions of papers that might or might not go on to be published.

Advocates for open access, however, now argue that making preprints widely available enables research to be shared with colleagues much more quickly. Moreover, they say, it enables papers to potentially be scrutinised by a much greater number of eyeballs than with the traditional peer review system. As such, they add, the published version of a paper is likely to be of higher quality if it has first been made available as a preprint. In addition, they say, posting preprints allows researchers to establish priority in their discoveries and ideas that much earlier. Finally, they argue, the widespread sharing of preprints would benefit the world at large, since it would speed up the entire research process and maximise the use of taxpayer money (which funds the research process).

Many had assumed that OA would provide these kinds of benefits. In addition to making papers freely available, it was assumed that open access would introduce a quicker time-to-publish process. This has not proved the case. For instance, while the peer review “lite” model pioneered by PLOS ONE did initially lead to faster publication times, these have subsequently begun to lengthen again.

Above all, open access has failed to address the so-called reproducibility crisis (also referred to as the replication crisis). By utilising a more transparent publishing process (sometimes including open peer review) it was assumed that open access would increase the quality of published research. Unfortunately, the introduction of pay-to-publish gold OA has undermined this, not least because it has encouraged the emergence of so-called predatory OA publishers (or article brokers), who gull researchers into paying (or sometimes researchers willingly pay) to have their papers published in journals that wave papers past any review process.

The reproducibility crisis is by no means confined to open access publishing (the problem is far bigger), but it could hold out the greatest hope for the budding preprint movement.

Why do I say this? And what is the reproducibility crisis? Stanford Professor of Medicine John Ioannidis neatly summarised the reproducibility crisis in 2005, when he called his seminal paper on the topic “Why most published research findings are false”. In this and subsequent papers Ioannidis has consistently argued that the findings of many published papers are simply wrong.

Shocked at Ioannidis’ findings, other researchers set about trying to size the problem and to develop solutions. In 2011, for instance, social psychologist Brian Nosek launched the Reproducibility Project, whose first assignment consisted of a collaboration of 270 contributing authors who sought to repeat 100 published experimental and correlational psychological studies. Their conclusion: only 36.1% of the studies could be replicated, and where they did replicate their effects were smaller than the initial studies’ effects, seemingly confirming Ioannidis’ findings.

The Reproducibility Project has subsequently moved on to examine the situation in cancer biology (with similar initial results). Meanwhile, a survey undertaken by Nature last year would appear to confirm that there is a serious problem.

Whatever the cause and extent of the reproducibility crisis, Nosek’s work soon attracted the attention of John Arnold, a former Enron trader who has committed a large chunk of his personal fortune to funding those working to – as Wired puts it – “fix science”. In 2013, Arnold awarded Nosek a $5.25 million grant to allow him and colleague Jeffrey Spies to found the Center for Open Science (COS).

COS is a non-profit organisation based in Charlottesville, Virginia. Its mission is to “increase openness, integrity, and reproducibility of scientific research”. To this end, it has developed a set of tools that enable researchers to make their work open and transparent throughout the research cycle. So they can register their initial hypotheses, maintain a public log of all the experiments they run, and the methods and workflows they use, and then post their data online. And the whole process can be made open for all to review.

Monday, February 20, 2017

Copyright: the immoveable barrier that open access advocates underestimated

In calling for research papers to be made freely available, open access advocates promised that doing so would lead to a simpler, less costly, more democratic, and more effective scholarly communication system.

To achieve their objectives they proposed two different ways of providing open access: green OA (self-archiving) and gold OA (open access publishing).

However, while the OA movement has succeeded in persuading research institutions and funders of the merits of open access, it has failed to win the hearts and minds of most researchers. 

More importantly, it is not achieving its objectives. There are various reasons for this, but above all it is because OA advocates underestimated the extent to which copyright would subvert their cause. That is the argument I make in the text I link to below, and I include a personal case study that demonstrates the kind of problems copyright poses for open access.

I also argue that in underestimating the extent to which copyright would be a barrier to their objectives, OA advocates have enabled legacy publishers to appropriate the movement for their own benefit, rather than for the benefit of the research community, and to pervert both the practice and the concept of open access.

As usual, it is a long document and I have published it in a pdf file that can be accessed here.

I have inserted a link to the case study at the top for those who might wish only to read that.

For those who prefer paper, a print version is available here.

Friday, January 20, 2017

The NIH Public Access Policy: A triumph of green open access?

There has always been a contradiction at the heart of the open access movement. Let me explain.

The Budapest Open Access Initiative (BOAI) defined open access as being the:

“free availability [of research papers] on the public internet, permitting any users to read, download, copy, distribute, print, search, or link to the full texts of these articles, crawl them for indexing, pass them as data to software, or use them for any other lawful purpose, without financial, legal, or technical barriers other than those inseparable from gaining access to the internet itself. The only constraint on reproduction and distribution, and the only role for copyright in this domain, should be to give authors control over the integrity of their work and the right to be properly acknowledged and cited.”

BOAI then proceeded to outline two strategies for achieving open access: (I) Self-archiving; (II) a new generation of open-access journals. These two strategies later became known, respectively, as green OA and gold OA.

At the time of the BOAI meeting the Creative Commons licences had not been released. When they were, OA advocates began to insist that to meet the BOAI definition, research papers had to have a CC BY licence attached, thereby signalling to the world that anyone was free to share, adapt and reuse the work for any purpose, even commercially.

For OA purists, therefore, a research paper can only be described as open access if it has a CC BY licence attached.

The problem here, of course, is that the vast majority of papers deposited in repositories cannot be made available on a CC BY basis, because green OA assumes authors continue to publish in subscription journals and then self-archive a copy of their work in an open repository.

Since publishing in a subscription journal requires assigning copyright (or exclusive publishing rights) to a publisher, and few (if any) subscription publishers will allow papers that are earning them subscription revenues to be made available with a CC BY licence attached, we can see the contradiction built into the open access movement. Quite simply, green OA cannot meet the definition of open access prescribed by BOAI.

To see how this works in practice, let’s consider the National Institutes of Health (NIH) Public Access Policy. This is described on Wikipedia as an “open access mandate”, and by Nature as a green OA policy, since it requires that all papers published as a result of NIH funding be made freely available in the NIH repository PubMed Central (PMC) within 12 months of publication. In fact, the NIH policy is viewed as the premier green OA policy.

But how many of the papers being deposited in PMC in order to comply with the Policy have a CC BY licence attached and so are, strictly speaking, open access?

There are currently 4.2 million articles in PMC. Of these around 1.5 million consist of pre-2000 historical content being deposited as part of the NIH’s scanning projects. Some of these papers are still under copyright, some are in the public domain, and some are available CC BY-NC. However, since this is historical material pre-dating both the open access movement and the NIH Policy let’s put it aside.

That leaves us with around 2.7 million papers in PMC that have been published since 2000. Today around 24% of these papers have a CC BY licence attached. In other words, some 76% of the post-2000 papers in PMC are not open access as defined by BOAI.
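For readers who want to check the arithmetic, here is a rough sketch in Python that uses only the rounded figures quoted above. The variable names are my own, and the absolute counts it derives are approximations implied by those rounded figures, not numbers supplied by PMC.

```python
# Rounded figures quoted in the post, in thousands of articles.
TOTAL_IN_PMC = 4200        # "4.2 million articles in PMC"
PRE_2000_HISTORICAL = 1500 # "around 1.5 million" scanned pre-2000 papers
CC_BY_PERCENT = 24         # "around 24%" of post-2000 papers are CC BY

# Set the historical content aside, as the post does.
post_2000 = TOTAL_IN_PMC - PRE_2000_HISTORICAL

# Approximate split of post-2000 papers by licence (integer arithmetic).
cc_by = post_2000 * CC_BY_PERCENT // 100
not_cc_by = post_2000 - cc_by
not_cc_by_percent = 100 - CC_BY_PERCENT

print(f"Post-2000 papers: about {post_2000} thousand")
print(f"With CC BY: about {cc_by} thousand ({CC_BY_PERCENT}%)")
print(f"Without CC BY: about {not_cc_by} thousand ({not_cc_by_percent}%)")
```

On these assumptions, roughly 650,000 post-2000 papers carry a CC BY licence and roughly two million do not.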

The good news is that the percentage with a CC BY licence is growing, and the table below (kindly put together for me by PMC) shows this growth. In 2008, just 8% of the papers in PMC had a CC BY licence attached. Since then the percentage has grown to 12% in 2010, 14% in 2012, 19% in 2014 and, as noted, it stands at 24% today. 

So, although the majority of papers in PMC today are not strictly speaking open access, the percentage that are is growing over time. Is this a triumph of green OA? Let’s consider.

There are two submission routes to PMC. Where there is an agreement between NIH and a publisher, research papers can be input directly into PMC by that publisher. Authors, and publishers with no PMC agreement, have to use the NIH Manuscript Submission System (NIHMS, overview here).

The table above shows that the number of “author manuscripts” that came via the NIHMS route represents just 19% of the content in PMC. And since some publishers do not have an agreement with PMC, the number that will have been self-archived by authors will be that much lower. So the overwhelming majority of papers being uploaded to PMC are being uploaded not by authors, but by publishers, and it seems safe to assume that those papers with a CC BY licence attached (currently 24% of the total) will have been published as gold OA rather than under the subscription model.

We could also note that just 0.06% of the papers in PMC today that were deposited via the NIHMS have a CC BY licence attached, and we can assume that these were submitted by gold publishers that do not have an agreement allowing for direct deposit, rather than by authors. 

In short, it would seem that the growth in CC BY papers in PMC is a function of the growth of gold OA, not green OA. As such, we might want to conclude that the success of PMC is a triumph of gold OA rather than of green OA.

Does this matter? The answer will probably depend on one’s views of the merits of article-processing charges, which I think it safe to assume most of the papers in PMC with a CC BY licence will have incurred.

Either way, that today 76% of the content in PMC – the world’s premier open repository – still cannot meet the BOAI definition of open access suggests that the OA movement still has a way to go.