Unpaywall improvements: more gold, better green

We recently announced that we’d completely rewritten Unpaywall to make it faster, more accurate, and (most importantly) easier to fix and improve. We wanted to move Unpaywall from product to process, something we could continuously improve along with the community.

Well, we’ve been working hard on that over the last few months and here’s an update!

Better Gold coverage

By far the most common OA color is gold. In fact, based on our manual sampling, 25% of Crossref DOIs are gold OA, which is much higher than I’d expected and much higher than it used to be. (Note: in this and all following stats we exclude component DOIs, which aren’t indexed in Unpaywall.)

Coverage of gold is very tricky, because it’s all about the status of the work’s source, not the work itself. So we need very comprehensive coverage of sources, which is as hard as it sounds.

Of course there’s DOAJ, which is fantastic, but it only covers a small subset of gold OA journals. And even for those journals, DOAJ often only tells us that a given journal has been fully OA since a certain date, so we still need to figure out whether the back catalog is open.

In recent weeks, we’ve finished several projects to add the “this is gold OA” flag to new journals:

  • We crawled 50k OJS journals, adding gold status to 17,000 of them (many thanks to Juan Pablo Alperin and Diego Chavarro for their help in getting a list of OJS journals!)
  • We marked 1,200 new journals gold using data from J-STAGE.
  • We marked 100 new journals gold using data from SciELO.
  • We added gold status to several dozen journals from fully-OA publishers including MDPI, Academic Journals, and Edorium.

We also modified our algorithm to assign gold instead of bronze when we know an article is OA, but we can’t figure out its source. Since gold is 2.5x more common than bronze, this will result in fewer errors overall.
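
To see why defaulting to gold should produce fewer errors, here is a rough back-of-the-envelope sketch. It rests on an assumption made purely for illustration: that OA articles with an unknown source split between gold and bronze at the same 2.5:1 ratio as everything else. The function name and parameter are made up for this example.

```python
def error_rate_if_guessing(guess: str, gold_to_bronze_ratio: float = 2.5) -> float:
    """Fraction of unknown-source OA articles mislabeled if we always assign `guess`.

    Assumes (for illustration only) that the true colors of these articles
    split between gold and bronze at `gold_to_bronze_ratio` : 1.
    """
    gold_share = gold_to_bronze_ratio / (gold_to_bronze_ratio + 1)
    bronze_share = 1 - gold_share
    if guess == "gold":
        return bronze_share  # every truly-bronze article is mislabeled
    if guess == "bronze":
        return gold_share  # every truly-gold article is mislabeled
    raise ValueError("guess must be 'gold' or 'bronze'")


print(f"always gold:   {error_rate_if_guessing('gold'):.1%} wrong")
print(f"always bronze: {error_rate_if_guessing('bronze'):.1%} wrong")
```

Under that assumption, always guessing gold is wrong for roughly 29% of these articles, while always guessing bronze is wrong for roughly 71%.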

Overall, this has made a big change in our gold coverage: now 19% of Unpaywall is gold, compared to 14% in May.

Green OA

We’ve made several changes in our green OA approach. These have not increased our total green percentage, but they have made our assignment of colors more consistent.

The rule for green has always been that if the best OA location is in a repository, it’s green. But, like gold, this is very dependent on us correctly describing the source as a repository. We’re very good at this for institutional repositories, but we’ve not been so good for preprint and data repositories, which are both much more common today than they were when we started Unpaywall.
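
As a minimal sketch of that rule (not our production code), the check below treats an article as green when its best OA location sits in any kind of repository. The record shape, the `source_type` key, and the type labels are all hypothetical names invented for this example.

```python
from typing import Optional

# Hypothetical source types. The point is that preprint and data repositories
# must be recognised as repositories, not only institutional ones.
REPOSITORY_TYPES = {
    "institutional_repository",
    "preprint_repository",
    "data_repository",
}


def is_green(best_oa_location: Optional[dict]) -> bool:
    """Return True when the best OA location is hosted by a repository.

    `best_oa_location` is a simplified, made-up record with a `source_type`
    key; None means no OA copy was found at all.
    """
    if best_oa_location is None:
        return False
    return best_oa_location.get("source_type") in REPOSITORY_TYPES


print(is_green({"source_type": "preprint_repository"}))
print(is_green({"source_type": "journal"}))
print(is_green(None))
```

The rule itself is a one-liner; the hard part is keeping the list of sources that count as repositories complete and correct.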

Other changes

We fixed a bug causing us to list works published under the Elsevier User License as Hybrid. Since we don’t consider that to be an OA license, we moved these to bronze.
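
The logic behind that fix can be sketched roughly as follows. This is a simplified illustration, assuming the article is free to read on the publisher’s site and the journal is not fully OA; the license labels and the allow-list are made-up examples, not the actual list we use.

```python
from typing import Optional

# Illustrative allow-list of licenses treated as open. Anything not listed,
# including publisher-specific "user licenses", does not count as an OA license.
OA_LICENSES = {"cc0", "cc-by", "cc-by-sa", "cc-by-nc", "cc-by-nc-nd"}


def publisher_hosted_status(license_name: Optional[str]) -> str:
    """Classify a free-to-read, publisher-hosted article in a non-gold journal."""
    if license_name in OA_LICENSES:
        return "hybrid"  # open license on the publisher's site
    return "bronze"  # free to read, but no recognised open license


print(publisher_hosted_status("cc-by"))
print(publisher_hosted_status("elsevier-user-license"))  # hypothetical label
print(publisher_hosted_status(None))
```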

We marked SSRN as an open repository…it’s on the bubble but since all works are available free right away, for us it counts.

Results

The “ground truth” dataset is a random sample of 500 DOIs from Crossref. It excludes component DOIs and DOIs that don’t resolve. Each DOI is manually annotated by our team, which often includes doing lots of research on the journals and repositories that host the content. The definitions of oa_status colors come from here, which is in turn based on the original 2018 Unpaywall paper in PeerJ.

As you can see, we’re moving in the correct direction when it comes to gold and hybrid, green isn’t changing, and bronze coverage is going backwards a bit, although it’s still pretty close to the ground truth number. Our roadmap will prioritize green and gold for the next few months at least.

The future

The most important change for Unpaywall moving forward is the upcoming rewrite of OpenAlex, which will be gradually rolled out October-November of this year. That’s because when this rewrite is deployed, OpenAlex and Unpaywall will finally share the exact same codebase. Of course this will eliminate those pesky, embarrassing bugs where Unpaywall and OpenAlex disagree. But more importantly, it’ll link the large Unpaywall and OpenAlex communities, allowing everyone to improve both products together.

Even before that, though, we’ll be unveiling another exciting change: a new and improved curation portal. This will make it easier to fix article-level bugs in Unpaywall, including bugs that the current curation solution doesn’t address (like missing PDF URLs and incorrect licenses). Even cooler, though, it’ll allow users to fix source-level bugs, particularly fixing journals that should be marked gold, but aren’t. Although someday AI might let us automate this, for now, we think that active community curation is the only viable way to keep that data accurate and up to date. The unification of OpenAlex and Unpaywall codebases means that all these changes will propagate to both systems within days.

Ok, that’s all for now! Thanks for your support and as always, please get in touch with any suggestions or feedback!

Thankful for Repositories and OA advocates

It’s American Thanksgiving this week, and we sure are thankful. We’re thankful for so many people and what they do — those who fight for open data, those who release their software and photos openly, the folks who ask and answer Stack Overflow questions, the amazing people behind the Crossref API…. the list is long and rich.

But today I want to shout out a special big thank you to OA advocates and the people behind repositories. Without your early and continued work, it wouldn’t be true that half of all views to scholarly articles are to an article that has an OA copy somewhere, and, even better, that this number is projected to grow to 70% of views five years from now. That changes the game, both for researchers and the public who are looking for papers, and for the whole scholarly communication system as we think about how to pay for publishing in the years ahead in ways that make it more efficient and equitable.

I gave the closing keynote at Open Repositories 2019 this year, and my talk highlighted how the success of Unpaywall is really the success of all of you — and how we are set for institutional repositories to be even more impactful in the years ahead. It’s online here if you want to see it. We mean it.

Thank you.

Green OA lag

Ok I know for maximum impact we should probably spread all these blog posts out over multiple days, but I’m way too eager to share — I think people interested in Green OA will be really interested in this, I know I am.

It’s from the supplementary information section of the preprint, Section 11.1:

In the figure below we plot the number of Green OA papers made available each year vs their date of publication. The first plot is a histogram of number of papers made available each year (one row for each year).

The next plot is the same, but superimposes the articles made available in previous years. This stacked area represents the total cumulative number of Green OA papers that are available in that year — if you were in that year and wondering what was available as Green OA that’s what you’d find.

The third plot is a larger version of the availability as of 2018, showing the accumulation of availability. It allows us to appreciate that less than half of papers published in, say, 2015, were made available the same year; most of the papers have been made available in subsequent years. The fourth plot is a slice in isolation, for clarity: the Green OA for articles with a Publication Date of 2015.

Again, this last plot is when articles that were published in 2015 were actually made available in repositories. As you can see at the bottom of the stacked bar, a very few articles that were published in 2015 were actually posted in a repository in 2014. Those are preprints. A lot of articles published in 2015 appeared in a repository in 2015, but even more had a delay and didn’t appear in a repository until 2016. A full 40% of articles had an OA lag of more than a year, including some with an OA lag of four years!
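
For readers who want to play with the idea, OA lag is just the year an article appeared in a repository minus the year it was published. Here is a minimal sketch that tabulates the share of articles at each lag. The records are synthetic and chosen only to exercise the code, so they do not reproduce the numbers in the plots.

```python
from collections import Counter


def lag_distribution(records):
    """Share of articles at each OA lag, in years.

    `records` is an iterable of (publication_year, repository_year) pairs.
    A negative lag means the repository copy appeared before publication,
    which is what a preprint looks like.
    """
    lags = Counter(deposited - published for published, deposited in records)
    total = sum(lags.values())
    return {lag: count / total for lag, count in sorted(lags.items())}


# Synthetic (publication_year, repository_year) pairs, for illustration only.
records = [(2015, 2014)] + [(2015, 2015)] * 4 + [(2015, 2016)] * 5
dist = lag_distribution(records)
print(dist)
```

With these ten made-up records, one article has a lag of -1 (a preprint), four have a lag of 0, and five have a lag of 1.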

More details on data collection are in the paper — just wanted to dig this out of Supplementary Information so that fellow nerds who’d enjoy this data don’t miss it 🙂

The Future of OA: what did we find?

Here are some of the key findings from the recent preprint on the Future of OA:

  • By 2025 we predict that 70% of all article views will be to articles available as OA — only 30% of article view attempts will be to content available only via subscription.
    • This compares to 52% of views available as OA right now, so it’ll be a big change in the next five years.
  • The numbers of Green, Gold, and Hybrid articles have been growing exponentially, and growing faster than Delayed OA or Closed access articles:
    • articles by year of observation, with exponential best fit line:
  • The average Green, Gold, and Hybrid paper receives more views than its Closed or Bronze counterpart, particularly Green papers made available within a year of publication.
    • views per article, by age of article:
  • Most Green OA articles become OA within their first two years of publication, but there is a long tail.
    • articles made newly Green OA in each of the last four years, histograms by year of publication:
  • One interesting realization from the modeling we’ve done is that when the proportion of papers that are OA increases, or when the OA lag decreases, the total number of views increases: the scholarly literature becomes more heavily viewed and thus more valuable to society. This is intuitive, but could be explored quantitatively in future work using this model or ones like it.
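
The exponential best fit mentioned in the list above can be approximated in a few lines. This is one common approach (ordinary least squares on the logarithm of the counts), offered as a sketch; it is not necessarily the exact procedure used in the preprint, and the counts below are synthetic.

```python
import math


def fit_exponential(years, counts):
    """Fit counts ~ a * exp(b * (year - years[0])) by log-linear least squares.

    Returns (a, b). All counts must be positive.
    """
    xs = [year - years[0] for year in years]
    ys = [math.log(count) for count in counts]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = math.exp(mean_y - b * mean_x)
    return a, b


# Synthetic counts growing 20% per year, for illustration only.
years = [2014, 2015, 2016, 2017, 2018]
counts = [100 * 1.2 ** i for i in range(len(years))]
a, b = fit_exponential(years, counts)
print(f"estimated annual growth: {math.exp(b) - 1:.0%}")
```

Fitting on the log scale keeps the arithmetic simple, at the cost of weighting small early counts as heavily as large recent ones.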

Anyway, there are more findings too, but those are some of the main ones.

New perspective for OA: Date of Observation

We’d like to share one of the fun parts of our recent preprint. It’s fun because the concept of Date of Observation helps to untangle issues around embargoes — and also because we think we came up with a neat way to explain what is otherwise a fairly complicated concept, and hopefully make it accessible to everybody.

See what you think — here is our description of the Date of Observation, from section 3.3 of the preprint:

Let’s imagine two observers, Alice (blue) and Bob (red), shown by the two stick figures at the top of the figure:

Alice lives at the end of Year 1–that’s her “Date Of Observation.” Looking down, she can see all 8 articles (represented by solid colored dots) published in Year 1, along with their access status: Gold OA, Green OA, or Closed. The Year of Publication for all eight of these articles is Year 1.

Alice likes reading articles, so she decides to read all eight Year 1 articles, one by one.

She starts with Article A. This article started its life early in the year as Closed. Later that year, though–after an OA Lag of about six months–Article A became Green OA as its author deposited a manuscript (the green circle) in their institutional repository. Now, at Alice’s Date of Observation, it’s open! Excellent. Since Alice is inclined toward organization, she puts Article A in a stack of Green articles she’s keeping below.

Now let’s look at Bob. Bob lives in Alice’s future, in Year 3 (i.e., his “Date of Observation” is Year 3). Like Alice, he’s happy to discover that Article A is open. He puts it in his stack of Green OA articles, which he’s further organized by date of their publication (it goes in the Year 1 stack).

Next, Alice and Bob come to Article B, which is a tricky one. Alice is sad: she can’t read the article, and places it in her Closed stack. Unbeknownst to poor Alice, she is a victim of OA Lag, since Article B will become OA in Year 2. By contrast, Bob, from his comfortable perch in the future, is able to read the article. He places it in his Green Year 1 stack. He now has two articles in this stack, since he’s found two Green OA articles in Year 1.

Finally, Alice and Bob both find Article C is closed, and place it in the closed stack for Year 1. We can model this behavior for a hypothetical reader at each year of observation, giving us their view on the world–and that’s exactly the approach we take in this paper.
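
If code is easier to follow than stick figures, here is a small sketch of Alice’s and Bob’s sorting for Articles A, B, and C. It only models Green and Closed, and the class and field names are invented for this example rather than taken from the paper’s code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Article:
    name: str
    published: int  # Year of Publication
    green_since: Optional[int]  # year a repository copy appeared; None if never


def status_at(article: Article, observation_year: int) -> Optional[str]:
    """Status of `article` as seen by a reader living in `observation_year`."""
    if observation_year < article.published:
        return None  # not published yet, so this reader cannot see it at all
    if article.green_since is not None and article.green_since <= observation_year:
        return "green"
    return "closed"


articles = [
    Article("A", published=1, green_since=1),  # OA Lag of about six months
    Article("B", published=1, green_since=2),  # becomes OA in Year 2
    Article("C", published=1, green_since=None),  # stays closed
]

for observer, year in [("Alice", 1), ("Bob", 3)]:
    stacks = {a.name: status_at(a, year) for a in articles}
    print(observer, stacks)
```

Alice, observing in Year 1, files A as green and B and C as closed. Bob, observing in Year 3, files both A and B as green and C as closed, which matches the story above.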

Now, let’s say that Bob has decided he’s going to figure out what OA will look like in Year 4. He starts with Gold. This is easy, since Gold articles are open immediately upon publication, and publication date is easy to find from article metadata. So, he figures out how many articles were Gold for Alice (1), how many in Year 2 (3), and how many in his own Year 3 (6). Then he computes percentages, and graphs them out using the stacked area chart at the bottom of the figure. From there, it’s easy to extrapolate forward a year.

For Green, he does the same thing–but he makes sure to account for OA Lag. Bob is trying to draw a picture of the world every year, as it appeared to the denizens of that world. He wants Alice’s world as it appeared to Alice, and the same for Year 2, and so on. So he includes OA Lag in his calculations for Green OA, in addition to publication year. Once he has a good picture from each Date Of Observation, and a good understanding of what the OA Lag looks like, he can once again extrapolate to find Year 4 numbers.

Bob is using the same approach we will use in this paper–although in practice, we will find it to be rather more complex, due to varying lengths of OA Lag, additional colors of OA, and a lack of stick figures.

The Future of OA: A large-scale analysis projecting Open Access publication and readership

We are excited to announce our most recent study has just been posted on bioRxiv:

Piwowar, Priem, Orr (2019) The Future of OA: A large-scale analysis projecting Open Access publication and readership. bioRxiv: https://doi.org/10.1101/795310

This is the largest, most comprehensive analysis ever to predict the future of Open Access. Importantly, we look not only at publication trends but also at *viewership* — what do people want to read, and how much of it is OA?

The abstract is included below, and we’ll be highlighting a few of the cool findings in subsequent blog posts. You can read the full paper here (DOI not resolving yet). All the raw data and code are available, as is our style: http://doi.org/10.5281/zenodo.3474007. Enjoy, and let us know what you think!


Understanding the growth of open access (OA) is important for deciding funder policy, subscription allocation, and infrastructure planning.

This study analyses the number of papers available as OA over time. The model includes both OA embargo data and the relative growth rates of different OA types over time, based on the OA status of 70 million journal articles published between 1950 and 2019.

The study also looks at article usage data, analyzing the proportion of views to OA articles vs views to articles which are closed access. Signal processing techniques are used to model how these viewership patterns change over time. Viewership data is based on 2.8 million uses of the Unpaywall browser extension in July 2019.

We found that Green, Gold, and Hybrid papers receive more views than their Closed or Bronze counterparts, particularly Green papers made available within a year of publication. We also found that the proportion of Green, Gold, and Hybrid articles is growing most quickly.

In 2019:

  • 31% of all journal articles are available as OA
  • 52% of all article views are to OA articles

Given existing trends, we estimate that by 2025:

  • 44% of all journal articles will be available as OA
  • 70% of all article views will be to OA articles

The declining relevance of closed access articles is likely to change the landscape of scholarly communication in the years to come.


Podcast episode about Unpaywall

I recently had a fun conversation with @ORION_opensci for their just-launched podcast.

The episode is about half an hour long, and covers what @Unpaywall is, who uses it, how it came about, a bit about how it works, thoughts on the importance of #openinfrastructure, the sustainability model, how open jibes with getting money from Elsevier, #PlanS, how to help the #openscience revolution…

Anyway, here’s where you can listen (you can either load it into your Podcast app, or just press “play” on the webpage player):

https://orionopenscience.podbean.com/e/scaling-the-paywall-how-unpaywall-improved-open-access/

(Or here’s the MP3.)

Thanks for having me @OOSP_ORIONPod, it was super fun! And do check out the rest of the episodes as well; they are covering great topics.

Unpaywall extension adds 200,000th active user

We’re thrilled to announce that we’re now supporting over 200,000 active users of the Unpaywall extension for Chrome and Firefox!

The extension, which debuted nearly two years ago, helps users find legal, open access copies of paywalled scholarly articles. Since its release, the extension has been used more than 45 million times, finding an open access copy in about half of those. We’ve also been featured in The Chronicle of Higher Ed, TechCrunch, Lifehacker, Boing Boing, and Nature (twice).

However, although the extension gets the press, the database powering the extension is the real star. There are millions of people using the Unpaywall database every day:

  • We deliver nearly one million OA papers every day to users worldwide via our open API…that’s 10 papers every second!
  • Over 1,600 academic libraries use our SFX integration to automatically find and deliver OA copies of articles when they have no subscription access.
  • If you’re using an academic discovery tool, it probably includes Unpaywall data…we’re integrated into Web of Science, Europe PubMed Central, WorldCat, Scopus, Dimensions, and many others.
  • Our data is used to inform and monitor OA policy at organizations like the US NIH, UK Research and Innovation, the Swiss National Science Foundation, the Wellcome Trust, the European Open Science Monitor, and many others.

The Unpaywall database gets information from over 50,000 academic journals and 5,000 scholarly repositories and archives, tracking OA status for more than 100 million articles. You can access this data for free using our open API, or use our free web-based query tool. Or if you prefer, you can just download the whole database for free.
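
As a rough sketch of what an API lookup might look like, the snippet below builds a request URL for a single DOI and pulls a few fields out of the JSON response. The helper names are made up for this example, and the endpoint shape and field names (`is_oa`, `oa_status`, `best_oa_location`) reflect the v2 API as we understand it, so please check the current API documentation before relying on them.

```python
import json
import urllib.parse
import urllib.request

API_ROOT = "https://api.unpaywall.org/v2/"


def build_url(doi: str, email: str) -> str:
    """Build the lookup URL for one DOI; callers identify themselves by email."""
    query = urllib.parse.urlencode({"email": email})
    return API_ROOT + urllib.parse.quote(doi, safe="/") + "?" + query


def summarize(record: dict) -> dict:
    """Pick out a few fields from a response, tolerating missing keys."""
    best = record.get("best_oa_location") or {}
    return {
        "is_oa": record.get("is_oa", False),
        "oa_status": record.get("oa_status"),
        "best_url": best.get("url"),
    }


def lookup(doi: str, email: str) -> dict:
    """Fetch one DOI from the API and summarize it (makes a network request)."""
    with urllib.request.urlopen(build_url(doi, email), timeout=30) as response:
        return summarize(json.load(response))


print(build_url("10.1101/795310", "you@example.com"))
```

Calling `lookup("10.1101/795310", "you@example.com")` with your own email address would perform the actual request. For anything beyond occasional lookups, the database download is likely the better fit.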

Unpaywall is supported via subscriptions to the Unpaywall Data Feed, a high-throughput pipeline providing weekly updates to our free database dump. Thanks to Data Feed subscribers, Unpaywall is completely self-sustaining and uses no grant funding. That makes us real optimistic about our ability to stick around and provide open infrastructure for lots of other cool projects.

Thanks to everyone who has supported this project, and even more, thanks to everyone who has fought for open access. Without y’all, Unpaywall wouldn’t matter. With you: we’re changing the world. Together. Next stop 300k!

Elsevier becomes newest customer of Unpaywall Data Feed

We’re pleased to announce that Elsevier has become the newest customer of Impactstory’s Unpaywall Data Feed, which provides a weekly feed of changes in Unpaywall, our open database of 20 million open access articles. Elsevier will use the Unpaywall database to make open access content easier to find on Scopus.

Elsevier joins Clarivate Analytics, Digital Science, Zotero, and many other organizations as paying subscribers to the Data Feed.  Paying subscribers provide sustainability for Unpaywall, and fund the many free ways to access Unpaywall data, including complete database snapshots as well as our open API, Simple Query Tool, and browser extension. We’re proud that thousands of academic libraries and other institutions, as well as over 150,000 individual extension users, are using these free tools.

Impactstory’s mission is to help all people access all research products. Adding Elsevier as a Data Feed customer helps us further that mission. Specifically, the new agreement injects OA from our index into the workflows of the many Scopus users worldwide, helping them find and use open research they may never have seen before. So, we’re happy to welcome Elsevier as our latest Data Feed customer.

We’re building a search engine for academic literature–for everyone


Huzzah! Today we’re announcing an $850k grant from the Arcadia Fund to build a new way for folks to find, read, and understand the scholarly literature.

Wait, another search engine? Really?

Yep. But this one’s a little different: there are already a lot of ways for academic researchers to find academic literature…we’re building one for everyone else.

We’re aiming to meet the information needs of citizen scientists, patients, K-12 teachers, medical practitioners, social workers, community college students, policy makers, and millions more. What they all have in common: they’re folks who’d benefit from access to the scholarly record, but they’ve historically been locked out. They’ve had no access to the content or the context of the scholarly conversation.

Problem: it’s hard to access content

Traditionally, the scholarly literature was paywalled, cutting off access to the content. The Open Access movement is on the way to solving this: Half of new articles are now free to read somewhere, and that number is growing. The catch is that there are more than 50,000 different “somewheres” on web servers around the world, so we need a central index to find it. No one’s done a good job of this yet (Google Scholar gets close, but it’s aimed at specialists, not regular people. It’s also 100% proprietary, closed-source, closed-data, and subject to disappearing at Google’s whim.)

Problem: it’s hard to access context

Context is the stuff that makes an article understandable for a specialist, but gobbledegook to the rest of us. So that includes everything from field-specific jargon, to strategies for how to skim to the key findings, to knowledge of core concepts like p-values. Specialists have access to context. Regular folks don’t. This makes reading the scholarly literature like reading Shakespeare without notes: you get glimmers of beauty, but without some help it’s mostly just frustrating.

Solution: easy access to the content and context of research literature.

Our plan: provide access to both content and context, for free, in one place. To do that, we’re going to bring together an open database of OA papers with a suite of AI-powered support tools we’re calling an Explanation Engine.

We’ve already finished the database of OA papers. So that’s good. With the free Unpaywall database, we’ve now got 20 million OA articles from 50k sources, built on open source, available as open data, and with a working nonprofit sustainability model.

We’re building the “AI-powered support tools” now. What kind of tools? Well, let’s go back to the Hamlet example…today, publishers solve the context problem for readers of Shakespeare by adding notes to the text that define and explain difficult words and phrases. We’re gonna do the same thing for 20 million scholarly articles. And that’s just the start…we’re also working on concept maps, automated plain-language translations (think automatic Simple Wikipedia), structured abstracts, topic guides, and more. Thanks to recent progress in AI, all this can be automated, so we can do it at scale. That’s new. And it’s big.

The payoff

When Microsoft launched Altair BASIC for the new “personal computers,” there were already plenty of programming environments for experts. But here was one accessible to everyone else. That was new. And ultimately it launched the PC revolution, bringing computing into the lives of regular folks. We think it’s time that same kind of movement happened in the world of knowledge.

From a business perspective, you might call this a blue ocean strategy. From a social perspective (ours), this is a chance to finally cash the cheques written by the Open Access movement. It’s a chance to truly open up access to the frontiers of human knowledge to all humans.

If that sounds like your jam, we’d love your support: tell your friends, sign up for early access, and follow us for updates. It’s gonna be quite an adventure.

Here’s the press release.