Major Update to Unpaywall Database

We recently announced major changes to Unpaywall on our Unpaywall Google Group (https://groups.google.com/g/unpaywall) and via email to Unpaywall Premium Subscribers. A lot of folks aren’t on the group, so we’re announcing them here as well.


TL;DR
Unpaywall has migrated to a new codebase that helps us address data quality issues faster, and you may notice some changes.

  • The API is way faster → 10× faster API responses (avg 500 ms → 50 ms).
  • Some data has changed → About 23% of works saw data change, with about 10% seeing changes in oa_status (green, gold, etc.) and 5% in is_oa (closed or open).
  • Overall accuracy is similar → Overall, precision remains constant. We have better recall of some Gold articles and worse detection of some Green articles.
  • Tiny schema changes → Your scripts, API calls, and data feeds keep running, but two fields are now deprecated (oa_locations.evidence and oa_locations.updated).
  • Community curation → Users can now report and fix errors at unpaywall.org/fix.
  • Action required only if you host the full dataset locally (details below).

Why rewrite a perfectly good tool?

A decade ago we developed Unpaywall to:

  1. make open access research in institutional repositories discoverable by users globally,
  2. track open access behaviours and generate evidence for effective open access policies, and 
  3. raise the bar for open infrastructures by ensuring that the industry standard for determining open access status was itself completely open.

We’re happy to report it has been very effective at achieving those goals: 

  • Our Chrome and Firefox extensions are used by 800k monthly active users around the world, 
  • Unpaywall sees an average of 200 API calls per second every second of the year, 
  • Unpaywall now underpins every major open access monitoring and tracking initiative globally, and 
  • Unpaywall has demonstrated an effective model for operating open research infrastructure. 

Over the years, Open Access has become increasingly important to researchers, institutions, funders, and publishers. And steady changes over the years brought us to a publishing system that looks different from the one we started in. At first, it was exceedingly rare for a publication that was open access to later become closed access. It was rare for publishers to make closed access works openly available for short times (like during COVID). And with the exception of embargo periods, it was rare for closed journals to later be made completely open.

All of these are common now, and at the scale of millions of publications. And publication landing pages aren’t just about providing the user with access to information; they also now collect information on users. As scholarly communication has evolved, it became clear that Unpaywall needed to evolve from a product into a process. Unfortunately, the code base that supported Unpaywall was struggling to adapt. With every change we introduced new bugs, and fixing each new bug kept creating more bugs. To continue delivering high-quality open access metadata in an efficient way, we needed to start from scratch.

We spent the last year completely re-writing the code base for Unpaywall to make it: 

  • faster; 
  • easier to fix when it breaks; and 
  • easier for users and publishers to curate.

On May 20, 2025, we launched the update. We have been working with our premium subscribers to implement the changes in their locally hosted databases that rely on Unpaywall. Most of our users switched to the new code base without even noticing, and that was intentional. Still, we think it is important for our users, especially those whose work depends on the Unpaywall database, to understand these changes.


What didn’t change

Stable as ever | Details
Data format & schema | All keys stay the same (only the fields oa_locations.evidence and oa_locations.updated are now marked “deprecated”).
API & data feed URLs | Zero downtime, same endpoints.
Aggregate metrics | 10% of records saw a change in oa_status (i.e., color) and 5% saw a change in is_oa (open access vs. closed). Some changes were improvements and some were degradations, but overall precision remains the same.
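
For scripts that read oa_locations, one way to stay compatible is to simply not depend on the two deprecated fields. The snippet below is a minimal sketch rather than official client code: the helper name summarize_record and the sample record (including its DOI and URL) are invented for illustration, and it assumes a record carries the usual doi, is_oa, oa_status, and oa_locations keys.

```python
def summarize_record(record):
    """Pull out the fields most scripts need, without touching deprecated ones."""
    locations = record.get("oa_locations") or []
    return {
        "doi": record.get("doi"),
        "is_oa": bool(record.get("is_oa")),
        "oa_status": record.get("oa_status"),
        # oa_locations.evidence and oa_locations.updated are deprecated,
        # so this helper deliberately never reads them.
        "urls": [loc["url"] for loc in locations if loc.get("url")],
    }


# Hypothetical record shaped like an Unpaywall response (all values invented).
sample = {
    "doi": "10.1234/example",
    "is_oa": True,
    "oa_status": "gold",
    "oa_locations": [
        {"url": "https://example.org/article.pdf", "evidence": "legacy value"},
        {"url": None},
    ],
}

summary = summarize_record(sample)
print(summary)
```

Because the helper only uses .get on optional keys, it keeps working whether or not the deprecated fields are present in a given record.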

What did change

Better than before | How it helps you
Speed | The average API call now returns in 50 ms, compared with 500 ms before: a 10× speedup! ⚡
Accuracy | We detect more Gold OA, licenses, fresh OA URLs, and works that were once open access but are now closed. We detect less Green OA (but we’ll be able to improve that soon).
Curation UI | Users around the world can submit fixes via a web form; they go live in days.
Bulk Curation | Publishers can now directly submit bulk changes to us when their journals change from closed to open (or vice versa); they go live within 2 weeks.
Bug-fix velocity | Cleaner code = faster bug fixes.

Do you need to do anything?

Your setup | Required action
API-only | Nothing. You’re already on the new code and likely didn’t even notice.
Data-feed mirror | Download our one-time “May 20 Snapshot” and overwrite your current database. There are too many small tweaks for a changefile.
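
If you mirror the dataset, you may want to gauge how much the new snapshot differs from your current copy before overwriting it. The following is a rough sketch under stated assumptions: it assumes each copy can be read as JSON Lines (one record per line) with doi, is_oa, and oa_status keys, and the function names load_status and change_rates, along with the tiny sample data, are hypothetical.

```python
import json


def load_status(lines):
    """Map DOI -> (is_oa, oa_status) from an iterable of JSON Lines strings."""
    status = {}
    for line in lines:
        if line.strip():
            record = json.loads(line)
            status[record["doi"]] = (record.get("is_oa"), record.get("oa_status"))
    return status


def change_rates(old, new):
    """Share of DOIs present in both copies whose is_oa or oa_status differs."""
    shared = old.keys() & new.keys()
    if not shared:
        return {"is_oa": 0.0, "oa_status": 0.0}
    is_oa_changed = sum(old[doi][0] != new[doi][0] for doi in shared)
    status_changed = sum(old[doi][1] != new[doi][1] for doi in shared)
    return {
        "is_oa": is_oa_changed / len(shared),
        "oa_status": status_changed / len(shared),
    }


# Invented sample data standing in for two copies of the dataset.
old_lines = [
    '{"doi": "10.1/a", "is_oa": false, "oa_status": "closed"}',
    '{"doi": "10.1/b", "is_oa": true, "oa_status": "green"}',
    '{"doi": "10.1/c", "is_oa": true, "oa_status": "gold"}',
    '{"doi": "10.1/d", "is_oa": false, "oa_status": "closed"}',
]
new_lines = [
    '{"doi": "10.1/a", "is_oa": true, "oa_status": "gold"}',
    '{"doi": "10.1/b", "is_oa": true, "oa_status": "bronze"}',
    '{"doi": "10.1/c", "is_oa": true, "oa_status": "gold"}',
    '{"doi": "10.1/d", "is_oa": false, "oa_status": "closed"}',
]

rates = change_rates(load_status(old_lines), load_status(new_lines))
print(rates)
```

For a real file, the same functions could be fed an open file handle instead of a list, since both are iterables of lines. Holding every DOI in memory will not scale to the full dataset, so treat this as a way to check a sample.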

Meet the new Curation Portal

We heard loud and clear from our users that they need to be able to fix open access metadata errors when they find them. And that’s why we developed a community curation pipeline for Unpaywall. 

Found a record that still looks off? Head to unpaywall.org/fix, flag the issue, and we’ll merge your correction shortly (typically within 3 business days). Your expertise powers continual data quality improvements. 

If you have ideas on how to improve the functionality of the curation user interface, please send them to brett@ourresearch.org.


Looking ahead

  • Community curation of Unpaywall will become increasingly important for overall database accuracy, and a fix made in Unpaywall will propagate to all downstream users (Web of Science, Scopus, Dimensions, and more).
  • We will collaborate more closely with publishers directly to make large-scale changes associated with journal policy changes more quickly and accurately.
  • We will continue refining specific parts of our pipelines to increase their overall reliability, including better detection of OA status, journal OA status, license information, and fulltext links.
  • Users will see faster patch cycles for reported issues.
  • We will increase repository coverage and enhance linkage between publisher and repository versions.
  • Later this summer, we’ll be launching a full re-write of OpenAlex to bring the databases into closer alignment where they overlap (i.e., OA status metadata for publications with Crossref DOIs).

Thank you

We heard loud and clear from our communities of users that timely fixes of data quality issues are critical for them to be able to rely on Unpaywall. And we know that our response times slipped while we tackled this rewrite. Thanks for sticking with us!

If you spot an error in the Unpaywall database that you would like to see fixed, the fastest way to do that is at unpaywall.org/fix. If you have other questions, send a note to support@unpaywall.org.

Here’s to a faster, cleaner, and ever-more-useful Unpaywall!

The OurResearch Team

OpenAlex: 2024 in Review

As 2024 comes to a close, we’re taking the opportunity to reflect on the year behind us. And what a year it has been for OpenAlex!

It’s hard to believe that it was only one year ago when we launched the Beta of our web interface and the first University, The Sorbonne, announced that they were replacing their proprietary database with OpenAlex. 

Since then the team has worked hard to meet the evolving needs of our communities of users. Below are some of the highlights of 2024.

Organization:

  • We received a 5-year grant from Arcadia totalling $7.5M to establish OpenAlex as a sustainably open index of the global research ecosystem
  • We received a 2-year grant from the Navigation Fund totaling $688k to enhance the OpenAlex user interface
  • We hired a Chief Operating Officer (Kyle Demes) and Senior Frontend Developer (Brett Lockspeiser)
  • Our Premium subscriptions exceeded our first year’s sustainability target by 25%

Data:

  • We started parsing fulltext PDFs to add more affiliation and reference metadata
  • We started matching references without DOIs (we now have 2.5B citations)
  • We added HAL as a primary source for new works
  • We started ingesting DataCite as a primary source. We now have 6.4M DataCite records (we’ll have them all in a few months)
  • We enhanced metadata accuracy for work type, publication year, author, institution, source, open access status, and more
  • Our data was adopted by three major university rankings
  • Our data was featured in a Science News article examining the sustainability of APC fees paid by researchers

Product:

  • We launched our Beta User Interface
  • We launched a new aboutness classification system (topics → subfields → fields → domains)
  • We launched new normalized citation metrics (field-weighted citation impact and citation percentiles) to facilitate comparison across fields and years.
  • We introduced user curation for affiliation, author, source, and work-level metadata and have already received more than 10k requests
  • We expanded our offerings of paid services to help us get to sustainability faster
  • We laid the foundation for an exciting new analytics product we’re looking forward to showing off early next year

Community (you):

  • Monthly users of OpenAlex.org have grown from 28k at the beginning of the year to 78k, now representing 440k visits per month!
  • Our first OpenAlex User Meeting was a huge success with 27 presentations from OpenAlex users in diverse organizations around the world
  • We attended 9 conferences to promote OpenAlex and engage with our user community globally (Research Analytics Summit, CARA, BRIC, ICSSI, Make Data Count, LIS, STI, SRAI, and The Charleston Conference) and were truly humbled to see presenters and vendors at every conference using OpenAlex data!
  • We launched a YouTube channel which now has 49 videos, 736 subscribers, and almost 25,000 views!
  • Over 500 publications mention or reference OpenAlex and that number grows daily!
  • We hosted an open call for our first Community Advisory Board where 50+ stellar nominees received almost 1,400 votes from the community. Stay tuned for an announcement of results in early 2025

None of this would have been possible without all of you. So thank you for your continued support, ideas, engagement, criticism, cheerleading, and collaboration. We’re looking forward to continuing to work together to build on these successes in 2025. Until then, Happy Holidays to you and yours.

Sincerely,

The OpenAlex Team

OurResearch receives $688k grant from Navigation Fund to enhance the OpenAlex User Interface

OurResearch is proud to announce a grant of $688,800 from The Navigation Fund to develop and launch an open, sustainable, web-based research intelligence (RI) module for the OpenAlex website. Our goal is to support expert finding, trend detection, and knowledge gap identification for researchers and research users. The RI module will serve as a map of the research landscape that’s easy to use for non-technical users, powerful for technical users, and supportive in helping all users increase their technical skills.

OpenAlex is the world’s first completely open and comprehensive index of the world’s research ecosystem. For the first time, everyone in the world has unrestricted access to a graph of the research ecosystem connecting hundreds of millions of scholarly outputs from thousands of fields across the globe to 100+ million authors from 100,000+ institutions. Hundreds of academic studies have already used OpenAlex to accelerate their research and to study science itself (link); universities and governments are adopting OpenAlex for their research intelligence needs, disrupting the established proprietary model (example); university rankings are switching to OpenAlex (example); and companies big and small around the world are using OpenAlex data to drive innovation previously not possible.

While we are thrilled by the early success of OpenAlex, we have noticed two significant barriers that are impeding more widespread adoption of OpenAlex: (1) many people (especially decision-makers) struggle to understand the promise of OpenAlex without seeing its potential first hand and (2) even when people can imagine how OpenAlex can benefit their work, most do not have the technical resources and capacity to create the desired insights from the openly available data. 

With this grant, we have just hired a Senior Frontend Developer, starting December 2, 2024, to design and iteratively release new UI features. The expected UI enhancements will lead to better and more timely research-based outcomes in both enterprise and academia, including more productive collaborations, faster investigation of promising research fronts, and quicker time-to-market for new discoveries. Stay tuned for exciting updates to the OpenAlex web interface in 2025!

— — — — 

OurResearch is a nonprofit that builds tools to help accelerate the transition to universal Open Science. Started at a hackathon in 2011, they remain committed to creating open, sustainable research infrastructure that solves real-world problems, like Unpaywall, Unsub, and OpenAlex.

The Navigation Fund is a 501(c)(3) nonprofit organization seeking to advance bold solutions to the world’s most urgent problems. 

Coverage in the Financial Times of OpenAlex and the Sorbonne

The Financial Times recently published an article detailing Sorbonne University’s “radical decision” to switch to OpenAlex for its publication database and bibliometric analytics. The article (behind a paywall, unfortunately 😞) came out a little while ago, but we wanted to highlight it here in case you missed it.

The news comes in the context of “a wider pushback against the current model in academic publishing, where researchers publish and review papers for free but have to buy expensive subscriptions to the journals in which they are published to analyse data relating to their work.” It includes a quote from OurResearch/OpenAlex co-founder and CEO Jason Priem: “We felt there’s a mismatch between the values of the academy and the shareholder boardroom. Research is fundamentally about sharing, while for-profits are fundamentally about capturing and enclosing. We aim to create and sustain research infrastructure that’s truly aligned with . . . the values of the research community.”

Exciting times for OpenAlex and open science!

Jack, Andrew. “Sorbonne’s Embrace of Free Research Platform Shakes up Academic Publishing.” Financial Times, December 27, 2023. https://www.ft.com/content/89098b25-78af-4539-ba24-c770cf9ec7c3.

Assigning Institutions — New England Journal of Medicine Case Study

The New England Journal of Medicine uses a non-standard format when presenting authors and their institutional affiliations, which is a problem when we want to keep track of these links in our data. We developed a custom algorithm to solve this problem, preserving more than a hundred thousand author-institution links.

Linking works, authors, and institutions

Part of a diagram from the OpenAlex docs, showing how authors and institutions are linked to works through authorships.
OpenAlex data has links between works, authors, and institutions.

Works, authors, and institutions are three of the basic entities in the OpenAlex data. Keeping track of the relationships between these entities is one of the core things we do. It’s important that we identify these links correctly, so they can be used for downstream tasks like university research intelligence, ranking, etc. Often, this information comes to us via structured data which is not difficult to ingest. Many times, however, the data is messy, and using it is not so straightforward.

Affiliation data in the New England Journal of Medicine

Publications from the New England Journal of Medicine (NEJM) are an example of this messiness. Author affiliations in these papers are presented in a format that is human-readable, but not straightforward for a computer to parse automatically. In most other journals, authors are listed alongside their affiliated institutions, and so it is relatively easy for a program to link them together. NEJM does it a different way—as shown in the screenshot of a paper from the journal’s website, institutions are listed together with the initials of the authors, which in turn correspond to the full author names at the top of the paper.

Screenshot of the affiliations of a paper from the New England Journal of Medicine's website.
Author affiliations in NEJM come in a nonstandard format that is not easy for a computer to parse.

We might hope that the structured metadata we get from Crossref would have the data in a more standard format. But alas, this isn’t the case, as shown in the screenshot of data from the Crossref API.

Screenshot of JSON data from the Crossref API
Data about the paper from the Crossref API is also in the nonstandard format.

There are around 170,000 works from this journal. This is a relatively tiny proportion of the total number of works in OpenAlex. However, NEJM is a highly influential journal in medicine, so it’s a priority that we get this right.

Custom OpenAlex solution to assign institutions to NEJM authors

OpenAlex team member Nolan created a bespoke algorithm specifically for NEJM papers to parse the affiliation strings and assign authors to institutions. This rule-based algorithm identifies the author initials that might correspond to the full names, and uses those as a mapping to get the link from institution to author, as shown in the screenshot from the OpenAlex API of the example paper from above. The full data for this work can be found at https://api.openalex.org/works/W4386208393.

We have been able to apply this to around 35,000 articles, amounting to 158,000 institutional affiliations. Additionally, we identified about ten thousand raw affiliation strings that we couldn’t match to an institution, but that can still prove useful to our users.
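
To make the idea of matching initials to full names concrete, here is a toy sketch of that kind of rule. It is not the algorithm OpenAlex runs in production: the function names, the regular expression, and the sample author list and affiliation string are all made up for illustration, and real NEJM strings have many more edge cases (hyphenated names, shared initials, nested parentheses).

```python
import re


def initials(full_name):
    """Turn 'Mary K. Jones' into 'M.K.J.' (first letter of each name part)."""
    return "".join(part[0].upper() + "." for part in full_name.split())


def link_authors(author_names, affiliation_text):
    """Return (links, unmatched); links pairs each author with an institution."""
    # Index authors by initials; shared initials keep every candidate name.
    by_initials = {}
    for name in author_names:
        by_initials.setdefault(initials(name), []).append(name)

    links, unmatched = [], []
    # Each match is institution text followed by a group like "(A.B., C.D.)".
    for institution, group in re.findall(r"([^()]+)\(([^()]+)\)", affiliation_text):
        institution = institution.strip(" ;,.")
        institution = re.sub(r"^(from the|from|and)\s+", "", institution,
                             flags=re.IGNORECASE)
        for token in group.split(","):
            token = token.strip()
            candidates = by_initials.get(token, [])
            if len(candidates) == 1:
                links.append((candidates[0], institution))
            else:
                # No author, or several authors, share these initials.
                unmatched.append((token, institution))
    return links, unmatched


# Invented example in roughly the style described above.
authors = ["Mary K. Jones", "Ann Lee", "Robert T. Smith"]
text = ("From the Department of Medicine, Example University, Boston "
        "(M.K.J., A.L.); and Sample Hospital, London (R.T.S.).")

links, unmatched = link_authors(authors, text)
for author, institution in links:
    print(author, "->", institution)
```

Initials that match no author, or more than one, are collected in unmatched instead of being guessed at, in the same spirit as keeping raw affiliation strings that could not be matched.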

The NEJM case is an example of the attention to data and extra effort that is part of the value that OpenAlex hopes to provide. The data can be messy sometimes. It’s our mission to help make sense of it, so the world can have access to high-quality, free and open data.

Screenshot of JSON data from the OpenAlex API
OpenAlex data has institutional affiliations as structured, fully linked data.

Introducing Jason Portenoy, newest full-time team member at OpenAlex

Photo of Jason Portenoy

Hi, I’m Jason Portenoy, and I’m very happy to be joining OurResearch as the newest full-time team member! As a data engineer, I will be focusing my efforts on user engagement and outreach for OpenAlex. It is my responsibility to understand the OpenAlex dataset—its strengths and limitations—and work with the user community to improve it and make it easier to use.

I completed my PhD in Information Science at the University of Washington, studying the use of the scholarly literature as data to curate, explore, and evaluate scientific research. This field—known by various terms including scientometrics, science of science, metascience, and Big Scholarly Data—captivated me from the moment I learned about it. As the scale of scientific output continues to increase well beyond the capacity of any individual to make sense of it, the need for new tools and techniques to help becomes more and more pronounced. Working with Dr. Jevin West at the UW Datalab, I developed these tools and techniques—analyzing and visualizing scholarly data, and building recommender systems to connect scientists to new research and ideas. I extended this work through projects with Semantic Scholar, the Chan-Zuckerberg Initiative, and JSTOR.

While working on these tools and analyses, I came to rely on several scholarly data sets, such as Web of Science and Microsoft Academic Graph. Through my experience, I became an advocate for having high-quality, open, and accessible data for researchers and builders to use. A solid foundation of quality data will strengthen all downstream applications, from simple counts and bibliometric statistics, to advanced natural language processing and complex systems approaches.

Joining the OpenAlex team is a fantastic opportunity for me to contribute to the future of scholarly data. When Microsoft decided to end its academic service, many others in the community and I wondered what would come next. It has become clear that OpenAlex will play a key role in the future of this field. I come to this position with technical training as a data engineer and data scientist, as well as experience with scholarly data. My goal is to work with the community of users to continually improve the OpenAlex data and experience. If there’s anything you think I might be able to help with, please let us know!

OpenAlex documentation improvements

It’s a new year and at OurResearch we’re starting off 2023 full steam ahead! We’ve revamped the OpenAlex documentation so that it’s easier to get started, and easier to find the fields and filters that are available in the OpenAlex API. It should take fewer “clicks” to find what you need.

Poised for growth

The major change we made was to highlight the core entities (works, authors, etc.) in OpenAlex, giving them their own up-front space. OpenAlex grew considerably in 2022, not only in the number of records, but also in the number of ways that you can filter, group, and search scholarly data. This new approach provides more room to add and document filters. We can better describe the unique search capabilities available in each entity. Overall, it sets us up to grow again in 2023.

Our goal is to maintain friendly and approachable documentation, so hopefully we’ve kept that up as well. If you find something broken, or have some suggested improvements, let us know!

Author search in OpenAlex: improved handling of diacritics within names

We’ve improved the author search feature within OpenAlex, so you get more results when searching for author names that may or may not include diacritics. For example, a search for the name “David Tarragó” will return the same number of results as the version that is converted via Lucene’s ASCII folding filter, which in this case is “David Tarrago”.

When searching with diacritics, results with the queried diacritics are more likely to be ranked towards the top. So the two searches may have slightly different rankings. You can compare the results of these two searches in the API.

These queries return the same number of results, with diacritic and non-diacritic names included. Keep in mind that results are weighted by the author’s works count, so that has an impact on relevance as well.
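
To try the comparison yourself, the sketch below builds both search URLs for the authors endpoint. The helper names are made up, and fold_diacritics is only a rough stand-in for Lucene’s ASCII folding filter (it drops combining marks after Unicode decomposition, which covers this example but not every character the real filter handles).

```python
import unicodedata
from urllib.parse import urlencode


def fold_diacritics(text):
    """Approximate ASCII folding: decompose, then drop combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def author_search_url(name):
    """Build an OpenAlex author search URL for the given name."""
    return "https://api.openalex.org/authors?" + urlencode({"search": name})


with_marks = "David Tarragó"
without_marks = fold_diacritics(with_marks)

print(author_search_url(with_marks))
print(author_search_url(without_marks))
```

Opening the two printed URLs and comparing the count reported in each response’s meta section is one way to check that both searches return the same number of results, even if the ordering differs.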

Why make this change?

When creating the OpenAlex author search capability, it was important for us to honor authors’ names by respecting diacritics. So searching with a diacritic returned results with diacritics. However, this strict approach makes it harder to find some authors. We’re comfortable with the compromise of searching with and without diacritics at the same time, while giving priority to the intended search query. Hopefully this improved feature is helpful!

Fetch multiple DOIs in one OpenAlex API request

Did you know that you can request up to 50 DOIs in a single API call? That’s possible due to the OR query in the OpenAlex API and looks like this:

https://api.openalex.org/works?filter=doi:10.3322/caac.21660|https://doi.org/10.1136/bmj.n71|10.3322/caac.21654&mailto=support@openalex.org

We simply separate our DOIs with the pipe symbol ‘|’. That query will return three works associated with the three DOIs we entered. As you can see in the query, a short form DOI or long form DOI (as a URL) are both supported.

This will save time and resources when requesting many DOIs. This technique works with all IDs in OpenAlex, including OpenAlex IDs and PubMed IDs (PMID).

Example with python requests

Let’s write an example Python script to show how we can get DOIs in batches of 50 using requests:

import requests

dois = ["10.3322/caac.21660", "https://doi.org/10.1136/bmj.n71", "10.3322/caac.21654"]
batch_size = 50  # the API accepts up to 50 IDs per request

# walk through the list one batch at a time
for start in range(0, len(dois), batch_size):
  pipe_separated_dois = "|".join(dois[start:start + batch_size])
  r = requests.get(f"https://api.openalex.org/works?filter=doi:{pipe_separated_dois}&per-page=50&mailto=support@openalex.org")
  r.raise_for_status()  # fail loudly on an HTTP error
  for work in r.json()["results"]:
    print(work["doi"], work["display_name"])

# results
# https://doi.org/10.3322/caac.21660 Global Cancer Statistics 2020: GLOBOCAN Estimates of Incidence and Mortality Worldwide for 36 Cancers in 185 Countries
# https://doi.org/10.1136/bmj.n71 The PRISMA 2020 statement: an updated guideline for reporting systematic reviews
# https://doi.org/10.3322/caac.21654 Cancer Statistics, 2021

Hope this is helpful!

Meet Casey – Now full time with OurResearch

Hi I’m Casey. I am excited to announce that I am now full time with OurResearch as a software engineer working on OpenAlex and Unpaywall!

My Journey

I freelanced for OurResearch prior to joining full time this summer. With Jason and Heather’s help I maintained Paperbuzz, Cite-As, and also built out a project to catalog academic journal pricing. With freelancing I was able to improve my python and data management skills in order to tackle bigger projects.

Prior to freelancing I enjoyed a career in the US Air Force, which I am proud of. I’m fortunate to have hundreds of hours as aircrew on multiple aircraft, as well as a variety of technical and leadership assignments. So if you ever want to talk airplanes be ready because I might talk your ear off!

My academic experience comes from my time in university pursuing advanced education.

My Vision with OurResearch

In December I helped build the API and set up Elasticsearch for a project called OpenAlex. That project has continued to grow and I love to see how many people are using it. My core job with OpenAlex is to provide front-line customer support, as well as maintain and improve the API and search infrastructure. I’m also working on several parts of Unpaywall.

It’s incredible that OurResearch tools are freely open and available. I find that OurResearch shares core values with the teams I served on in the Air Force: small teams empowered to make decisions, humble and accepting of feedback in order to make things better. That’s why we believe our community of users is invaluable in keeping those tools free, open, and easy to use.

So we will listen to your feedback, fix bugs and implement features quickly, and continue to maintain our documentation so the dataset and APIs are as frictionless as they can be. We welcome and need your help with this mission! So do not hesitate to contact me or the team.

I look forward to improving OpenAlex and Unpaywall, and to meeting those of you using OurResearch products!

– Casey