OpenAlex 2026 Roadmap

Posted on January 16, 2026 by Jason

We just wrapped up our Q1 2026 Town Hall. You can watch the full recording here, but this post covers the highlights: what we shipped last quarter, what’s coming this quarter, and why we think 2026 is a pivotal year for open science.

What we shipped in Q4

The Walden rewrite is done. OpenAlex now runs on a modern Databricks infrastructure that lets us ship faster and iterate on data quality in days instead of months.

We added 192 million new works from DataCite and repositories. OpenAlex now indexes 477 million works—the largest connected repository of scholarship ever published.

On funders and awards: we created Awards as a first-class entity, extracted 27 million funder links from fulltext PDFs, and integrated 15 new funders directly.

What’s coming in Q1

For enterprise users: Credit-based API pricing launches this month. Different calls cost different amounts:

a singleton (/works/w123) is 1 credit,
a list (/works?filter=foo:bar) is 10,
PDF content (coming this month!) is 100,
vector search is 1,000 (coming soon! email steve@ourresearch.org for early access!).
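As a rough illustration of how the credit costs above add up, here is a minimal Python sketch. The per-call costs come from the list above; the call-type labels (`singleton`, `list`, `pdf_content`, `vector_search`), the `estimate_credits` helper, and the sample workload are hypothetical names and numbers made up for this example, not part of the OpenAlex API.

```python
# Credit cost per call type, taken from the list in the post.
# The dictionary keys are labels invented for this sketch.
CREDIT_COSTS = {
    "singleton": 1,         # e.g. /works/w123
    "list": 10,             # e.g. /works?filter=foo:bar
    "pdf_content": 100,
    "vector_search": 1000,
}


def estimate_credits(call_counts):
    """Return the total credits for a mapping of call type -> number of calls."""
    unknown = set(call_counts) - set(CREDIT_COSTS)
    if unknown:
        raise ValueError(f"unknown call types: {sorted(unknown)}")
    return sum(CREDIT_COSTS[kind] * count for kind, count in call_counts.items())


# Hypothetical monthly workload, for illustration only.
workload = {"singleton": 5000, "list": 200, "pdf_content": 30, "vector_search": 2}
total = estimate_credits(workload)
print(total)  # 12000
```

In this made-up workload, the 5,000 singleton calls cost 5,000 credits, while the 2 vector searches alone cost 2,000.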

We’re also launching a sync service so you can pull daily updates in one chunk instead of polling millions of records.

For institutions: Affiliation matching curation launches in February. Members can edit the matching algorithm that links affiliation strings to their institution. Changes propagate to the API within a day—permanently improving the dataset for everyone.

We’re also launching two membership tiers at $5k and $20k/year that include the ability to curate your own data in OpenAlex, training/consulting, and pro API keys with higher API access for your faculty.

For researchers: A complete rewrite of author name disambiguation ships by end of Q1. This has always been the hardest problem in bibliometrics. With today’s AI, we think we can build the most accurate system ever made.

The bigger picture

There’s a lot more I want to say about why 2026 feels like a pivotal year—why we think the GUI is dead, why open data wins the AI era, and what that means for OpenAlex. I’ll save that for a follow-up post. For now: watch the town hall to hear the full argument, and try the vibe-coded demo I built live during the talk. And join our mailing list to stay up-to-date on all the wild stuff we’re doing this year. It’s going to be, by far, our biggest year ever. You ain’t seen nothing yet.

OpenAlex and NORA Collaborate to connect publications to the OECD FORD Taxonomy

Posted on January 16, 2026 by Kyle Demes

OpenAlex and NORA (the Danish National Open Research Analytics team) are pleased to announce a collaboration mapping the OpenAlex research classification system to the OECD Fields of Research and Development (FORD) taxonomy. This alignment supports the upcoming launch of the new Danish Research Portal, but also enables OpenAlex users globally to use the taxonomy in their research analytics.

🎯 Why This Matters for Research Analytics

Widely adopted taxonomies like OECD FORD are critical for international benchmarking, reporting, and policy alignment. At the same time, national governments, research institutions, and regional bodies often rely on their own classification schemes that reflect local research priorities and funding strategies.

By linking OpenAlex’s aboutness classification system with the OECD FORD taxonomy, this collaboration creates:

A bridge between global standards and national strategy
An open and transparent alternative to proprietary classification systems
A pathway for countries and institutions to conduct policy-relevant analytics using fully open data
A blueprint for creating crosswalks between OpenAlex and additional research taxonomies

This mapping supports both broader interoperability and regionally specific analysis—without compromising either goal.

🧭 How We Built the Mapping

The mapping was developed using a systematic methodology that relates OpenAlex research subfields with OECD FORD categories. OpenAlex uses metadata about research articles (e.g., title, abstract, journal) to classify research outputs into research topics, subfields, fields, and domains (full documentation here).

OpenAlex subfields were successfully mapped to 38 out of 42 two-digit FORD fields.
The four remaining categories did not have direct equivalents given the current OpenAlex taxonomy structure.
The resulting crosswalk supports comprehensive coverage of major research areas across the OECD framework.

The figure below shows the number of OpenAlex subfields that were mapped to each FORD category. A full table listing each OpenAlex subfield and its corresponding FORD categories is available here.
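For readers who want to apply a crosswalk like this to their own data, here is a minimal Python sketch of one way to do it. Everything in it is a placeholder: the subfield and FORD labels are invented (they are not the published mapping), and the `subfield` key on each work is an assumed simplification of the real OpenAlex work record.

```python
from collections import Counter

# Hypothetical crosswalk with placeholder labels, NOT the published mapping.
# A subfield may map to more than one FORD category.
CROSSWALK = {
    "subfield-a": ["ford-1"],
    "subfield-b": ["ford-1", "ford-2"],
}


def ford_counts(works, crosswalk):
    """Count works per FORD category; also count works with no mapping."""
    counts = Counter()
    unmapped = 0
    for work in works:
        categories = crosswalk.get(work["subfield"], [])
        if not categories:
            unmapped += 1
        # Full counting: a work counts once in every category it maps to.
        counts.update(categories)
    return counts, unmapped


# Made-up works, each reduced to an id and a subfield label.
works = [
    {"id": "w1", "subfield": "subfield-a"},
    {"id": "w2", "subfield": "subfield-b"},
    {"id": "w3", "subfield": "subfield-c"},  # not covered by the crosswalk
]
counts, unmapped = ford_counts(works, CROSSWALK)
print(dict(counts), unmapped)
```

The sketch uses full counting, so a work in a subfield that maps to two categories is counted in both. Fractional counting would be an equally reasonable choice, depending on the analysis.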

🤖 Combining Expert Knowledge with AI

To ensure quality and scalability, we employed a dual approach:

A human expert (from OpenAlex) manually assigned OpenAlex subfields to FORD categories.
The same task was conducted using ChatGPT to test whether AI could reliably assist in classification alignment.

Out of 250+ assignments, the two approaches differed in only 11 cases. These were reviewed in collaboration with researchers in those fields: ChatGPT’s classification was judged the better fit in 7 of the 11 cases, while the human’s classification was the better fit in only 4!
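The comparison described above can be sketched in a few lines of Python. This is an illustration of the general idea only, not the script the teams used: the `compare_assignments` helper and all subfield and FORD labels are hypothetical.

```python
def compare_assignments(human, model):
    """Return the subfields whose assigned categories differ, plus the agreement rate."""
    shared = sorted(human.keys() & model.keys())
    # Compare as sets so the order of categories does not matter.
    differing = [s for s in shared if set(human[s]) != set(model[s])]
    agreement = 1 - len(differing) / len(shared) if shared else None
    return differing, agreement


# Placeholder assignments, invented for this example.
human = {
    "subfield-a": ["ford-1"],
    "subfield-b": ["ford-2"],
    "subfield-c": ["ford-3"],
    "subfield-d": ["ford-1", "ford-4"],
}
model = {
    "subfield-a": ["ford-1"],
    "subfield-b": ["ford-3"],
    "subfield-c": ["ford-3"],
    "subfield-d": ["ford-4", "ford-1"],
}
differing, agreement = compare_assignments(human, model)
print(differing, agreement)  # ['subfield-b'] 0.75
```

Applied to the figures in this post, 11 differences out of at least 250 assignments corresponds to a disagreement rate of at most about 4.4%.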

This result gives both teams confidence in using AI to assist with future classification crosswalks—especially as a way to accelerate mappings between OpenAlex and other national or domain-specific taxonomies.

📊 What the Mapping Enables

Once mapped, the classifications were applied by NORA to publications in the Danish Research Portal, which aggregates research outputs from across Denmark’s institutions. The FORD classifications derived from OpenAlex were then compared with classifications from Scopus and Web of Science.

While proprietary licensing prevents sharing of detailed comparisons, results from the three systems were broadly aligned, with some differences reflecting their underlying methodologies. Importantly, this confirms that open infrastructure can meet the same analytical needs traditionally served by closed systems.

🚀 What’s Next

OpenAlex users around the world can apply the crosswalk in their own analyses. If you think it would be useful for us to expose the OECD FORD classifications directly in our public API, let us know! If there is enough interest, we’ll add it this year.
The Danish Research Portal will launch in mid-2026, showcasing Danish research outputs across the OECD FORD classifications.

With the new OpenAlex Walden system, we look forward to expanding support for multiple taxonomies to meet the needs of different countries, research communities, and policy environments.

⚠️ Important Note on Use

This mapping is not formally endorsed by the OECD. We consulted with the OECD team and shared preliminary results to ensure accuracy and transparency. However, users conducting official reporting should validate the mapping according to their institutional or national guidance.

🌍 A Shared Vision for Open, Interoperable Research Infrastructure

This collaboration demonstrates what is possible when national research infrastructure and open data providers work together to align global and local needs. By combining methodological rigor, AI-assisted innovation, and a commitment to openness, NORA and OpenAlex are helping advance a more interoperable and transparent research ecosystem.

If your organization or country uses its own classification system and is interested in implementing it in OpenAlex, we invite you to reach out and collaborate with us.

— The OpenAlex and NORA Teams

OpenAlex: 2025 in Review

Posted on January 5, 2026 (updated January 7, 2026) by Kyle Demes

2025 was a defining year for OpenAlex. After two years of learning what the world needs from OpenAlex, we spent last year rebuilding our entire foundation and massively expanding our coverage. During this rebuild, we supported exponential growth in usage across academia, government, and industry, solidifying OpenAlex as essential global infrastructure for research.

A New Foundation: Walden Launch

At the end of the year, we launched Walden, the complete rewrite of the OpenAlex system.

On day one, Walden added more than 190 million new works, including records from DataCite and thousands of institutional repositories. For the first time, OpenAlex now creates records even when research exists only in repositories—making millions of previously hard-to-find works truly discoverable. These new records currently live as a dedicated subset (xpac) while we continue strengthening metadata before full integration into the core index.

Walden also gives OpenAlex a modern, flexible architecture making it faster to add new sources, easier to improve quality at scale, and ready for the next generation of features and curation.

Unprecedented Adoption & Global Reach

Use of OpenAlex grew dramatically, ending the year with:

350,000+ monthly unique visitors to our UI
3+ million monthly pageviews on our UI
1.5 billion monthly API calls across OpenAlex (1B) + Unpaywall (0.5B), exceeding Crossref for the first time!
1,100+ research outputs in 2025 referencing OpenAlex

Rebranding and Clarifying the Mission

As OpenAlex continued to expand, it became clear that OpenAlex is not just one of our products—it is our mission. And in 2025, we reorganized to reflect that realization.

Today:

OpenAlex is the purpose and platform.
Unpaywall is a slice of the OpenAlex database delivered in a specific format.
Unsub is a dashboard built on top of OpenAlex, supporting specific use cases.

This unified identity makes it clearer for our users, clearer for our partners, and clearer for ourselves what we are building together.

Financial Progress & Sustainability

We achieved major sustainability milestones in 2025:

Reached our year 2 $800k ARR target—three months ahead of schedule
Received a $3.5M Wellcome grant to integrate global research funding metadata
Continued strong renewal rates and growing institutional engagement

Running both the old and new systems in parallel, supporting unprecedented usage growth, and delivering Walden led to higher costs than projected. But these were intentional investments to make OpenAlex stronger, more scalable, and more valuable for the long term.

Looking Ahead

With Walden now live, we’re excited to start our next chapter. In 2026, we will:

Launch full community curation pipelines
Integrate global funding metadata
Begin integrating research software as first class research objects
Deepen partnerships with governments, universities, and industry, rolling out new support models and new features.
Continue strengthening sustainability and reliability

Thank You

To everyone who contributed, partnered, advocated, experimented, and trusted OpenAlex this year: thank you! We are thrilled and humbled to watch OpenAlex become the open, global scholarly knowledge graph the world depends on and are deeply aware that none of this happens without you.

Here’s to an even bigger 2026.

— The OpenAlex Team

A Better Way to Detect Language in OpenAlex—and a Better Way to Collaborate

Posted on October 20, 2025 by Kyle Demes

As part of the recent Walden system launch, we’ve improved how OpenAlex detects the language of scholarly works. The results are immediately visible in the data: many more works are now correctly recognized as non-English, new languages appear that weren’t represented at all before, and previously unclassified works now have accurate language assignments.

The chart below (source) shows the number of works attributed to each language in the Classic vs. Walden OpenAlex. Most languages fall above the diagonal line, meaning more works in Walden are classified with that language. The cluster of languages on the y-axis consists of languages that had no works in Classic OpenAlex but now have works in Walden.

We’re excited about this improvement. But the story behind this improvement is just as important as the technical result—it’s a model for how the research community and open infrastructures like OpenAlex can collaborate to make real, shared progress.

From helpful critique to a true collaboration

Last year, a group of researchers published a preprint evaluating OpenAlex’s language-classification system using a large multilingual gold standard (Céspedes et al., arXiv:2409.10633v2, now published as https://doi.org/10.1002/asi.24979). We were excited to see that an international research collaborative had undertaken such a significant project using OpenAlex with the aim of improving its usefulness for the global research community. Their study was rigorous and thoughtful, and it confirmed something we already knew: our approach to language detection could be improved.

However, the paper stopped short of evaluating and recommending the concrete next steps we could take to improve language detection in OpenAlex. We hadn’t been involved at the beginning of the study, so the authors didn’t know which metrics and performance comparisons we would need to actually deploy a better model in production. After publication, we met with some of the authors to discuss what we needed in order to turn their work into improvements in OpenAlex:

We needed precision and recall metrics for multiple competing candidate algorithms (with a bias towards precision); and
We needed analysis that considered cost and runtime, given that any model we deploy must scale to 400 million+ records.
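To make the first requirement concrete, here is a minimal Python sketch of per-language precision and recall computed against a gold standard. It is illustrative only and is not the evaluation code from the study: the function name and the tiny sample labels are invented, and `None` stands in for a work the detector left unclassified.

```python
from collections import Counter


def per_language_precision_recall(gold, predicted):
    """Return {language: (precision, recall)} for parallel lists of labels."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            if p is not None:   # an unclassified work is not a false positive
                fp[p] += 1
            fn[g] += 1          # but it is a miss for the true language
    result = {}
    for lang in set(tp) | set(fp) | set(fn):
        n_predicted = tp[lang] + fp[lang]
        n_actual = tp[lang] + fn[lang]
        precision = tp[lang] / n_predicted if n_predicted else None
        recall = tp[lang] / n_actual if n_actual else None
        result[lang] = (precision, recall)
    return result


# Made-up labels for five works (ISO 639-1 codes).
gold = ["da", "da", "no", "en", "en"]
pred = ["da", "no", "no", "en", None]
metrics = per_language_precision_recall(gold, pred)
print(metrics["da"], metrics["no"], metrics["en"])
# (1.0, 0.5) (0.5, 1.0) (1.0, 0.5)
```

A bias towards precision, as described above, means preferring a detector that leaves a work unclassified over one that guesses the wrong language: in this sketch the unclassified English work lowers recall for `en` but leaves every language's precision untouched.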

The researchers enthusiastically took on the additional work, checking in with us throughout the process to make sure they were on the right track. The result was a preprint from their follow-on study (Sainte-Marie et al., arXiv:2502.03627) that provided exactly the applied, scalable insight we needed.

Turning research into real-world impact

As part of the Walden rewrite, we implemented one of the top-recommended approaches from their study. The improvement has been dramatic:

More works are now correctly classified as non-English languages, instead of being incorrectly labeled as English.
New languages, previously absent from OpenAlex, are now detected for the first time.
Previously “null” records now have reliable language tags.

Before deploying the new model in production, we already knew from the researchers’ analyses and their multilingual gold-standard sets that it would yield a strong overall improvement across the corpus. But we wanted to confirm that in practice. So we manually reviewed a random sample of works whose language classification differed between the old and new systems—and in the vast majority of those cases, the new system was correct.

We also validated against real-world feedback. For instance, the NORA team at Research Portal Denmark had previously submitted support tickets detailing mix-ups between Danish and Norwegian, two languages that are notoriously similar in writing. In ~75% of those cases, the new system now gets it right.

A model for future collaboration

To be clear: we value and learn from every independent evaluation of OpenAlex. One-way critiques from researchers are a vital part of the open-infrastructure ecosystem, and we deeply appreciate the time and expertise the global research community is investing in making OpenAlex better.

What made this case stand out was the second step: turning that critique into a direct collaboration that produced immediately deployable improvements. By working together, we created a fast-tracked feedback loop—from identifying issues in OpenAlex, to developing and testing solutions, to rolling out fixes across hundreds of millions of records. It’s a model we’d love to repeat.

And this is only the beginning. In the next few weeks, we’ll be launching a new community curation system letting researchers and metadata experts around the world submit corrections directly to OpenAlex—creating an even faster, more transparent, and more collaborative way to improve research metadata at scale.

Stay tuned—and thank you to everyone helping make open research information better, one contribution (and one collaboration) at a time.

Major Update to Unpaywall Database

Posted on July 29, 2025 by Kyle Demes

We recently announced major changes to Unpaywall on our Unpaywall Google group (https://groups.google.com/g/unpaywall) and via email to Unpaywall Premium Subscribers. A lot of folks aren’t on the group, so we’re announcing here as well.

TL;DR
Unpaywall has migrated to a new codebase that helps us address data quality issues faster, and you may notice some changes.

The API is way faster → 10× faster API responses (avg 500 ms → 50 ms).
Some data has changed → About 23% of works saw data change, with about 10% seeing changes in oa_status (green, gold, etc) and 5% in is_oa (closed or open).
Overall accuracy is similar → Overall, precision remains constant. We have better recall of some Gold articles and worse detection of some Green articles.
Tiny schema changes → Your scripts, API calls, and data feeds keep running, but two fields are now deprecated (oa_locations.evidence & oa_locations.updated).
Community curation → Users can now report and fix errors at unpaywall.org/fix.
Action required only if you host the full dataset locally (details below).

Why rewrite a perfectly good tool?

A decade ago we developed Unpaywall to:

make open access research in institutional repositories discoverable by users globally,
track open access behaviours and generate evidence for effective open access policies, and
raise the bar for open infrastructures by ensuring that the industry standard for determining open access status was itself completely open.

We’re happy to report it has been very effective at achieving those goals:

Our Chrome and Firefox extensions are used by 800k monthly active users around the world,
Unpaywall sees an average of 200 API calls per second every second of the year,
Unpaywall now underpins every major open access monitoring and tracking initiative globally, and
Unpaywall has demonstrated an effective model for operating open research infrastructure.

Over the years, Open Access has become increasingly important to researchers, institutions, funders, and publishers. And steady changes over the years brought us to a publishing system that looks different from the one we started in. At first, it was exceedingly rare for a publication that was open access to later become closed access. It was rare for publishers to make closed access works openly available for short times (like during COVID). And with the exception of embargo periods, it was rare for closed journals to later be made completely open.

All of these are common now, and at the scale of millions of publications. And publication landing pages aren’t just about providing the user with access to information; they also now collect information on users. As scholarly communication has evolved, it became clear that Unpaywall needed to evolve from a product into a process. And unfortunately, the code base that supported Unpaywall was struggling to adapt. With every change, we introduced new bugs, and fixing each new bug kept creating more bugs. To continue delivering high quality open access metadata in an efficient way, we needed to start from scratch.

We spent the last year completely re-writing the code base for Unpaywall to make it:

faster;
easier to fix when it breaks; and
easier for users and publishers to curate.

On May 20, 2025, we launched the update. We have been working with our premium subscribers to implement the changes in their locally hosted databases that rely on Unpaywall. Most of our users switched to the new code base without even noticing, and that was intentional. Still, we think it is important for our users, especially those whose work depends on the Unpaywall database, to understand these changes.

What didn’t change

Stable as ever	Details
Data format & schema	All keys stay the same (only the fields: oa_locations.evidence and oa_locations.updated are now marked “deprecated”).
API & data feed URLs	Zero downtime, same endpoints.
Aggregate metrics	10% of records saw a change in oa_status (i.e., color) and 5% saw a change in is_oa (open access vs. closed). Some changes were improvements and some were degradations, but overall precision remains the same
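If you keep an older copy of the data, one way to see how the aggregate figures above play out in your own subset is to compare records by DOI. The Python sketch below is a hedged illustration, not an official tool: it assumes each record is a dict with the `doi`, `is_oa`, and `oa_status` keys, the `change_rates` helper is hypothetical, and the DOIs and values are made up.

```python
def change_rates(old_records, new_records):
    """Share of DOIs present in both sets whose oa_status / is_oa changed."""
    old_by_doi = {record["doi"]: record for record in old_records}
    compared = status_changed = is_oa_changed = 0
    for new in new_records:
        old = old_by_doi.get(new["doi"])
        if old is None:
            continue  # DOI only exists in the new data; nothing to compare
        compared += 1
        status_changed += old["oa_status"] != new["oa_status"]
        is_oa_changed += old["is_oa"] != new["is_oa"]
    if compared == 0:
        return {"compared": 0, "oa_status": 0.0, "is_oa": 0.0}
    return {
        "compared": compared,
        "oa_status": status_changed / compared,
        "is_oa": is_oa_changed / compared,
    }


# Made-up records with placeholder DOIs.
old = [
    {"doi": "10.1234/a", "is_oa": True, "oa_status": "green"},
    {"doi": "10.1234/b", "is_oa": False, "oa_status": "closed"},
    {"doi": "10.1234/c", "is_oa": True, "oa_status": "gold"},
    {"doi": "10.1234/d", "is_oa": False, "oa_status": "closed"},
]
new = [
    {"doi": "10.1234/a", "is_oa": True, "oa_status": "gold"},    # color changed
    {"doi": "10.1234/b", "is_oa": True, "oa_status": "bronze"},  # closed -> open
    {"doi": "10.1234/c", "is_oa": True, "oa_status": "gold"},
    {"doi": "10.1234/d", "is_oa": False, "oa_status": "closed"},
]
rates = change_rates(old, new)
print(rates)  # {'compared': 4, 'oa_status': 0.5, 'is_oa': 0.25}
```

The sketch deliberately reads only `is_oa` and `oa_status`, so it does not depend on the two deprecated fields.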

What did change

Better than before	How it helps you
Speed	The average API call now returns in 50 ms, compared with 500 ms before: a 10× speedup! ⚡
Accuracy	We detect more Gold OA, licenses, fresh OA URLs, and works that were once open access but are now closed. We detect less Green OA (but we’ll be able to improve that soon).
Curation UI	Users around the world can submit fixes via a web form; they go live in days.
Bulk Curation	Publishers can now submit bulk changes directly to us when their journals change from closed to open (or vice versa); they go live within 2 weeks.
Bug-fix velocity	Cleaner code = faster bug fixes.

Do you need to do anything?

Your setup	Required action
API-only	Nothing. You’re already on the new code and likely didn’t even notice
Data-feed mirror	Download our one-time “May 20 Snapshot” and overwrite your current database—too many small tweaks for a changefile.

Meet the new Curation Portal

We heard loud and clear from our users that they need to be able to fix open access metadata errors when they find them. And that’s why we developed a community curation pipeline for Unpaywall.

Found a record that still looks off? Head to unpaywall.org/fix, flag the issue, and we’ll merge your correction shortly (typically within 3 business days). Your expertise powers continual data quality improvements.

If you have ideas on how to improve the functionality of the curation user interface, please send them to brett@ourresearch.org.

Looking ahead

Community curation of Unpaywall will become increasingly important for overall database accuracy, and fixes made in Unpaywall will flow through to all downstream users (Web of Science, Scopus, Dimensions, and more).
We will collaborate more closely with publishers directly to make large-scale changes associated with journal policy changes more quickly and accurately.
We will continue refining specific parts of our pipelines to increase their overall reliability, including better detection of OA status, journal OA status, license information, and fulltext links.
Users will see faster patch cycles for reported issues.
We will increase repository coverage and enhance linkage between publisher and repository versions.
Later this summer, we’ll be launching a full re-write of OpenAlex to bring the databases into closer alignment where they overlap (i.e., OA status metadata for publications with Crossref DOIs)

Thank you

We heard loud and clear from our communities of users that timely fixes of data quality issues are critical for them to be able to rely on Unpaywall. And we know that our response times slipped while we tackled this rewrite—thanks for sticking with us!

If you spot an error in the Unpaywall database that you would like to see fixed, the fastest way to do that is at unpaywall.org/fix. If you have other questions, send a note to support@unpaywall.org.

Here’s to a faster, cleaner, and ever-more-useful Unpaywall!

— The OurResearch Team

OpenAlex: 2024 in Review

Posted on December 24, 2024 by Kyle Demes

As 2024 comes to a close, we’re taking the opportunity to reflect on the year behind us. And what a year it has been for OpenAlex!

It’s hard to believe that it was only one year ago that we launched the Beta of our web interface and the first University, The Sorbonne, announced that they were replacing their proprietary database with OpenAlex.

Since then the team has worked hard to meet the evolving needs of our communities of users. Below are some of the highlights of 2024.

Organization:

We received a 5-year grant from Arcadia totalling $7.5M to establish OpenAlex as a sustainably open index of the global research ecosystem
We received a 2-year grant from the Navigation Fund totaling $688k to enhance the OpenAlex user interface
We hired a Chief Operating Officer (Kyle Demes) and Senior Frontend Developer (Brett Lockspeiser)
Our Premium subscriptions exceeded our first year’s sustainability target by 25%

Data:

We started parsing fulltext PDFs to add more affiliation and reference metadata
We started matching references without DOIs (we now have 2.5B citations)
We added HAL as a primary source for new works
We started ingesting DataCite as a primary source. We now have 6.4M DataCite records (we’ll have them all in a few months)
We enhanced metadata accuracy for work type, publication year, author, institution, source, open access status, and more
Our data was adopted by three major University rankings:
Our data was featured in a Science News article examining the sustainability of APC fees paid by researchers

Product:

We launched our Beta User Interface
We launched a new aboutness classification system (topics → subfields → fields → domains)
We launched new normalized citation metrics (field-weighted citation impact and citation percentiles) to facilitate comparison across fields and years.
We introduced user curation for affiliation, author, source, and work-level metadata and have already received more than 10k requests
We expanded our offerings of paid services to help us get to sustainability faster
We laid the foundation for an exciting new analytics product we’re looking forward to showing off early next year

Community (you):

Monthly users of OpenAlex.org have grown from 28k at the beginning of the year to 78k, now representing 440k visits per month!
Our first OpenAlex User Meeting was a huge success with 27 presentations from OpenAlex users in diverse organizations around the world
We attended 9 conferences to promote OpenAlex and engage with our user community globally (Research Analytics Summit, CARA, BRIC, ICSSI, Make Data Count, LIS, STI, SRAI, and The Charleston Conference) and were truly humbled to see presenters and vendors at every conference using OpenAlex data!
We launched a YouTube channel which now has 49 videos, 736 subscribers, and almost 25,000 views!
Over 500 publications mention or reference OpenAlex and that number grows daily!
We hosted an open call for our first Community Advisory Board where 50+ stellar nominees received almost 1,400 votes from the community. Stay tuned for an announcement of results in early 2025

None of this would have been possible without all of you. So thank you! For your continued support, ideas, engagement, criticism, cheerleading, and collaboration. We’re looking forward to continuing to work together to build off these successes in 2025. Until then, Happy Holidays to you and yours.

Sincerely,

The OpenAlex Team

OurResearch receives $688k grant from Navigation Fund to enhance the OpenAlex User Interface

Posted on November 25, 2024 by Kyle Demes

OurResearch is proud to announce a grant of $688,800 from The Navigation Fund to develop and launch an open, sustainable, web-based research intelligence (RI) module for the OpenAlex website. Our goal is to support expert finding, trend detection, and knowledge gap identification for researchers and research users. The RI module will serve as a map of the research landscape that’s easy to use for non-technical users, powerful for technical users, and supportive in helping all users increase their technical skills.

OpenAlex is the world’s first completely open and comprehensive index of the world’s research ecosystem. For the first time, everyone in the world has unrestricted access to a graph of the research ecosystem connecting hundreds of millions of scholarly outputs from thousands of fields across the globe to 100+ million authors from 100,000+ institutions. Hundreds of academic studies have already used OpenAlex to accelerate their research and to study science itself (link); universities and governments are adopting OpenAlex for their research intelligence needs, disrupting the established proprietary model (example); University rankings are switching to OpenAlex (example); and companies big and small around the world are using OpenAlex data to drive innovation previously not possible.

While we are thrilled by the early success of OpenAlex, we have noticed two significant barriers that are impeding more widespread adoption of OpenAlex: (1) many people (especially decision-makers) struggle to understand the promise of OpenAlex without seeing its potential first hand and (2) even when people can imagine how OpenAlex can benefit their work, most do not have the technical resources and capacity to create the desired insights from the openly available data.

With this grant, we have just hired a Senior Frontend Developer, starting December 2, 2024, to design and iteratively release new UI features. The expected UI enhancements will lead to better and more timely research-based outcomes in both enterprise and academia, including more productive collaborations, faster investigation of promising research fronts, and quicker time-to-market for new discoveries. Stay tuned for exciting updates to the OpenAlex web interface in 2025!

— — — —

OurResearch is a nonprofit that builds tools to help accelerate the transition to universal Open Science. Started at a hackathon in 2011, they remain committed to creating open, sustainable research infrastructure that solves real-world problems, like Unpaywall, Unsub, and OpenAlex.

The Navigation Fund is a 501(c)(3) nonprofit organization seeking to advance bold solutions to the world’s most urgent problems.

Coverage in the Financial Times of OpenAlex and the Sorbonne

Posted on January 19, 2024 by Jason Portenoy

The Financial Times recently published an article detailing Sorbonne University’s “radical decision” to switch to OpenAlex for its publication database and bibliometric analytics. The article (behind a paywall, unfortunately 😞) came out a little while ago, but we wanted to highlight it here in case you missed it.

The news comes in the context of “a wider pushback against the current model in academic publishing, where researchers publish and review papers for free but have to buy expensive subscriptions to the journals in which they are published to analyse data relating to their work.” It includes a quote from OurResearch/OpenAlex co-founder and CEO Jason Priem: “We felt there’s a mismatch between the values of the academy and the shareholder boardroom. Research is fundamentally about sharing, while for-profits are fundamentally about capturing and enclosing. We aim to create and sustain research infrastructure that’s truly aligned with . . . the values of the research community.”

Exciting times for OpenAlex and open science!

Jack, Andrew. “Sorbonne’s Embrace of Free Research Platform Shakes up Academic Publishing.” Financial Times, December 27, 2023. https://www.ft.com/content/89098b25-78af-4539-ba24-c770cf9ec7c3.

Assigning Institutions — New England Journal of Medicine Case Study

Posted on December 11, 2023 (updated January 11, 2024) by Jason Portenoy

The New England Journal of Medicine uses a non-standard format when presenting authors and their institutional affiliations, which is a problem when we want to keep track of these links in our data. We developed a custom algorithm to solve this problem, preserving more than a hundred thousand author-institution links.

Linking works, authors, and institutions

Part of a diagram from the OpenAlex docs, showing how authors and institutions are linked to works through authorships. — OpenAlex data has links between works, authors, and institutions.

Works, authors, and institutions are three of the basic entities in the OpenAlex data. Keeping track of the relationships between these entities is one of the core things we do. It’s important that we identify these links correctly, so they can be used for downstream tasks like university research intelligence, ranking, etc. Often, this information comes to us via structured data which is not difficult to ingest. Many times, however, the data is messy, and using it is not so straightforward.

Affiliation data in the New England Journal of Medicine

Publications from the New England Journal of Medicine (NEJM) are an example of this messiness. Author affiliations in these papers are presented in a format that is human-readable, but not straightforward for a computer to parse automatically. In most other journals, authors are listed alongside their affiliated institutions, and so it is relatively easy for a program to link them together. NEJM does it a different way—as shown in the screenshot of a paper from the journal’s website, institutions are listed together with the initials of the authors, which in turn correspond to the full author names at the top of the paper.

Screenshot of the affiliations of a paper from the New England Journal of Medicine's website. — Author affiliations in NEJM come in a nonstandard format that is not easy for a computer to parse.

We might hope that the structured metadata we get from Crossref would have the data in a more standard format. But alas, this isn’t the case, as shown in the screenshot of data from the Crossref API.

Screenshot of JSON data from the Crossref API — Data about the paper from the Crossref API is also in the nonstandard format.

There are around 170,000 works from this journal. This is a relatively tiny proportion of the total number of works in OpenAlex. However, NEJM is a highly influential journal in medicine, so it’s a priority that we get this right.

Custom OpenAlex solution to assign institutions to NEJM authors

OpenAlex team member Nolan created a bespoke algorithm specifically for NEJM papers to parse the affiliation strings and assign authors to institutions. This rule-based algorithm identifies the author initials that might correspond to the full names, and uses those as a mapping to get the link from institution to author, as shown in the screenshot from the OpenAlex API of the example paper from above. The full data for this work can be found at https://api.openalex.org/works/W4386208393.

We have been able to apply this algorithm to around 35,000 articles, yielding 158,000 institutional affiliations. Additionally, we identified about ten thousand raw affiliation strings that we couldn’t match to an institution, but that can still prove useful to our users.
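To make the initials-matching idea concrete, here is a minimal Python sketch. It is not Nolan's actual algorithm, which we haven't reproduced here. The sample affiliation string, the author names, and the helper names (`initials`, `assign_institutions`) are all made up for illustration, and the sketch assumes each group of initials appears in parentheses right after its institution.

```python
import re

# Assumed layout: "<institution text> (<comma-separated initials>)", repeated.
GROUP_RE = re.compile(r"([^()]+)\(([^()]+)\)")
# Strip separators and filler words ("From", "and", "the") before an institution.
LEADING_RE = re.compile(r"^[\s;,]*(?:From\s+)?(?:and\s+)?(?:the\s+)?")


def initials(full_name: str) -> str:
    """Turn 'Jane A. Doe' into 'J.A.D.' and 'Jean-Luc Picard' into 'J.-L.P.'."""
    parts = []
    for token in full_name.split():
        pieces = [p for p in token.split("-") if p]
        parts.append("-".join(p[0].upper() + "." for p in pieces))
    return "".join(parts)


def assign_institutions(authors, affiliation_string):
    """Map each full author name to the institutions listed with its initials."""
    by_initials = {}
    for match in GROUP_RE.finditer(affiliation_string):
        institution = LEADING_RE.sub("", match.group(1)).strip()
        for ini in match.group(2).split(","):
            by_initials.setdefault(ini.strip(), []).append(institution)
    # Authors whose initials never appear get an empty list.
    return {name: by_initials.get(initials(name), []) for name in authors}


# Hypothetical example data, not taken from a real paper.
sample = (
    "From the Department of Medicine, Example University, Boston (J.A.D., R.S.); "
    "and the Institute of Hypothetical Research, London (R.S., M.K.L.)."
)
links = assign_institutions(["Jane A. Doe", "Ravi Singh", "Mei K. Lin"], sample)
for author, institutions in links.items():
    print(author, "->", institutions)
```

A sketch like this breaks down when two authors share the same initials or when the initials in the affiliation string don't follow the expected pattern, which is part of why a real rule-based solution needs more care than this.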

The NEJM case is an example of the attention to data and extra effort that is part of the value that OpenAlex hopes to provide. The data can be messy sometimes. It’s our mission to help make sense of it, so the world can have access to high-quality, free and open data.

Screenshot of JSON data from the OpenAlex API — OpenAlex data has institutional affiliations as structured, fully linked data.

Introducing Jason Portenoy, newest full-time team member at OpenAlex

Posted on March 1, 2023 by Jason Portenoy

Hi, I’m Jason Portenoy, and I’m very happy to be joining OurResearch as the newest full-time team member! As a data engineer, I will be focusing my efforts on user engagement and outreach for OpenAlex. It is my responsibility to understand the OpenAlex dataset—its strengths and limitations—and work with the user community to improve it and make it easier to use.

I completed my PhD in Information Science at the University of Washington, studying the use of the scholarly literature as data to curate, explore, and evaluate scientific research. This field—known by various terms including scientometrics, science of science, metascience, and Big Scholarly Data—captivated me from the moment I learned about it. As the scale of scientific output continues to increase well beyond the capacity of any individual to make sense of it, the need for new tools and techniques to help becomes more and more pronounced. Working with Dr. Jevin West at the UW Datalab, I developed these tools and techniques—analyzing and visualizing scholarly data, and building recommender systems to connect scientists to new research and ideas. I extended this work through projects with Semantic Scholar, the Chan-Zuckerberg Initiative, and JSTOR.

*Nautilus visualization showing scholarly impact (J. Portenoy)*

While working on these tools and analyses, I came to rely on several scholarly data sets, such as Web of Science and Microsoft Academic Graph. Through my experience, I became an advocate for having high-quality, open, and accessible data for researchers and builders to use. A solid foundation of quality data will strengthen all downstream applications, from simple counts and bibliometric statistics, to advanced natural language processing and complex systems approaches.

Joining the OpenAlex team is a fantastic opportunity for me to contribute to the future of scholarly data. When Microsoft decided to end its academic service, many others in the community and I wondered what would come next. It has become clear that OpenAlex will play a key role in the future of this field. I come to this position with technical training as a data engineer and data scientist, as well as experience with scholarly data. My goal is to work with the community of users to continually improve the OpenAlex data and experience. If there’s anything you think I might be able to help with, please let us know!