We’ve made a major, systematic improvement to how OpenAlex finds and assigns corresponding authors, and the corresponding institutions tied to them. On a hand-checked gold standard, precision rose from 0.60 to 0.92 while recall held nearly steady (0.91 to 0.88), lifting our F1 score from 0.72 to 0.90. We also added real corresponding authors to roughly 7 million works that were missing them. If you use OpenAlex to track transformative agreements, negotiate with publishers, or attribute output to the institution that led a paper, this one matters.
Here’s what we did, how we measured it, and what it means for your work.
Why corresponding authors matter
The corresponding author is usually the person who submitted the paper and who handles publishing decisions, including who pays any article-processing charge. For libraries and consortia, that makes corresponding-author and corresponding-institution data the backbone of transformative-agreement (read-and-publish) tracking: eligibility for most of these deals is determined by the corresponding author’s affiliation. If that field is wrong or missing, the analysis underneath a negotiation is wrong or missing too.
This work grew directly out of a collaboration with the University of California’s California Digital Library (CDL), who resourced the project. CDL flagged that gaps and errors in corresponding-author coverage were constraining their publishing analyses and negotiation strategy, and partnered with us to fix it at the source rather than work around it. We’re grateful for their support and for pushing us toward a problem whose solution benefits the whole community.
Building a gold standard, then improving against it
The foundation of this project was measurement. We built a hand-checked gold standard of corresponding author assignments and used it as ground truth, both to see how we were actually doing and to drive systematic improvements.
OpenAlex marks corresponding authors with the is_corresponding flag on each authorship, and exposes them on a work through corresponding_author_ids and corresponding_institution_ids. Against the gold standard, our starting point was a precision of just 0.60 (paired with a recall of 0.91), for an F1 of 0.72. In plain terms: while we correctly found 91% of true corresponding authors, when we labeled an author as corresponding it was only correct 60% of the time. That gave us a clear baseline and a way to test every change we made.
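To make those fields concrete, here is a minimal sketch of reading them from a single work record that has already been fetched and parsed into a Python dict. The field names `is_corresponding`, `corresponding_author_ids`, and `corresponding_institution_ids` come from the description above. The nested layout (`author.id`, `institutions[].id`) reflects our understanding of the work object and should be checked against a live record. Every ID in the sample is a made-up placeholder.

```python
from typing import Any

# Hypothetical work record, trimmed to the fields discussed above.
# All IDs are placeholders, not real OpenAlex entities.
sample_work: dict[str, Any] = {
    "id": "https://openalex.org/W0000000001",
    "authorships": [
        {
            "author": {"id": "https://openalex.org/A0000000001"},
            "institutions": [{"id": "https://openalex.org/I0000000001"}],
            "is_corresponding": False,
        },
        {
            "author": {"id": "https://openalex.org/A0000000002"},
            "institutions": [{"id": "https://openalex.org/I0000000002"}],
            "is_corresponding": True,
        },
    ],
    "corresponding_author_ids": ["https://openalex.org/A0000000002"],
    "corresponding_institution_ids": ["https://openalex.org/I0000000002"],
}


def corresponding_authors(work: dict[str, Any]) -> list[tuple[Any, list[Any]]]:
    """Return (author_id, [institution_ids]) for each authorship flagged as corresponding."""
    result = []
    for authorship in work.get("authorships") or []:
        if authorship.get("is_corresponding"):
            author_id = (authorship.get("author") or {}).get("id")
            institution_ids = [
                inst.get("id") for inst in authorship.get("institutions") or []
            ]
            result.append((author_id, institution_ids))
    return result
```

The work-level lists are the convenient shortcut for most analyses. Walking the authorships is useful when you need to know which institution goes with which corresponding author.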
With that baseline in hand, we focused on the core problem: getting much better at recognizing and attributing corresponding-author information from the unstructured text of the records and documents we ingest. Corresponding-author details are usually present somewhere, but they’re expressed in messy, inconsistent, free-text ways rather than in a clean structured field. We substantially improved our ability to recognize and extract that information, which let us assign genuine corresponding authors to millions of works where we previously had none, and correct many where we had the wrong one.
What the gold standard taught us about an old assumption
The gold standard also let us test an assumption baked into our older system. When we couldn’t find explicit corresponding-author information for a paper, we used to fall back on treating the first author as corresponding. We knew this assumption was imperfect and had become less reliable in recent years (particularly after transformative agreements tied APC payments to corresponding authors), but this exercise showed just how unreliable it had become. The first author does turn out to be the corresponding author more than half the time, which still means the fallback was wrong in nearly half of cases. Leaning on it propped up our previous recall (0.91), but it is also why our previous precision was so low (0.60).
Through this work, we were able to remove the first-author fallback entirely, and our recall barely moved: it went from 0.91 to 0.88. Normally, throwing away a blanket assumption like that would tank recall. It didn’t, because our improved text recognition now finds the real corresponding author in nearly all the cases the assumption used to paper over. Precision, meanwhile, climbed from 0.60 to 0.92. Where we now have genuine evidence, we use it; where we still don’t, we say so (that is, we prefer null values to assumed ones). The result is more accurate and reliable data.
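The trade-off is easier to see on a toy example. The sketch below is purely illustrative: the four works, the author labels, and the three prediction sets are invented, and the resulting numbers are not the gold-standard figures. It only shows the mechanism, in which a first-author guess adds some correct pairs (raising recall) along with some wrong ones (lowering precision), while better evidence can recover recall without the wrong guesses.

```python
def precision_recall(predicted: set, gold: set) -> tuple[float, float]:
    """Precision and recall of predicted (work, author) pairs against a gold set."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Invented gold standard: one true corresponding author per work.
gold = {("w1", "a2"), ("w2", "a1"), ("w3", "a3"), ("w4", "a4")}

# Invented predictions backed by explicit evidence (older, weaker extraction).
evidence_old = {("w1", "a2"), ("w3", "a3")}

# First-author guesses for the works with no evidence: right for w2, wrong for w4.
first_author_guess = {("w2", "a1"), ("w4", "a1")}

# Invented predictions from stronger extraction, with no guessing at all.
evidence_new = {("w1", "a2"), ("w2", "a1"), ("w3", "a3")}

with_fallback = precision_recall(evidence_old | first_author_guess, gold)
without_fallback = precision_recall(evidence_old, gold)
improved_no_fallback = precision_recall(evidence_new, gold)
```

In this toy case the fallback gives (0.75, 0.75), dropping it with the old extraction gives (1.0, 0.5), and the stronger extraction without any fallback gives (1.0, 0.75): recall is back where the fallback had it, and no wrong assignment is made.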
The results
Measured against the hand-checked gold standard:
| Metric | Before | After |
| --- | --- | --- |
| Precision | 0.60 | 0.92 |
| Recall | 0.91 | 0.88 |
| F1 | 0.72 | 0.90 |
- Precision rose from 0.60 to 0.92. Wrong corresponding-author assignments dropped from about 40% to about 8%, roughly an 80% cut in errors.
- Recall held nearly steady, 0.91 to 0.88, even though we removed the first-author fallback that used to inflate it. Our improved text recognition recovered almost all of that coverage on its own merits.
- F1 climbed from 0.72 to 0.90, reflecting both gains at once.
- We added real corresponding authors to about 7 million works (about 8.2 million new corresponding-author assignments), a roughly 21% increase in the share of multi-author works that now carry a corresponding author.
- The biggest gains came across the major publishers, including Elsevier, Springer Nature, Wiley, and Oxford University Press, which together account for a large share of the recovery.
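For readers who want to check the arithmetic, the short sketch below recomputes F1 (the harmonic mean of precision and recall) and the relative cut in wrong assignments from the precision and recall values reported above. The function and variable names are ours, chosen for illustration.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the table above.
f1_before = f1_score(0.60, 0.91)
f1_after = f1_score(0.92, 0.88)

# Share of assigned corresponding authors that were wrong is 1 - precision.
error_before = 1 - 0.60
error_after = 1 - 0.92
relative_error_cut = (error_before - error_after) / error_before

print(f"F1 before: {f1_before:.2f}, after: {f1_after:.2f}")
print(f"Relative cut in wrong assignments: {relative_error_cut:.0%}")
```

Rounded to two decimals, the F1 values come out at 0.72 and 0.90, and the error share falls from 0.40 to 0.08, an 80% relative reduction.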
What’s still hard
We want to be straight about the limits. The cases we still struggle with are mostly very small publishers, where the corresponding author is mentioned only inside an email address or tucked into an obscure corner of the record, with no consistent pattern to recognize. We’ll keep working on the long tail, and as always, you can help by curating records where you spot a problem (send error reports to support@openalex.org).
What this means for you
If you rely on corresponding_author_ids or corresponding_institution_ids, your results should be both more accurate and more complete than they were.
If you’ve run these analyses recently, you’ll want to repeat them. The nice part is that all of the queries you’ve already built will work exactly the same way; they’ll just return more accurate data now.
A couple of things to keep in mind:
- Because we now assign corresponding authors from real evidence instead of a fallback guess, you’ll occasionally see works with no corresponding author rather than an incorrect one. That’s intentional: a blank is more useful than a confident error, and it shows where there is still room to improve. If you’d like to help close those gaps for your own institution, you can look for works by your authors and filter to the ones with a null corresponding institution, then curate them directly.
- Coverage is strongest at the large publishers and lighter at the smallest ones, so factor that into any long-tail analysis.
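As a rough sketch of the gap-finding step in the first point, suppose you have already pulled your institution's works into a list of dicts using whatever query you normally run (fetching is not shown here). The helper below keeps only the works with no corresponding institution. It treats both a missing or null field and an empty list as a gap, since we are not assuming which of those a given export uses. The sample records and their IDs are invented.

```python
from typing import Any


def missing_corresponding_institution(works: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Return works whose corresponding_institution_ids is absent, null, or empty."""
    return [w for w in works if not w.get("corresponding_institution_ids")]


# Invented records standing in for works by your institution's authors.
my_works = [
    {
        "id": "https://openalex.org/W0000000001",
        "corresponding_institution_ids": ["https://openalex.org/I0000000001"],
    },
    {"id": "https://openalex.org/W0000000002", "corresponding_institution_ids": []},
    {"id": "https://openalex.org/W0000000003", "corresponding_institution_ids": None},
]

gaps = missing_corresponding_institution(my_works)
gap_ids = [w["id"] for w in gaps]
```

Here `gap_ids` holds the second and third works, which are the candidates to review and curate.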
For transformative-agreement tracking specifically, the combination of more correct corresponding authors and more correct corresponding institutions should give a cleaner picture of which articles belong to which institution, and a firmer footing for the analyses that support your negotiations.
Questions, or a case where it still looks wrong? We’d love to hear it: support@openalex.org.
Finally, our thanks again to the University of California and the California Digital Library for resourcing this work and for collaborating with us on a problem that helps the entire OpenAlex community. If your institution is keen to explore a similar collaboration with us, we’d love to hear from you: reach out to kyle@openalex.org.