Last week, we held our quarterly community town hall. Jason Priem (founder & CEO) walked through everything OpenAlex shipped in Q1 2026 and laid out the roadmap for Q2. If you’d like to watch the full recording, it’s on YouTube.
This post recaps the highlights for anyone who couldn’t make it.
A new kind of transparency
One thing we tried for the first time this quarter: instead of slides, the entire town hall was a walk-through of Markdown files in a public GitHub repo. The retrospective itself was generated by Claude Code from a single prompt pointed at our open-source repositories, our blog, and our internal job tracker (oxjobs). The prompt is right there at the top of the document — anyone can re-run it.
That seems like a small thing, but it represents something we’re really excited about. For two decades, the open community has been writing a check that says “if we keep building in the open, eventually the machines will get here and this will pay off.” That check is finally being cashed. Because OpenAlex is built in the open, you don’t have to wait for the next town hall to find out what we’ve been doing — you can run the same prompt next week, or next month, and see for yourself. That kind of legibility just isn’t available from closed databases.
Okay, on to the actual work.
Q1 in review
Alice — our biggest release since Walden
In February we shipped Alice, a sweeping update that touched search, content delivery, pricing, and docs:
- Semantic search (beta). Search by meaning, not just keywords — query neoplasm and get articles about cancer. Behind the scenes that’s a custom Elasticsearch vector index with 413M embeddings (including 197M title-only embeddings so even works without abstracts are searchable). Queries return in under 250ms.
- Advanced search. Proximity operators, exact matching, wildcards, and queries up to 8KB. Most of the syntax you’d use in legacy databases for a systematic review now works in OpenAlex.
- Content API. Direct access to 60M+ open-access PDFs and parsed GROBID TEI XML at predictable URLs like content.openalex.org/works/{id}.pdf. There’s a 62M-row Parquet manifest for bulk sync.
- Usage-based pricing. This one matters most for our long-term sustainability. We replaced blunt rate limits with a transparent credit system priced in dollars: $1/day free for individuals, plus Pro, Max, Member, and Partner tiers, and one-time Stripe top-ups for burst usage. Even in these early days, we’re tracking toward $100K+/year in metered revenue — and the vast majority of those charges are under $10. People who want to run a million searches can pay a few bucks for it; users exploring the database or with lower volume data needs still get a free daily allowance.
- Completely rebuilt documentation at developers.openalex.org, built on Mintlify. It’s optimized for agent legibility — humans aren’t really going to read documentation in 2026, but agents will, and we want our docs to be the ones they pull into context.
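To make the two developer-facing pieces above concrete, here is a minimal sketch of building request URLs for them. The content URL pattern is the one quoted in the post; the `search` query parameter is the standard OpenAlex works search. Any other parameter names should be treated as assumptions, not documented API.

```python
# Sketch: constructing OpenAlex request URLs for search and content delivery.
from urllib.parse import urlencode

API_BASE = "https://api.openalex.org"
CONTENT_BASE = "https://content.openalex.org"

def works_search_url(query: str, per_page: int = 25) -> str:
    """Search works by text (e.g. 'neoplasm' to find cancer articles)."""
    return f"{API_BASE}/works?{urlencode({'search': query, 'per-page': per_page})}"

def content_pdf_url(work_id: str) -> str:
    """Predictable Content API URL for a work's open-access PDF."""
    return f"{CONTENT_BASE}/works/{work_id}.pdf"
```

Because the URLs are predictable, bulk consumers can generate them directly from the Parquet manifest instead of round-tripping through the API.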
Data quality: tens of millions of records improved
Less glamorous, just as important. Highlights:
- Repository attribution: corrected 44M works misattributed to the wrong repository source. Created 797 new source records, fixed 1,035 existing ones. The any_repository_has_fulltext flag had never actually been implemented — it now correctly flags 163M works.
- Affiliations: backfilled 23.9M affiliation strings missing from our lookup table, restoring institution IDs for ~20M works. Applied ~165K corrections from the French Ministry’s Works-Magnet tool.
- Abstracts: fixed a pipeline gap where ~3M landing-page abstracts (2024–2026) were extracted but never reached the API. Re-ran topic and SDG models on 1.27M works that gained abstracts.
- FWCI: replaced a stale external lookup with an inline calculation that guarantees average FWCI = 1.0 within any cohort. Fixed is_in_top_1_percent to be derived directly from citation percentiles.
- Topics: unified the topics pipeline and cleared a 17.9M-work backlog.
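The FWCI guarantee follows directly from the definition: if each work’s FWCI is its citation count divided by the cohort’s mean citation count, the cohort average is 1.0 by construction. A minimal sketch, with the actual cohort definition (field, year, work type) left as an assumption:

```python
# Sketch of the inline FWCI idea: normalize each work's citations by the
# cohort mean, so the cohort's average FWCI is exactly 1.0.
from statistics import mean

def inline_fwci(cohort_citations: list[int]) -> list[float]:
    expected = mean(cohort_citations)
    if expected == 0:
        return [0.0] * len(cohort_citations)  # degenerate cohort: nothing cited
    return [c / expected for c in cohort_citations]
```

Computing this inline, rather than joining against an external lookup that can go stale, is what makes the mean-of-1.0 property hold for every cohort.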
A lot of this was only practical because of our move to Walden, our new Databricks-based codebase. Fixes that used to take two weeks now take a day.
Funder & awards expansion
Backed by a $3.6M Wellcome grant, awards are now first-class objects in OpenAlex. We’ve ingested grants from 30+ funders worldwide — NIH, NSF, DOE, UKRI, Wellcome, ERC, DFG, ANR, SNSF, NSERC, CIHR, KAKEN/JSPS (873K grants), ARC, FAPESP, ANID, the Swedish Research Council, Gates, NWO, and more. A recent analysis compared our early progress with legacy databases — we’re closing the gap quickly and by end of quarter we expect to be the most comprehensive funder/awards database available.
Japanese repositories (IRDB)
We completed a contract deliverable to ingest ~4.6M records from IRDB via JPCOAR 2.0 / OAI-PMH, replacing 74 failing individual NII endpoints. Plus ~700 new Japanese repository source records and ~500 new OAI-PMH endpoints from OpenAIRE. Big step forward for our coverage of non-English scholarship.
Author disambiguation foundations (AER v4)
We didn’t fully finish what we set out to do here — author disambiguation is hard — but we laid serious groundwork:
- A new deterministic Python name parser (89–93% accuracy on a 15K gold standard). An interesting story here: we used AI to build a large gold standard, then iteratively prompted Claude to write a Python parser until it scored well against it. The end result is fully deterministic Python that would have taken months to write by hand.
- 106.7M author-level embeddings and 718M per-authorship similarity scores.
- ~3.2M overmerged author profiles split using raw ORCID conflicts.
- Fixed an author-sequence bug that was corrupting author ordering.
- Tightened the firewalls between ORCID data and other authorship signals — ORCID metadata is right ~95% of the time, but when older versions of our pipeline trusted it as a gold standard, the 5% of errors metastasized.
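To illustrate what “deterministic name parser” means in practice, here is a toy version that handles the two most common citation shapes (“Last, First M.” and “First M. Last”) with plain rules and no ML at inference time. The real parser covers far more cases; this is an illustration only.

```python
# Toy deterministic name parser in the spirit of the AER v4 work.
def parse_name(raw: str) -> dict[str, str]:
    raw = " ".join(raw.split())            # normalize whitespace
    if "," in raw:                         # "Last, First M."
        last, _, rest = raw.partition(",")
        return {"family": last.strip(), "given": rest.strip()}
    parts = raw.split(" ")                 # "First M. Last"
    return {"family": parts[-1], "given": " ".join(parts[:-1])}
```

The appeal of this shape is auditability: when a parse is wrong, you can point at the rule that fired, which is much harder with a learned model.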
Affiliation curation
A complete community-driven curation system, initially for Member institutions. Match/unmatch UI, role-based access, status tracking, and corrections that propagate to the API the next night. Curations that used to take months now land in ~24 hours, and we’ve already received one hundred thousand of them.
GUI modernization
- Novice / Expert modes with a one-click toggle — once you switch to Expert, you stay there.
- All 21 entity types now have first-class search, browse, zoom drawers, and CSV export (was 6).
- Sidebar nav, repository operator dashboard, accessibility audit (Vuetify upgrade, WCAG 2.1).
- New landing page positioning OpenAlex as “the universal research database — built for agents, scripts, and spreadsheets.”
A frank note on the GUI from Jason: we’re going to keep it healthy, but our long-term focus is the API and the data. Increasingly, the right interface for OpenAlex is the one your agent builds for you on demand.
Daily snapshots, OECD/FORD mapping, and infra
- Daily incremental snapshots for all 21 entity types (JSONL + Parquet, hash-based change detection).
- Mapped OpenAlex subfields to OECD FORD in collaboration with NORA — covers 38 of 42 two-digit FORD fields and feeds the upcoming Danish Research Portal.
- $135K/year in infrastructure savings by shutting down legacy Heroku apps, migrating Unpaywall to RDS, and consolidating services. Honestly, a lot of this was AI-assisted: “find me where we’re wasting money” turns out to be a great prompt when your codebase is open.
- A public status page at status.openalex.org, GitHub Actions CI/CD for Databricks, and a fresh POSI v2 self-assessment.
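The hash-based change detection behind the daily snapshots can be sketched simply: hash a canonical JSON serialization of each record, and emit only records whose hash differs from the previous day’s. Field names and the choice of SHA-256 here are assumptions for illustration.

```python
# Sketch of hash-based change detection for incremental snapshots.
import hashlib
import json

def record_hash(record: dict) -> str:
    # sort_keys + compact separators give a canonical serialization,
    # so the same record always hashes the same way.
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def changed(today: dict[str, dict], yesterday_hashes: dict[str, str]) -> list[str]:
    """IDs whose records are new or modified since the last snapshot."""
    return [rid for rid, rec in today.items()
            if yesterday_hashes.get(rid) != record_hash(rec)]
```

Storing only the per-record hashes from the prior day keeps the diff cheap even at hundreds of millions of records.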
Pricing & community
We collapsed our pricing to three clear tiers: Member, Member+, and Partner — with a PDF Sync add-on. The new $5K/year Member tier (admin dashboard, affiliation editor, Unsub access, CAB nomination rights) launched with the University of Victoria, joined by Université de Montréal, KISTI, University of Queensland, and Statistics Denmark.
Q2 roadmap: fewer things, deeper focus
This quarter we’re deliberately doing fewer big things. Two real focuses:
1. Author accuracy — finishing what we started
Roughly a six-week push. The work packages:
- ORCID-driven merges of incorrectly split author profiles (splits are largely done; merges are next).
- Exposing raw_orcid in the works API — if this has been bothering you, you’ll be happy.
- Name-based splits and joins, building on the new parser. This is where we expect to clear out the most egregious errors — the cases where two people with totally different names got clustered together because the algorithm weighted other features over the name itself. Those situations are going to go away.
- Curation UI for authors and institutions to fix their own profiles. We’ve actually built this three times and walked away from it — it touches everything in the system and has to propagate fast. We’re going to land it this quarter. Long-term this should also be drivable from agents (“hey, look me up in OpenAlex and fix the things that aren’t me”), but we need the UI first.
- A “fancy” ML-based split algorithm is out of scope — there’s enough low-hanging fruit that combining the basic improvements with self-curation will get us most of the way there.
2. Data quality — raking the lawn
Driven by the thousands of Zendesk tickets the community has filed. No single one of these is huge, but collectively they define who we are. Examples we’re working through:
- arXiv bugs (language detection, OAI-PMH locations, missing PDFs)
- Abstract coverage gaps
- Missing institution.country_code values
- Works with publication_year > 2026 (should be null)
- Crosswalks for the many idiosyncratic repository work.type taxonomies
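A work.type crosswalk is essentially a per-repository lookup table from each source’s idiosyncratic type strings to OpenAlex types. A minimal sketch — every mapping below is a made-up example, not the actual curated crosswalk:

```python
# Illustration of a per-repository work.type crosswalk.
CROSSWALK: dict[str, dict[str, str]] = {
    "example-repo": {
        "journal contribution": "article",
        "conference item": "proceedings-article",
        "thesis (phd)": "dissertation",
    },
}

def map_work_type(source: str, raw_type: str, default: str = "other") -> str:
    """Normalize the raw string, look it up, and fall back when unmapped."""
    return CROSSWALK.get(source, {}).get(raw_type.strip().lower(), default)
```

Keeping the fallback explicit means unmapped strings surface as a reviewable bucket rather than silently disappearing.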
We’re hoping to report dozens of these fixed by next town hall.
Process & community
- Semi-automated ticket solving with AI in the loop — the guardrails matter, but the productivity wins are real.
- More work on oxjobs, our homegrown issue tracker, plus internal QA dashboards.
- London funder workshop next week, hosted at Wellcome and bringing funders from around the world together. Tied to the Wellcome grant.
A few themes worth calling out
AI is multiplying our throughput. This was the busiest quarter we’ve had, by a lot, and we’re a small team. Walden + AI tooling is the reason. The quarterly retro itself was AI-generated from our open repos. We don’t think AI replaces the work — but it does mean a small open team can punch well above its weight.
Open data is the foundation, not the application. What’s exciting right now isn’t that AI exists — it’s that AI plus open data finally decouples intelligence from data. You can swap in whichever model you want over OpenAlex. We’ve spent twenty years building toward this moment.
We listen to what the community pays for. Talk is cheap. When someone signs up at the Member tier, that’s a real signal about what’s working. Conversely, semantic search has gotten less excitement than we’d hoped given its cost (~$10K/month to serve) — so if you’re using it and want it to do more (custom vectors, higher result limits), please tell us. Every request is a vote.
Get involved
- Browse the Q1 retro and Q2 roadmap directly.
- Watch the town hall recording.
- Have a feature request or bug? File a ticket — support@openalex.org. Even if we can’t get to it this quarter, every report is a vote that helps prioritize the next one.
- Member institutions: try the new affiliation curator.
- Everyone: try developers.openalex.org with your favorite agent. Let it write your queries.