Recognition: 2 theorem links · Lean Theorem
OpenAlex: A fully-open index of scholarly works, authors, venues, institutions, and concepts
Pith reviewed 2026-05-16 06:37 UTC · model grok-4.3
The pith
OpenAlex supplies a free, fully open scientific knowledge graph with metadata on 209 million works to replace the discontinued Microsoft Academic Graph.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
OpenAlex is a new, fully-open scientific knowledge graph launched to replace the discontinued Microsoft Academic Graph. It contains metadata for 209M works, 2013M disambiguated authors, 124k venues, 109k institutions, and 65k Wikidata concepts linked to works via an automated hierarchical multi-tag classifier. The dataset is available via a web-based GUI, a full data dump, and a high-volume REST API, with ongoing work to improve citation information and entity disambiguation.
What carries the argument
The OpenAlex knowledge graph, which connects works to disambiguated authors and institutions, venues, and Wikidata concepts through an automated hierarchical multi-tag classifier.
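To make "hierarchical multi-tag" concrete, the following is a minimal sketch and not OpenAlex's actual classifier. It assumes some upstream model has already scored a work against individual concepts, and it only illustrates the hierarchical step: every tag that is kept also brings in its ancestors. The toy hierarchy, the scores, the threshold, and the names `PARENT`, `ancestors`, and `tag_work` are all hypothetical.

```python
from typing import Dict, List, Optional

# Hypothetical toy hierarchy: child concept -> parent concept (None at the root).
PARENT: Dict[str, Optional[str]] = {
    "Science": None,
    "Computer science": "Science",
    "Machine learning": "Computer science",
    "Information retrieval": "Computer science",
    "Biology": "Science",
}


def ancestors(concept: str) -> List[str]:
    """Return the ancestors of `concept`, nearest first."""
    chain: List[str] = []
    parent = PARENT.get(concept)
    while parent is not None:
        chain.append(parent)
        parent = PARENT.get(parent)
    return chain


def tag_work(scores: Dict[str, float], threshold: float = 0.5) -> Dict[str, float]:
    """Keep concepts scoring at least `threshold`, then add their ancestors.

    An ancestor inherits the highest score among its kept descendants, so one
    work can carry several tags at several levels of the hierarchy.
    """
    tags: Dict[str, float] = {}
    for concept, score in scores.items():
        if score < threshold:
            continue  # too weak to count as a tag
        for name in [concept] + ancestors(concept):
            tags[name] = max(tags.get(name, 0.0), score)
    return tags


# Made-up scores for a single work.
example = tag_work(
    {"Machine learning": 0.9, "Information retrieval": 0.6, "Biology": 0.2}
)
```

In this toy run the work keeps two leaf tags and inherits "Computer science" and "Science" from them, while "Biology" falls below the threshold and is dropped.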
If this is right
- Any researcher can download or query the full citation network and author records without licenses or fees.
- New tools for science mapping and impact measurement can be built directly on the public data.
- Institutions gain the ability to track their publication output using open rather than proprietary sources.
- Analyses of research trends across disciplines become feasible at the scale previously limited to closed datasets.
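As a rough illustration of what querying the graph without licenses or fees could look like, the sketch below only builds a request URL and makes no network call. The base URL, the `/works` path, and the `filter`, `per-page`, and `mailto` parameter names are assumptions about the public REST API and do not come from the paper; check them against the current API documentation before relying on them. The helper name `build_works_query` is hypothetical.

```python
from urllib.parse import urlencode

# Assumption: base URL of the REST API (not stated in the source text).
BASE_URL = "https://api.openalex.org"


def build_works_query(filters: dict, per_page: int = 25, mailto: str = "") -> str:
    """Build a works-query URL from attribute filters, without sending it.

    `filters` maps an attribute name to a required value; pairs are joined
    as "name:value" and separated by commas (an assumed filter syntax).
    """
    params = {
        "filter": ",".join(f"{key}:{value}" for key, value in sorted(filters.items())),
        "per-page": per_page,
    }
    if mailto:
        # Assumed optional contact parameter for polite, high-volume use.
        params["mailto"] = mailto
    # Keep ':' and ',' unescaped so the filter expression stays readable.
    return f"{BASE_URL}/works?{urlencode(params, safe=':,')}"


url = build_works_query({"publication_year": 2021, "is_oa": "true"}, per_page=50)
```

The resulting string can be passed to any HTTP client. Keeping URL construction separate from the request makes the query easy to log and to test offline.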
Where Pith is reading between the lines
- Sustained community contributions could expand the graph beyond its current automated tagging to include more fine-grained topic links.
- If the API remains stable and high-volume, it could support real-time dashboards that monitor emerging research fields.
- The explicit link to Wikidata concepts opens the possibility of cross-walking OpenAlex records with other open knowledge bases for richer semantic queries.
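The cross-walking idea in the last bullet can be sketched as a join on Wikidata identifiers. This is a hedged illustration only: the record shape (an `id` field plus a `wikidata` field holding a URL), the placeholder identifiers, and the external table are hypothetical and are not taken from the paper.

```python
from typing import Dict, Iterable, List, Tuple


def wikidata_qid(url: str) -> str:
    """Return the trailing Q-identifier of a Wikidata URL, or '' if absent."""
    tail = url.rstrip("/").rsplit("/", 1)[-1]
    return tail if tail.startswith("Q") and tail[1:].isdigit() else ""


def crosswalk(
    concepts: Iterable[dict], external: Dict[str, str]
) -> List[Tuple[str, str, str]]:
    """Join concept records to an external table keyed by Wikidata QID."""
    rows: List[Tuple[str, str, str]] = []
    for concept in concepts:
        qid = wikidata_qid(concept.get("wikidata", ""))
        if qid and qid in external:
            rows.append((concept["id"], qid, external[qid]))
    return rows


# Placeholder records; identifiers are invented for illustration.
concepts = [
    {"id": "C1", "wikidata": "https://www.wikidata.org/wiki/Q100"},
    {"id": "C2", "wikidata": "https://www.wikidata.org/wiki/Q200"},
    {"id": "C3", "wikidata": ""},
]
external = {"Q100": "ext:alpha", "Q999": "ext:omega"}

links = crosswalk(concepts, external)
```

Only the first record is linked here, because it is the only one whose QID also appears in the external table. Records with a missing or malformed Wikidata URL are skipped rather than guessed.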
Load-bearing premise
The automated classifier and disambiguation routines produce data accurate and complete enough to serve as a practical replacement for the discontinued Microsoft Academic Graph.
What would settle it
A side-by-side audit that finds OpenAlex omits a large share of known works or shows substantially higher error rates in author and institution matching than the prior graph would show the replacement claim does not hold.
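The two quantities such an audit would compare can be written down in a few lines. The sketch below is one possible formulation under stated assumptions: coverage is the share of a trusted list of known works that the index contains, and matching error is the share of gold-labelled records that a graph assigns to the wrong entity or leaves unassigned. The identifiers and data are invented for illustration, and the function names are hypothetical.

```python
from typing import Dict, Set


def coverage(reference_ids: Set[str], candidate_ids: Set[str]) -> float:
    """Share of reference works that also appear in the candidate index."""
    if not reference_ids:
        raise ValueError("reference set is empty")
    return len(reference_ids & candidate_ids) / len(reference_ids)


def match_error_rate(gold: Dict[str, str], predicted: Dict[str, str]) -> float:
    """Share of gold-labelled records given a wrong or missing entity."""
    if not gold:
        raise ValueError("gold labels are empty")
    wrong = sum(1 for record, entity in gold.items() if predicted.get(record) != entity)
    return wrong / len(gold)


# Invented toy data: four known works, three of which are indexed.
known_works = {"w1", "w2", "w3", "w4"}
indexed_works = {"w1", "w2", "w4", "w9"}
cov = coverage(known_works, indexed_works)

# Invented author-matching labels for the new graph and the prior graph.
gold_authors = {"r1": "a1", "r2": "a2", "r3": "a3", "r4": "a4"}
new_graph = {"r1": "a1", "r2": "a2", "r3": "aX"}  # r3 wrong, r4 missing
old_graph = {"r1": "a1", "r2": "a2", "r3": "a3", "r4": "a4"}

err_new = match_error_rate(gold_authors, new_graph)
err_old = match_error_rate(gold_authors, old_graph)
```

In a real audit the replacement claim would be in doubt if `cov` were low or if `err_new` were substantially above `err_old`. What counts as "large" or "substantially higher" is a judgment the source does not quantify.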
Original abstract
OpenAlex is a new, fully-open scientific knowledge graph (SKG), launched to replace the discontinued Microsoft Academic Graph (MAG). It contains metadata for 209M works (journal articles, books, etc); 2013M disambiguated authors; 124k venues (places that host works, such as journals and online repositories); 109k institutions; and 65k Wikidata concepts (linked to works via an automated hierarchical multi-tag classifier). The dataset is fully and freely available via a web-based GUI, a full data dump, and high-volume REST API. The resource is under active development and future work will improve accuracy and coverage of citation information and author/institution parsing and deduplication.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents OpenAlex as a new, fully-open scientific knowledge graph (SKG) launched to replace the discontinued Microsoft Academic Graph (MAG). It reports the dataset contents including metadata for 209M works, 2013M disambiguated authors, 124k venues, 109k institutions, and 65k Wikidata concepts linked via an automated hierarchical multi-tag classifier, with access provided through a web-based GUI, full data dump, and high-volume REST API. The resource is described as under active development, with future work planned to improve accuracy and coverage of citation information and author/institution parsing and deduplication.
Significance. If the underlying data processing achieves usable quality levels, OpenAlex would constitute a valuable large-scale open alternative to proprietary or discontinued scholarly indexes, supporting research in scientometrics, digital libraries, and related areas. The explicit provision of multiple access channels and the transparent note on ongoing development are strengths that increase the resource's practical utility and long-term potential impact.
minor comments (2)
- [Abstract] The figure '2013M' for authors reads literally as 2013 million, which is the same quantity as 2.013 billion and roughly ten times the reported number of works. Confirm the intended value and state it in standard notation or as an exact figure.
- The description of the automated hierarchical multi-tag classifier and the disambiguation steps would benefit from a brief, high-level overview of the approach and data sources, to help readers understand how the reported counts were obtained.
Simulated Author's Rebuttal
We thank the referee for the positive assessment of the manuscript, the recognition of OpenAlex's potential value as a fully open scholarly knowledge graph, and the recommendation for minor revision. We note that the report contains no specific major comments requiring point-by-point rebuttal.
Circularity Check
No significant circularity
Full rationale
The paper is a descriptive announcement of a constructed open dataset (OpenAlex) with stated counts and access methods. It contains no mathematical derivations, equations, fitted parameters, predictions, or uniqueness theorems. All core claims rest on external data sources (e.g., Wikidata concepts, prior MAG data) and processing pipelines whose accuracy is explicitly noted as future work rather than asserted by self-reference. No load-bearing step reduces to a self-citation chain or input-by-construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: An automated hierarchical multi-tag classifier can reliably link works to 65k Wikidata concepts.
Lean theorems connected to this paper
- IndisputableMonolith.Cost.FunctionalEquation · washburn_uniqueness_aczel (unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: The dataset is fully and freely available via a web-based GUI, a full data dump, and high-volume REST API.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 20 Pith papers
- Beyond coauthorship: semantic structure and phantom collaborators in transportation research, 1967--2025
  Phantom collaborators, topically similar authors distant in the coauthor graph, become actual coauthors 16-33 times more often than baselines, with a 68-fold similarity gradient.
- A Large-Scale, Cross-Disciplinary Corpus of Systematic Reviews
  A new corpus of 301,871 systematic reviews across all sciences is released with extracted method artifacts to support retrieval benchmarking and meta-research.
- Market Dynamics, Governance and Open Research Metadata in the AI Era
  The innovation annulus is a functional, persistent feature of scholarly metadata production whose width reflects production inefficiency, reshaped by AI and best managed through calibrated governance analogous to opti...
- AI scientists produce results without reasoning scientifically
  LLM agents execute scientific tasks but fail to follow core scientific reasoning norms such as evidence consideration and belief revision based on refutations.
- Camyla: Scaling Autonomous Research in Medical Image Segmentation
  Camyla autonomously generates research proposals, experiments, and manuscripts in medical image segmentation, outperforming baselines on 24 of 31 recent datasets while producing 40 human-reviewed papers.
- Scalable Agentic Reasoning for Designing Biologics Targeting Intrinsically Disordered Proteins
  StructBioReasoner is a scalable multi-agent system that designs IDP-targeting biologics, with over 50% of 787 candidates for Der f 21 showing better binding free energy than human-designed references.
- Faculty mobility reallocates research capacity within persistent institutional hierarchies
  Faculty mobility follows a persistent institutional prestige hierarchy but yields little evidence of lasting improvements in movers' research productivity or citation impact.
- Intern-Atlas: A Methodological Evolution Graph as Research Infrastructure for AI Scientists
  Intern-Atlas constructs a methodological evolution graph with 9.4 million edges from 1.03 million AI papers to capture how methods emerge, adapt, and transition, enabling better idea evaluation and generation for AI-d...
- CiteRadar: A Citation Intelligence Platform for Researcher Profiling and Geographic Visualization
  CiteRadar is a new open-source pipeline that enriches Google Scholar citations using five external data sources and produces ranked tables plus an offline interactive geographic map from a single command.
- AI-assisted writing and the reorganization of scientific knowledge
  Post-2023, AI-assisted writing intensity positively associates with scientific disruption but shows weakened links to cross-field citation breadth and attenuated negative links to citation concentration.
- AgentSPEX: An Agent SPecification and EXecution Language
  AgentSPEX is a new language and harness for explicitly specifying and running structured LLM-agent workflows with typed steps, control flow, parallel execution, and a visual editor.
- Towards grounded autonomous research: an end-to-end LLM mini research loop on published computational physics
  An LLM agent autonomously runs read-plan-compute-compare loops on 111 computational physics papers, raising substantive concerns in 42% of them (97.7% only after execution), and generates a full publishable Comment re...
- Structural Diversity Drives Disruptive Scientific Innovation
  Structural diversity in a team's prior collaboration network predicts disruptive scientific innovation more strongly than team freshness or edge density and turns large team size from a liability into an advantage via...
- pAI/MSc: ML Theory Research with Humans on the Loop
  pAI/MSc is a customizable multi-agent system that reduces human steering by orders of magnitude when turning a hypothesis into a literature-grounded, mathematically established, experimentally supported manuscript dra...
- Scientific tools and Innovation: Big Science Facilities Yield More Novel and Interdisciplinary Knowledge
  Big Science Facilities produce publications with greater recombinant novelty and interdisciplinary integration than matched controls, with stronger effects in fields outside their traditional physical-sciences focus.
- Polarization and Integration in Global AI Research
  Over three decades, global AI research has polarized into US and China poles, with UK/Germany aligning with US, some Europeans with both, and developing countries with China.
- Can We Still Hear the Accent? Investigating the Resilience of Native Language Signals in the LLM Era
  NLI accuracy on research papers declined steadily over time, with Chinese and French showing unexpected resistance while Japanese and Korean declined more sharply in the post-LLM era.
- Mapping the Landscape of Open Access Dashboards -- A Dataset for Research and Infrastructure Development
  A survey identifies nearly 60 open access dashboards and supplies a structured metadata dataset plus community contribution process for open science research.
- Construction of a Battery Research Knowledge Graph using a Global Open Catalog
  A pipeline builds a battery research knowledge graph from 189k OpenAlex papers using author vectors weighted by OpenAlex concepts, KeyBERT/ChatGPT keyphrases, authorship position, and recency, then serializes it as RD...
- Auditing automated research assessment: an interpretable machine learning approach to validate funding criteria
  ML models show Brazilian PQ grant levels are predicted well by a small set of bibliographic and supervision features but not by the full set of official criteria.
Reference graph
Works this paper leans on
-
[1]
The OpenAlex project was created to address this concern
OpenAlex: A fully-open index of scholarly works, authors, venues, institutions, and concepts Jason Priem*, Heather Piwowar*, Richard Orr* *jason@ourresearch.org; heather@ourresearch.org; richard@ourresearch.org OurResearch, 500 Westover Dr #8234, Sanford, NC, 27330 (USA) Introduction In May 2021, Microsoft announced that it was discontinuing support for Mic...
work page 2021
-
[2]
Although still in its nascency, as a fully-open (100% open data, open API, open-source code) source of scholarly metadata, OpenAlex has potential to improve the transparency of research evaluation, navigation, representation, and discovery, adding to the growing list of other open and partly-open SKGs such as OpenCitations (Peroni, Shotton, & Vitali, 2017), ...
work page 2017
-
[3]
which provide guidance for sustainably open development. STI Conference 2022 · Granada Limitations and future work The OpenAlex project is still quite young, and there are many areas for improvement. Foremost is continued improvement in the parsing, normalisation, and disambiguation of entities, especially authors and institutions. This is particularly import...