Unraveling the Ai2 Asta Scholarly Research Assistant Citation System
Pith reviewed 2026-06-27 18:39 UTC · model grok-4.3
The pith
Ai2 Asta's citation system selects references with notable instability across identical queries and shows low concordance with its own retrieved documents.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Ai2 Asta produces reports with high citation intensity and a diverse but concentrated set of venues. However, the composition of cited references varies across identical queries and shows little concordance with the documents retrieved during the process. This points to additional opaque selection mechanisms during report generation that undermine reproducibility and transparency in quantitative science studies.
What carries the argument
The citation selection process inside Ai2 Asta's Summarise Literature feature, which decouples final cited references from both query results and retrieval steps.
If this is right
- Reports integrate numerous in-text citations grounded in retrieved evidence.
- Cited venues form a diverse yet concentrated set.
- Reference lists change across identical queries, reducing reproducibility.
- Retrieved documents and ultimately cited ones often fail to match, implying extra selection steps.
- These traits create challenges for quantitative science studies that rely on stable citation data.
Where Pith is reading between the lines
- Users who treat Asta outputs as stable bibliometric sources may need to run repeated queries and cross-check references manually.
- Similar hidden selection layers could appear in other AI research assistants, suggesting a need for comparative tests across tools.
- Quantitative studies of AI-generated literature may have to model citation variability as an additional source of noise.
- Developers could reduce opacity by exposing the full retrieval-to-citation pipeline or by making selection rules deterministic.
Load-bearing premise
Two independent rounds of data collection on ten domain-specific queries are enough to establish the existence and character of instability and opacity.
What would settle it
Running the same ten queries many more times and finding that the exact set of cited references stays identical while matching the retrieved documents in every case would falsify the instability and opacity claim.
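The falsification test above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the paper's procedure: each run is assumed to be reduced to two sets of document identifiers (cited and retrieved), "matching the retrieved documents" is read as every cited reference appearing among the retrieved ones, and the names `check_query` and `would_falsify` are hypothetical.

```python
from typing import Dict, List, Set, Tuple

# One run of one query: (cited identifiers, retrieved identifiers).
# How identifiers are derived (DOI, normalized title, ...) is left open here.
Run = Tuple[Set[str], Set[str]]


def check_query(runs: List[Run]) -> Dict[str, bool]:
    """Check stability and concordance across repeated runs of a single query."""
    if not runs:
        raise ValueError("at least one run is required")
    # Stable: every run cites exactly the same set of references.
    distinct_cited_sets = {frozenset(cited) for cited, _ in runs}
    stable = len(distinct_cited_sets) == 1
    # Concordant (assumed reading): each cited reference was also retrieved.
    concordant = all(cited <= retrieved for cited, retrieved in runs)
    return {"stable": stable, "concordant": concordant}


def would_falsify(runs_by_query: Dict[str, List[Run]]) -> bool:
    """True only if every query is both stable and concordant in every run."""
    results = [check_query(runs) for runs in runs_by_query.values()]
    return all(r["stable"] and r["concordant"] for r in results)
```

A single unstable or non-concordant query is enough for `would_falsify` to return `False`, which mirrors the "in every case" wording of the test.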
Despite the growing integration of Deep Research tools into academic workflows, empirical evidence on the operation, stability, and potential biases of their citation systems remains scarce. This study addresses this gap by evaluating the intensity, consistency, and bibliographic characteristics of references cited in the literature reports generated by Ai2 Asta, with the aim of understanding how its citation system operates and assessing its implications for scholarly communication. To this end, ten domain-specific queries were submitted to Asta's Summarise Literature feature, and two independent rounds of data collection were conducted. From each report, in-text citations, cited references, as well as other metrics related to the response process were extracted and examined. The results reveal high citation intensity, with reports integrating numerous in-text citations grounded in retrieved evidence and a diverse yet concentrated set of venues. However, notable instability is observed in the composition of cited references across identical queries, alongside a lack of concordance between retrieved documents and those ultimately cited, suggesting additional opaque selection mechanisms during report generation. These findings indicate that, while Ai2 Asta produces well-structured and quality reports, its instability and opacity in the citation process pose challenges in quantitative science studies due to their lack of reproducibility and transparency. Despite the restricted number of queries and disciplinary scope, the results offer valuable insights for researchers, bibliometricians, developers, and research evaluators seeking to understand, use or regulate AI-based scholarly assistants responsibly.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript reports an observational study of Ai2 Asta's Summarise Literature feature. Ten domain-specific queries were submitted in two independent rounds; in-text citations, cited references, and process metrics were extracted from the generated reports. The authors observe high citation intensity with reports drawing on numerous retrieved sources and a diverse but venue-concentrated reference set. They also report instability in the composition of cited references across the two rounds on identical queries and a lack of concordance between documents retrieved during the process and those ultimately cited, which they interpret as evidence of additional opaque selection mechanisms. The study concludes that these properties undermine reproducibility and transparency for quantitative science studies while still producing well-structured reports.
Significance. If the reported patterns of instability and opacity are confirmed, the work supplies one of the first empirical descriptions of citation behavior inside a deployed AI scholarly assistant. This is directly relevant to bibliometricians, tool developers, and research evaluators concerned with reproducibility and transparency in AI-mediated literature synthesis. The purely observational design avoids circularity or fitted parameters and records tool outputs directly.
major comments (2)
- [Abstract, §3] Abstract and §3 (data collection): the central claims of 'notable instability' in cited-reference composition and 'lack of concordance' between retrieved and cited documents rest on a sample of only ten queries run in two rounds. No set-overlap statistics, variance estimates, or controls for query phrasing are reported, so it is unclear whether the observed differences exceed normal retrieval variation. This sample-size limitation is load-bearing for the general claims about opacity and reproducibility challenges.
- [§4] §4 (results): the manuscript states that concordance was examined but supplies no explicit measurement protocol (e.g., exact matching criteria, handling of DOIs vs. titles, or threshold for 'lack of concordance'). Without this, the strength of the opacity inference cannot be assessed.
minor comments (2)
- [Abstract, §3] The abstract mentions 'other metrics related to the response process' but does not list them; a short table or bullet list in §3 would improve clarity.
- [§3] The disciplinary scope is described as 'domain-specific' but the ten queries are not enumerated; listing them (even in an appendix) would allow readers to judge representativeness.
Simulated Author's Rebuttal
We are grateful to the referee for the thoughtful comments, which help improve the clarity and rigor of our manuscript. We respond to each major comment below and indicate the revisions we will undertake.
- Referee: [Abstract, §3] Abstract and §3 (data collection): the central claims of 'notable instability' in cited-reference composition and 'lack of concordance' between retrieved and cited documents rest on a sample of only ten queries run in two rounds. No set-overlap statistics, variance estimates, or controls for query phrasing are reported, so it is unclear whether the observed differences exceed normal retrieval variation. This sample-size limitation is load-bearing for the general claims about opacity and reproducibility challenges.
Authors: The manuscript already notes the restricted number of queries and disciplinary scope as a limitation. The instability and lack of concordance were observed consistently across all ten queries under identical inputs, providing an initial empirical description rather than a statistically powered test. To strengthen the presentation, we will add set-overlap statistics (Jaccard index between cited-reference sets across the two rounds) in the revised §3 and §4. With only two rounds per query, formal variance estimates are not feasible, but the overlap metrics will quantify the observed differences. The study design deliberately held query phrasing constant to isolate stability under repeated identical queries; introducing phrasing controls would address a different research question. These additions will allow readers to evaluate the reproducibility implications directly from the data.
Revision: partial
- Referee: [§4] §4 (results): the manuscript states that concordance was examined but supplies no explicit measurement protocol (e.g., exact matching criteria, handling of DOIs vs. titles, or threshold for 'lack of concordance'). Without this, the strength of the opacity inference cannot be assessed.
Authors: We agree that an explicit measurement protocol is necessary for readers to assess the concordance findings. In the revised manuscript we will insert a dedicated paragraph in §4 that specifies the protocol used: documents were identified by DOI when present; otherwise titles were normalized (lowercase, punctuation and stopword removal) and matched via exact string equality or high string similarity; lack of concordance was recorded when a cited reference did not appear among the documents retrieved for that report. This description will make the basis for the opacity inference transparent and reproducible.
Revision: yes
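The two measurements promised in the responses above, a Jaccard index between rounds and a DOI-or-title concordance check, could look roughly like the following Python sketch. It is an illustration under assumptions, not the authors' code: the stopword list, the 0.9 similarity threshold, the use of `difflib.SequenceMatcher` as the similarity measure, the dictionary record layout with `doi` and `title` keys, and all function names are hypothetical choices that the rebuttal does not specify.

```python
import re
from difflib import SequenceMatcher
from typing import Dict, List, Optional

Doc = Dict[str, Optional[str]]  # assumed record shape: {"doi": ..., "title": ...}

# Assumed stopword list; the rebuttal does not say which list was used.
STOPWORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to"}


def normalize_title(title: str) -> str:
    """Lowercase, replace punctuation with spaces, drop stopwords."""
    cleaned = re.sub(r"[^\w\s]", " ", title.lower())
    return " ".join(w for w in cleaned.split() if w not in STOPWORDS)


def same_document(a: Doc, b: Doc, threshold: float = 0.9) -> bool:
    """Match by DOI when both records have one, otherwise by normalized title."""
    doi_a, doi_b = a.get("doi"), b.get("doi")
    if doi_a and doi_b:
        return doi_a.strip().lower() == doi_b.strip().lower()
    ta = normalize_title(a.get("title") or "")
    tb = normalize_title(b.get("title") or "")
    if not ta or not tb:
        return False  # nothing reliable to compare
    # Exact equality, or similarity at or above the assumed threshold.
    return ta == tb or SequenceMatcher(None, ta, tb).ratio() >= threshold


def unmatched_citations(cited: List[Doc], retrieved: List[Doc]) -> List[Doc]:
    """Cited references that do not appear among the retrieved documents."""
    return [c for c in cited
            if not any(same_document(c, r) for r in retrieved)]


def doc_key(doc: Doc) -> str:
    """Exact-match key: DOI if present, else normalized title."""
    if doc.get("doi"):
        return "doi:" + doc["doi"].strip().lower()
    return "title:" + normalize_title(doc.get("title") or "")


def jaccard(round1: List[Doc], round2: List[Doc]) -> float:
    """Jaccard index |A ∩ B| / |A ∪ B| over exact-match keys.

    Simplification: fuzzy title similarity is not used here, so near-duplicate
    titles count as different references. Two empty sets score 1.0 by convention.
    """
    keys1 = {doc_key(d) for d in round1}
    keys2 = {doc_key(d) for d in round2}
    union = keys1 | keys2
    return len(keys1 & keys2) / len(union) if union else 1.0
```

For a given query, `jaccard(round1_refs, round2_refs)` would quantify stability across the two rounds, and `unmatched_citations(cited, retrieved)` would list the references behind any recorded lack of concordance. The threshold for "high string similarity" is the main free choice, so results should be reported alongside the value used.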
Circularity Check
No circularity: purely observational empirical study
The paper performs direct data collection by submitting ten queries twice to Ai2 Asta's Summarise Literature feature, then extracts and tabulates in-text citations, references, and process metrics from the generated reports. No equations, parameters, derivations, or first-principles claims appear; the central observations of instability and lack of concordance are reported as raw empirical outcomes rather than reductions of any fitted model or self-citation chain. The limited sample size is a methodological limitation but does not create circularity.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: The ten domain-specific queries are representative enough to support general statements about the tool's citation behavior.
- domain assumption: Two independent rounds of data collection are adequate to detect and characterize instability in reference composition.