Unraveling the Ai2 Asta Scholarly Research Assistant Citation System

Carlos Lopezosa; Enrique Orduña-Malea

arxiv: 2606.08301 · v1 · pith:FSB5ZEPSnew · submitted 2026-06-06 · 💻 cs.DL

Pith reviewed 2026-06-27 18:39 UTC · model grok-4.3

classification 💻 cs.DL

keywords Ai2 Asta · citation system · scholarly research assistant · reference instability · citation opacity · reproducibility · AI literature reports · bibliometric analysis

The pith

Ai2 Asta's citation system selects references with notable instability across identical queries and shows low concordance with its own retrieved documents.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests Ai2 Asta by submitting ten domain-specific queries twice and extracting the citations from the resulting literature reports. It finds that reports contain many in-text citations drawn from a concentrated set of venues, yet the exact references shift between runs and often fail to match the documents the system first retrieved. A sympathetic reader would care because these patterns point to hidden selection steps that make the output hard to reproduce or trace for bibliometric work. The authors conclude that the tool generates structured reports but its citation process lacks the transparency needed for reliable scholarly use.

Core claim

Ai2 Asta produces reports with high citation intensity and a diverse yet concentrated set of venues, yet the composition of cited references varies across identical queries and shows little concordance with the documents retrieved during the process, indicating additional opaque selection mechanisms during report generation that undermine reproducibility and transparency in quantitative science studies.

What carries the argument

The citation selection process inside Ai2 Asta's Summarise Literature feature, which decouples final cited references from both query results and retrieval steps.

If this is right

Reports integrate numerous in-text citations grounded in retrieved evidence.
Cited venues form a diverse yet concentrated set.
Reference lists change across identical queries, reducing reproducibility.
Retrieved documents and ultimately cited ones often fail to match, implying extra selection steps.
These traits create challenges for quantitative science studies that rely on stable citation data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Users who treat Asta outputs as stable bibliometric sources may need to run repeated queries and cross-check references manually.
Similar hidden selection layers could appear in other AI research assistants, suggesting a need for comparative tests across tools.
Quantitative studies of AI-generated literature may have to model citation variability as an additional source of noise.
Developers could reduce opacity by exposing the full retrieval-to-citation pipeline or by making selection rules deterministic.

Load-bearing premise

Two independent rounds of data collection on ten domain-specific queries are enough to establish the existence and character of instability and opacity.

What would settle it

Running the same ten queries many more times and finding that the exact set of cited references stays identical while matching the retrieved documents in every case would falsify the instability and opacity claim.
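
The falsification test above can be written as a small check. This is a minimal sketch, not the authors' procedure: the function name `claim_falsified`, the representation of a run as a `(retrieved_ids, cited_ids)` pair of sets, and the reading of "matching the retrieved documents" as "every cited reference is among the retrieved ones" are all assumptions made for illustration.

```python
from typing import Iterable


def claim_falsified(runs: Iterable[tuple[set[str], set[str]]]) -> bool:
    """Check repeated runs of one query against the instability/opacity claim.

    Each run is a hypothetical (retrieved_ids, cited_ids) pair.
    Returns True only if every run cites the identical set of references
    AND every cited reference appears among that run's retrieved documents.
    """
    runs = list(runs)
    if not runs:
        raise ValueError("need at least one run")
    first_cited = runs[0][1]
    # Stability: the cited set never changes between runs.
    same_citations = all(cited == first_cited for _, cited in runs)
    # Concordance (assumed reading): cited is a subset of retrieved in each run.
    all_matched = all(cited <= retrieved for retrieved, cited in runs)
    return same_citations and all_matched
```

A single run with a different cited set, or a single cited reference missing from the retrieved documents, is enough for the function to return False, which mirrors how strict the stated condition is.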

Original abstract

Despite the growing integration of Deep Research tools into academic workflows, empirical evidence on the operation, stability, and potential biases of their citation systems remains scarce. This study addresses this gap by evaluating the intensity, consistency, and bibliographic characteristics of references cited in the literature reports generated by Ai2 Asta, with the aim of understanding how its citation system operates and assessing its implications for scholarly communication. To this end, ten domain-specific queries were submitted to Asta's Summarise Literature feature, and two independent rounds of data collection were conducted. From each report, in-text citations, cited references, as well as other metrics related to the response process were extracted and examined. The results reveal high citation intensity, with reports integrating numerous in-text citations grounded in retrieved evidence and a diverse yet concentrated set of venues. However, notable instability is observed in the composition of cited references across identical queries, alongside a lack of concordance between retrieved documents and those ultimately cited, suggesting additional opaque selection mechanisms during report generation. These findings indicate that, while Ai2 Asta produces well-structured and quality reports, its instability and opacity in the citation process pose challenges in quantitative science studies due to their lack of reproducibility and transparency. Despite the restricted number of queries and disciplinary scope, the results offer valuable insights for researchers, bibliometricians, developers, and research evaluators seeking to understand, use or regulate AI-based scholarly assistants responsibly.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

A first look at Ai2 Asta citations shows instability across ten queries, but the sample is too small for firm conclusions.

The paper's main observation is that Ai2 Asta produces literature reports with high citation counts drawn from retrieved sources, yet the exact references change across repeated identical queries and some retrieved items never appear in the final citations. This points to extra selection steps that are not visible to the user.

What is new is the direct test of this specific tool. No earlier work has examined how Asta's Summarise Literature feature actually assembles its reference lists. The authors describe a clear protocol (ten domain-specific queries run in two independent rounds) and extract in-text citations, reference lists, and basic process metrics.

The work is honest about what it found. It notes the volume of citations, the range of venues, and the observed shifts without inflating the results. The mismatch between retrieval and citation is a useful flag for anyone who needs reproducible outputs from these assistants.

The soft spot is the narrow base. Ten queries in one disciplinary area, run only twice, with no overlap statistics, variance measures, or controls for phrasing, leave the instability claim preliminary. The differences could be normal retrieval noise rather than a systemic feature. The abstract gives no detail on how concordance was scored, so the strength of that part is hard to judge.

This is for bibliometricians and researchers who already use or evaluate AI research tools and want early evidence of reproducibility limits. It is not yet strong enough for broad claims about all such systems.

I would send it for peer review. The direct observations are worth a closer look, but the authors need to enlarge the query set and add quantitative metrics before the findings can carry much weight.

Referee Report

2 major / 2 minor

Summary. The manuscript reports an observational study of Ai2 Asta's Summarise Literature feature. Ten domain-specific queries were submitted in two independent rounds; in-text citations, cited references, and process metrics were extracted from the generated reports. The authors observe high citation intensity with reports drawing on numerous retrieved sources and a diverse but venue-concentrated reference set. They also report instability in the composition of cited references across the two rounds on identical queries and a lack of concordance between documents retrieved during the process and those ultimately cited, which they interpret as evidence of additional opaque selection mechanisms. The study concludes that these properties undermine reproducibility and transparency for quantitative science studies while still producing well-structured reports.

Significance. If the reported patterns of instability and opacity are confirmed, the work supplies one of the first empirical descriptions of citation behavior inside a deployed AI scholarly assistant. This is directly relevant to bibliometricians, tool developers, and research evaluators concerned with reproducibility and transparency in AI-mediated literature synthesis. The purely observational design avoids circularity or fitted parameters and records tool outputs directly.

major comments (2)

[Abstract, §3] Abstract and §3 (data collection): the central claims of 'notable instability' in cited-reference composition and 'lack of concordance' between retrieved and cited documents rest on a sample of only ten queries run in two rounds. No set-overlap statistics, variance estimates, or controls for query phrasing are reported, so it is unclear whether the observed differences exceed normal retrieval variation. This sample-size limitation is load-bearing for the general claims about opacity and reproducibility challenges.
[§4] §4 (results): the manuscript states that concordance was examined but supplies no explicit measurement protocol (e.g., exact matching criteria, handling of DOIs vs. titles, or threshold for 'lack of concordance'). Without this, the strength of the opacity inference cannot be assessed.

minor comments (2)

[Abstract, §3] The abstract mentions 'other metrics related to the response process' but does not list them; a short table or bullet list in §3 would improve clarity.
[§3] The disciplinary scope is described as 'domain-specific' but the ten queries are not enumerated; listing them (even in an appendix) would allow readers to judge representativeness.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We are grateful to the referee for the thoughtful comments, which help improve the clarity and rigor of our manuscript. We respond to each major comment below and indicate the revisions we will undertake.

Referee: [Abstract, §3] Abstract and §3 (data collection): the central claims of 'notable instability' in cited-reference composition and 'lack of concordance' between retrieved and cited documents rest on a sample of only ten queries run in two rounds. No set-overlap statistics, variance estimates, or controls for query phrasing are reported, so it is unclear whether the observed differences exceed normal retrieval variation. This sample-size limitation is load-bearing for the general claims about opacity and reproducibility challenges.

Authors: The manuscript already notes the restricted number of queries and disciplinary scope as a limitation. The instability and lack of concordance were observed consistently across all ten queries under identical inputs, providing an initial empirical description rather than a statistically powered test. To strengthen the presentation, we will add set-overlap statistics (Jaccard index between cited-reference sets across the two rounds) in the revised §3 and §4. With only two rounds per query, formal variance estimates are not feasible, but the overlap metrics will quantify the observed differences. The study design deliberately held query phrasing constant to isolate stability under repeated identical queries; introducing phrasing controls would address a different research question. These additions will allow readers to evaluate the reproducibility implications directly from the data. revision: partial
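
The Jaccard index mentioned in this response is the size of the intersection of two sets divided by the size of their union. A minimal sketch follows; the function name, the convention for two empty sets, and the example identifiers are illustrative assumptions and not data from the paper.

```python
def jaccard(a: set[str], b: set[str]) -> float:
    """Return |A ∩ B| / |A ∪ B|.

    Two empty sets are treated as identical (1.0); this is a convention
    chosen for the sketch, not something the source specifies.
    """
    union = a | b
    if not union:
        return 1.0
    return len(a & b) / len(union)


# Hypothetical cited-reference sets for one query in the two rounds.
round_1 = {"doi:10.1000/a", "doi:10.1000/b", "doi:10.1000/c"}
round_2 = {"doi:10.1000/b", "doi:10.1000/c", "doi:10.1000/d"}

overlap = jaccard(round_1, round_2)  # 2 shared / 4 distinct = 0.5
```

A value of 1.0 means both rounds cited exactly the same references, and 0.0 means they shared none.
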
Referee: [§4] §4 (results): the manuscript states that concordance was examined but supplies no explicit measurement protocol (e.g., exact matching criteria, handling of DOIs vs. titles, or threshold for 'lack of concordance'). Without this, the strength of the opacity inference cannot be assessed.

Authors: We agree that an explicit measurement protocol is necessary for readers to assess the concordance findings. In the revised manuscript we will insert a dedicated paragraph in §4 that specifies the protocol used: documents were identified by DOI when present; otherwise titles were normalized (lowercase, punctuation and stopword removal) and matched via exact string equality or high string similarity; lack of concordance was recorded when a cited reference did not appear among the documents retrieved for that report. This description will make the basis for the opacity inference transparent and reproducible. revision: yes
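
The matching protocol described in this response could look roughly like the sketch below. Several details are assumptions because the response does not give them: the stopword list, the 0.9 similarity threshold, the use of Python's `difflib.SequenceMatcher` as the similarity measure, the dictionary record layout, and the choice to compare DOIs only when both records carry one.

```python
import string
from difflib import SequenceMatcher

STOPWORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to"}  # assumed list
SIMILARITY_THRESHOLD = 0.9  # assumed cut-off for "high string similarity"


def normalize_title(title: str) -> str:
    """Lowercase, strip ASCII punctuation, and drop stopwords."""
    no_punct = title.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(w for w in no_punct.split() if w not in STOPWORDS)


def same_document(a: dict, b: dict) -> bool:
    """Match by DOI when both records have one, otherwise by normalized title."""
    doi_a, doi_b = a.get("doi"), b.get("doi")
    if doi_a and doi_b:
        return doi_a.lower() == doi_b.lower()
    ta = normalize_title(a.get("title", ""))
    tb = normalize_title(b.get("title", ""))
    if not ta or not tb:
        return False  # nothing reliable to compare
    return ta == tb or SequenceMatcher(None, ta, tb).ratio() >= SIMILARITY_THRESHOLD


def unmatched_citations(cited: list[dict], retrieved: list[dict]) -> list[dict]:
    """Cited references with no counterpart among the retrieved documents."""
    return [c for c in cited if not any(same_document(c, r) for r in retrieved)]
```

Under this sketch, a report shows "lack of concordance" whenever `unmatched_citations` returns a non-empty list. The threshold strongly affects that result, so it should be reported alongside any concordance figure.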

Circularity Check

0 steps flagged

No circularity: purely observational empirical study

The paper performs direct data collection by submitting ten queries twice to Ai2 Asta's Summarise Literature feature, then extracts and tabulates in-text citations, references, and process metrics from the generated reports. No equations, parameters, derivations, or first-principles claims appear; the central observations of instability and lack of concordance are reported as raw empirical outcomes rather than reductions of any fitted model or self-citation chain. The limited sample size is a methodological limitation but does not create circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on two domain assumptions about the adequacy of the query sample and the sufficiency of two collection rounds; no free parameters or invented entities are introduced.

axioms (2)

domain assumption: The ten domain-specific queries are representative enough to support general statements about the tool's citation behavior.
The abstract notes the restricted disciplinary scope yet draws broader conclusions about challenges for quantitative science studies.
domain assumption: Two independent rounds of data collection are adequate to detect and characterize instability in reference composition.
The instability claim is based solely on differences observed between these two rounds.
