arxiv: 2605.18752 · v1 · pith:WRFJYUNBnew · submitted 2026-05-18 · cs.IR · astro-ph.IM · cs.DL

Traditional statistical representations outperform generative AI in identifying expert peer reviewers

Pith reviewed 2026-05-20 07:41 UTC · model grok-4.3

classification: cs.IR · astro-ph.IM · cs.DL
keywords: peer review · expert identification · TF-IDF · large language models · information retrieval · astronomy · reviewer matching

The pith

Traditional statistical methods like TF-IDF identify expert peer reviewers for astronomy proposals more reliably than GPT-4o mini.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper evaluates methods for automatically identifying expert peer reviewers amid rising scientific submissions. It draws on the distributed peer review system of a major international astronomical observatory, treating proposal authorship as a proxy for domain expertise. The task is framed as an information retrieval problem, and six retrieval approaches are tested against one another. Term Frequency-Inverse Document Frequency places a true expert in the top 25 recommendations 79.5 percent of the time, while GPT-4o mini reaches only 51.5 percent. The performance gap arises because fine-grained subfield vocabulary is preserved in statistical vectors but smoothed away in generative models.

Core claim

Using proposal authorship as proxy ground truth for domain expertise, Term Frequency-Inverse Document Frequency successfully identifies a labeled expert within the top 25 recommendations 79.5 percent of the time. This rate exceeds the 51.5 percent achieved by GPT-4o mini. The advantage is traced to the preservation of fine-grained vocabulary needed to distinguish subfield expertise, which semantic smoothing in generative methods tends to obscure.

What carries the argument

Treating expert identification as an information retrieval problem and comparing sparse statistical vector representations such as Term Frequency-Inverse Document Frequency against generative language model outputs.
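
The sparse retrieval setup described above can be sketched as follows. This is a minimal illustration, assuming each candidate reviewer is represented by one text profile and that a proposal is ranked against those profiles by cosine similarity. The weighting variant (raw term frequency times a smoothed inverse document frequency), the whitespace tokenizer, and every name in the block are assumptions made for illustration, not the paper's implementation.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization; the paper's preprocessing may differ.
    return text.lower().split()


def build_idf(docs_tokens):
    # Smoothed inverse document frequency: log((1 + N) / (1 + df)) + 1.
    n_docs = len(docs_tokens)
    df = Counter(term for tokens in docs_tokens for term in set(tokens))
    return {term: math.log((1 + n_docs) / (1 + count)) + 1.0 for term, count in df.items()}


def tfidf_vector(tokens, idf):
    # Sparse vector as {term: weight}; terms unseen in the profile corpus are dropped.
    tf = Counter(tokens)
    return {term: freq * idf[term] for term, freq in tf.items() if term in idf}


def cosine(u, v):
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_reviewers(proposal_text, reviewer_profiles):
    # reviewer_profiles: {reviewer_id: profile_text}.
    # Returns reviewer ids sorted by descending similarity, ties broken by id.
    ids = sorted(reviewer_profiles)
    docs_tokens = [tokenize(reviewer_profiles[rid]) for rid in ids]
    idf = build_idf(docs_tokens)
    query = tfidf_vector(tokenize(proposal_text), idf)
    scores = {rid: cosine(query, tfidf_vector(toks, idf)) for rid, toks in zip(ids, docs_tokens)}
    return sorted(ids, key=lambda rid: (-scores[rid], rid))


# Toy profiles for illustration only.
profiles = {
    "r1": "type ia supernova spectra nebular phase",
    "r2": "protoplanetary disk dust continuum",
    "r3": "galaxy cluster weak lensing mass",
}
ranking = rank_reviewers("nebular spectra of a type ia supernova", profiles)
```

Because each vocabulary term keeps its own dimension, a rare subfield term such as "nebular" contributes directly to the score instead of being merged with broader neighbours, which is the property the paper credits for the method's advantage.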

If this is right

  • Statistical vector methods can be deployed for scalable reviewer matching without requiring large language model inference.
  • Generative models may systematically underperform when expertise distinctions depend on precise technical terminology rather than broad topical similarity.
  • Transparent statistical representations remain preferable for reproducible, auditable reviewer selection in specialized scientific domains.
  • Evaluation frameworks built on real proposal authorship data supply concrete benchmarks for testing future automated matching systems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same vocabulary-preservation advantage might appear in reviewer matching for other technical fields such as high-energy physics or computational biology.
  • Organizations could reduce computational costs by defaulting to statistical baselines before invoking generative models for reviewer tasks.
  • Hybrid pipelines that first rank with TF-IDF and then lightly rerank a shortlist with a language model could be tested as a practical next step.
  • The authorship proxy could be cross-checked against citation networks to measure how much the reported performance gap depends on that particular ground truth choice.
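
The hybrid pipeline suggested in the list above could look roughly like the sketch below. It assumes first-stage scores are already available from a sparse retriever and treats the reranker as a hypothetical callable that returns a relevance score in [0, 1]. No real language-model API is called; the blending weight and the shortlist size default are illustrative assumptions.

```python
def hybrid_rank(first_stage_scores, rerank_score, shortlist_size=25, blend=0.5):
    """Rank candidates with a cheap first stage, then rerank only the shortlist.

    first_stage_scores: {candidate_id: score}, e.g. from a TF-IDF retriever.
    rerank_score: hypothetical callable candidate_id -> float in [0, 1],
        standing in for a language-model relevance judgment.
    blend: weight given to the reranker inside the shortlist (an assumption).
    """
    ordered = sorted(first_stage_scores, key=lambda c: (-first_stage_scores[c], c))
    shortlist, tail = ordered[:shortlist_size], ordered[shortlist_size:]

    def blended(c):
        return (1 - blend) * first_stage_scores[c] + blend * rerank_score(c)

    reranked = sorted(shortlist, key=lambda c: (-blended(c), c))
    # Candidates outside the shortlist keep their first-stage order.
    return reranked + tail


scores = {"a": 0.9, "b": 0.8, "c": 0.1}
toy_reranker = {"a": 0.2, "b": 1.0, "c": 1.0}.get  # toy stand-in, not a real model
result = hybrid_rank(scores, toy_reranker, shortlist_size=2)
```

Only the shortlist is passed to the expensive scorer, so inference cost scales with the shortlist size rather than with the full reviewer pool.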

Load-bearing premise

Proposal authorship serves as a valid proxy ground truth for domain expertise in the distributed peer review system of the major international astronomical observatory.

What would settle it

A new evaluation on the same or similar observatory data that uses an independent ground truth such as citation overlap or self-reported expertise and finds the generative model ranking higher than TF-IDF would falsify the reported superiority.

Figures

Figures reproduced from arXiv: 2605.18752 by Ferdinando Patat, Jakub Klencki, John Carpenter, Louis-Gregory Strolger, Mario Malički, Tereza Jerabkova, Vicente Amado Olivo, Wolfgang Kerzendorf.

Figure 1: Similarity score matrices across methods for Period P110. Rows and columns are ...
Figure 2: Distribution of the rank assigned to the proposal-designated reviewer across 435 ...

Original abstract

The exponential growth of scientific submissions has strained the peer review system. Despite the rapidly expanding global pool of researchers, this unprecedented scale has rendered the previous approach of manual expert identification unfeasible. Therefore, institutions have naturally turned to Large Language Models (LLMs) to automate intricate processes like expert reviewer identification. However, the reliability of these new models in accurately identifying domain experts lacks rigorous evaluation. We conduct a comprehensive empirical evaluation of statistical and AI-driven expertise identification methodologies to benchmark their reliability and limitations. Framing expert identification as an information retrieval problem, we utilize the distributed peer review system of a major international astronomical observatory, where proposal authorship serves as our proxy ground truth for domain expertise. Evaluating six retrieval methodologies utilized across observatories and computer science conferences, we demonstrate that traditional statistical representations outperform generative AI. Specifically, Term Frequency-Inverse Document Frequency successfully identified a labeled expert within the top 25 recommendations 79.5% of the time, compared to 51.5% for GPT-4o mini. Our results highlight that distinguishing subfield expertise requires fine-grained vocabulary, which is obscured by the semantic smoothing in generative methods. By establishing a rigorous evaluation framework for automated peer review, we demonstrate that transparent and reproducible statistical representations still outperform computationally expensive LLMs in specialized scientific tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper evaluates six retrieval methodologies for identifying expert peer reviewers in a major international astronomical observatory's distributed peer review system. Framing the task as information retrieval, it uses proposal authorship as proxy ground truth for domain expertise and reports that TF-IDF identifies a labeled expert in the top 25 recommendations 79.5% of the time, compared to 51.5% for GPT-4o mini. The authors conclude that traditional statistical representations outperform generative AI due to better preservation of fine-grained subfield vocabulary.

Significance. If the central empirical comparison holds under a validated proxy, the result offers a practical benchmark showing that interpretable statistical methods can outperform LLMs on specialized scientific retrieval tasks. The work highlights a potential limitation of semantic smoothing in generative models for domain-specific expertise identification and provides a reproducible evaluation framework.

major comments (2)
  1. [Evaluation Framework / Abstract] The evaluation treats proposal authorship as proxy ground truth for domain expertise without reported independent validation, correlation analysis with reviewing suitability, or controls for confounds such as junior co-authors or proposal-specific focus. This assumption is load-bearing for interpreting the 79.5% vs 51.5% gap, as both TF-IDF and GPT-4o mini are scored against the same potentially noisy label (see abstract and evaluation setup).
  2. [Methods and Results] The abstract supplies concrete performance numbers but the manuscript provides no details on the full set of six methodologies, the exact retrieval algorithms, statistical significance tests, error bars, or how the proxy labels were constructed and split. These omissions prevent assessment of whether the reported superiority is robust (see abstract and results).
minor comments (2)
  1. [Evaluation Metrics] Clarify the exact definition of 'top 25 recommendations' and whether it is recall@25 or a ranked retrieval metric with ties broken consistently.
  2. [Related Work] Add a reference or brief description of the standard retrieval algorithms used across observatories and CS conferences for context.
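
The ambiguity raised in the first minor comment can be made concrete with a short sketch. It contrasts two readings of "top 25": hit@k (at least one labeled expert appears in the top k) and recall@k (the fraction of labeled experts that appear in the top k), with ties broken deterministically by candidate id. Which definition the paper uses is not stated in the material here; the function names and the tie-breaking rule are assumptions.

```python
def top_k(scores, k):
    # Deterministic ordering: higher score first, ties broken by candidate id.
    return sorted(scores, key=lambda c: (-scores[c], c))[:k]


def hit_at_k(scores, experts, k):
    # 1.0 if at least one labeled expert is in the top k, else 0.0.
    return 1.0 if set(top_k(scores, k)) & set(experts) else 0.0


def recall_at_k(scores, experts, k):
    # Fraction of labeled experts that appear in the top k.
    experts = set(experts)
    return len(set(top_k(scores, k)) & experts) / len(experts)


# Toy scores for illustration only; r1 and r2 are tied.
scores = {"r1": 0.7, "r2": 0.7, "r3": 0.4, "r4": 0.1}
experts = ["r2", "r4"]
```

With these toy scores, the tie between r1 and r2 decides whether the cutoff at k = 1 counts as a hit, which is why a fixed tie-breaking rule matters for reproducibility.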

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed comments. We address each major comment below with clarifications and proposed revisions where appropriate. Our goal is to ensure the evaluation framework and methodological details are presented with sufficient rigor and transparency.

Point-by-point responses
  1. Referee: [Evaluation Framework / Abstract] The evaluation treats proposal authorship as proxy ground truth for domain expertise without reported independent validation, correlation analysis with reviewing suitability, or controls for confounds such as junior co-authors or proposal-specific focus. This assumption is load-bearing for interpreting the 79.5% vs 51.5% gap, as both TF-IDF and GPT-4o mini are scored against the same potentially noisy label (see abstract and evaluation setup).

    Authors: We acknowledge that proposal authorship is a proxy rather than a direct measure of reviewing suitability. In the astronomy observatory context, however, successful proposal authorship requires demonstrated expertise in the specific subfield to obtain observing time, providing a direct and objective signal of domain knowledge that aligns with reviewer expertise needs. This proxy is standard in information retrieval studies of scientific expertise. Both methods are evaluated against identical labels, preserving the validity of the relative comparison (79.5% vs 51.5%). We will add a new subsection in the evaluation framework discussing the proxy's rationale, potential confounds such as junior co-authors or proposal focus, and limitations, including the absence of external validation data. revision: partial

  2. Referee: [Methods and Results] The abstract supplies concrete performance numbers but the manuscript provides no details on the full set of six methodologies, the exact retrieval algorithms, statistical significance tests, error bars, or how the proxy labels were constructed and split. These omissions prevent assessment of whether the reported superiority is robust (see abstract and results).

    Authors: The full manuscript describes the six methodologies (TF-IDF, BM25, and four others drawn from observatory and conference practices) and their implementation in the Methods section. We agree that additional details on statistical significance testing, error bars or confidence intervals, and the precise construction and train/test splitting of proxy labels would improve robustness assessment and reproducibility. We will expand the Results section to include these elements, such as p-values for performance differences and a detailed description of label construction from proposal authorship data. revision: yes
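
One way the confidence intervals mentioned in this response could be produced is a paired percentile bootstrap over proposals. The sketch below assumes per-proposal 0/1 outcomes (expert found in the top k or not) for two methods evaluated on the same proposals. The resample count, the percentile method, and the toy outcomes are illustrative assumptions and do not come from the paper.

```python
import random


def paired_bootstrap_diff(hits_a, hits_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for mean(hits_a) - mean(hits_b).

    hits_a, hits_b: per-proposal 0/1 outcomes for two methods on the SAME proposals.
    Returns (observed difference, lower bound, upper bound).
    """
    assert len(hits_a) == len(hits_b)
    n = len(hits_a)
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_resamples):
        # Resample proposal indices with replacement, keeping the pairing intact.
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(hits_a[i] - hits_b[i] for i in idx) / n)
    diffs.sort()
    lo = diffs[int((alpha / 2) * n_resamples)]
    hi = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    observed = (sum(hits_a) - sum(hits_b)) / n
    return observed, lo, hi


# Toy outcomes for illustration only; not data from the paper.
method_a = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1]
method_b = [1, 0, 1, 0, 0, 1, 0, 1, 0, 1]
observed, lo, hi = paired_bootstrap_diff(method_a, method_b)
```

Resampling proposals rather than individual outcomes preserves the pairing between methods, so the interval reflects the difference on the same inputs rather than two independent samples.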

Circularity Check

0 steps flagged

No circularity: empirical benchmark uses external proxy labels and standard retrieval algorithms

Full rationale

The paper presents a direct empirical comparison of retrieval methods (TF-IDF, other statistical representations, and GPT-4o mini) on an information-retrieval framing of expert identification. Proposal authorship from the observatory's distributed peer-review system is adopted as an external proxy ground truth; performance is measured by whether a labeled author appears in the top-25 recommendations. No equations, fitted parameters, or self-citations are invoked to derive the headline result (79.5% vs 51.5%). The evaluation therefore rests on independent data and reproducible algorithms rather than reducing to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the domain assumption that proposal authorship reliably indicates subfield expertise and on the standard definitions of TF-IDF and LLM-based retrieval.

axioms (1)
  • Domain assumption: proposal authorship is a valid proxy for domain expertise.
    Used as ground truth for evaluating all retrieval methods in the observatory dataset.

pith-pipeline@v0.9.0 · 5793 in / 1164 out tokens · 47474 ms · 2026-05-20T07:41:34.959573+00:00 · methodology

