pith. machine review for the scientific record.

arxiv: 2603.16659 · v3 · submitted 2026-03-17 · 💻 cs.AI · econ.GN · q-fin.EC

Recognition: no theorem link

LLMs learn scientific taste from institutional traces across the social sciences

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 09:51 UTC · model grok-4.3

classification 💻 cs.AI · econ.GN · q-fin.EC
keywords LLM evaluation · scientific judgment · social sciences · fine-tuning · publication outcomes · research pitches · evaluative taste

The pith

Fine-tuned LLMs learn to predict social science publication tiers from past outcomes, outperforming experts and frontier models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper asks whether records of what social science fields have actually published can train AI models to judge which new research ideas deserve attention. Across psychology, economics, sociology and five other disciplines, the authors built benchmarks of research pitches labeled by the tier at which they appeared in journals. They then applied supervised fine-tuning to smaller LLMs on those labels and tested the models on held-out pitches. The resulting models beat random guessing in every field and, in management, reached 59.2 percent accuracy against the expert reviewers' 41.6 percent majority vote. This result matters because many scientific decisions rest on evaluative taste rather than verifiable correctness, and publication histories appear to encode enough signal to scale that taste to machines.
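To make that training signal concrete, here is a minimal sketch of how a publication record could be turned into a supervised example. This is not the authors' pipeline: the prompt wording, the four tier names, and the toy pitches are illustrative assumptions; only the overall shape (metadata-stripped pitch in, observed publication tier out) comes from the paper.

```python
# Illustrative sketch only: tier names, prompt wording, and example pitches
# are hypothetical placeholders, not the paper's protocol.
import json

TIER_LABELS = {1: "exceptional", 2: "strong", 3: "fair", 4: "limited"}  # assumed 4-tier scheme

def make_sft_example(pitch_text: str, journal_tier: int) -> dict:
    """Turn one metadata-stripped research pitch and its observed publication
    tier into a supervised fine-tuning example (prompt -> tier label)."""
    prompt = (
        "You are evaluating a social science research pitch.\n"
        f"Pitch: {pitch_text}\n"
        "Which tier best matches the venue where this idea was published? "
        "Answer with one of: exceptional, strong, fair, limited."
    )
    return {"prompt": prompt, "completion": TIER_LABELS[journal_tier]}

records = [
    # (pitch rephrased from title + abstract, observed journal tier) -- invented examples
    ("Does team cognitive diversity predict venture survival after funding shocks?", 1),
    ("A replication of anchoring effects in consumer price judgments.", 3),
]

with open("management_sft.jsonl", "w") as f:
    for pitch, tier in records:
        f.write(json.dumps(make_sft_example(pitch, tier)) + "\n")
```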

Core claim

Institutional traces consisting of which research pitches were published at which tier supply a usable training signal that lets supervised fine-tuning turn LLMs into field-specific evaluators whose accuracy exceeds both expert majority votes and current frontier models.

What carries the argument

Supervised fine-tuning of LLMs on four-tier research-pitch benchmarks whose labels come directly from observed publication outcomes in each discipline.
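A minimal sketch of what that fine-tuning step could look like, assuming the Hugging Face TRL SFTTrainer and the prompt/completion file from the sketch above. The model name follows the paper (Qwen3-4B); the hyperparameters, output path, and dataset format are illustrative assumptions rather than the authors' actual configuration.

```python
# Hedged sketch of supervised fine-tuning on tier-labeled pitches.
# Assumes the TRL library; hyperparameters are placeholders, not the paper's.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("json", data_files="management_sft.jsonl", split="train")

config = SFTConfig(
    output_dir="qwen3-4b-management-taste",  # hypothetical output directory
    num_train_epochs=3,                      # placeholder schedule
    per_device_train_batch_size=4,
    learning_rate=2e-5,
)

trainer = SFTTrainer(
    model="Qwen/Qwen3-4B",     # checkpoint loaded by name; the paper's best single model
    train_dataset=dataset,     # prompt/completion pairs from the previous sketch
    args=config,
)
trainer.train()
```

The supervision is a plain next-token objective on the tier label; nothing field-specific enters except the publication outcomes encoded in the labels.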

If this is right

  • Fine-tuned models exceed the 25 percent random baseline in all eight disciplines tested.
  • In management the best model reaches 59.2 percent accuracy, 17.6 points above expert majority vote.
  • Model confidence rises on correct predictions and falls on errors, producing calibrated scores.
  • Restricting decisions to the highest-confidence subset yields very high accuracy in every field (a minimal triage sketch follows this list).
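The triage mechanism in the last bullet can be sketched in a few lines: sort predictions by the model's stated confidence, keep the top slice, and score accuracy on that retained subset. The toy numbers below are invented; only the mechanism mirrors the paper's claim.

```python
# Selective triage sketch: accuracy on the most-confident fraction of predictions.
import numpy as np

def coverage_accuracy(confidences, predictions, labels, keep_fraction=0.25):
    """Accuracy on the top `keep_fraction` most-confident predictions."""
    confidences = np.asarray(confidences)
    predictions = np.asarray(predictions)
    labels = np.asarray(labels)
    k = max(1, int(len(confidences) * keep_fraction))
    top = np.argsort(-confidences)[:k]          # indices of the highest-confidence cases
    return float((predictions[top] == labels[top]).mean())

# Invented four-tier predictions (0-3) with per-example confidence.
conf = [0.92, 0.41, 0.88, 0.55, 0.97, 0.33, 0.76, 0.60]
pred = [0, 2, 1, 3, 0, 1, 2, 2]
true = [0, 1, 1, 3, 0, 3, 2, 0]

print(coverage_accuracy(conf, pred, true, keep_fraction=0.5))  # accuracy on the confident half
```

If calibration holds as the paper reports, accuracy on the retained slice should rise as the kept fraction shrinks.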

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same publication-trace approach could be tested in natural-science fields that also keep tiered journal records.
  • High-confidence triage might be combined with human review to reduce total reviewer hours while preserving quality.
  • If the signal proves robust, it offers one concrete route to machine assistance in domains where reinforcement learning has no verifiable reward.

Load-bearing premise

Publication outcomes in the training data capture genuine field-specific judgments about idea quality rather than prestige, fashion, or gatekeeping effects.

What would settle it

Track a fresh batch of research pitches through actual journal submissions and rejections; if the fine-tuned model's tier predictions show no better-than-chance correlation with the real outcomes, the claim fails.
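One way such a prospective test could be scored, sketched below: correlate the model's tier predictions for the fresh pitches with the tiers those pitches actually reach, and compare the correlation against a permutation null. The choice of Spearman correlation and the toy data are assumptions of this sketch, not something the paper prescribes.

```python
# Sketch of the falsification test: does predicted tier track realized tier better than chance?
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)

predicted_tier = np.array([1, 3, 2, 4, 1, 2, 3, 4, 2, 1])  # model output (hypothetical)
realized_tier  = np.array([1, 4, 2, 3, 2, 2, 3, 4, 1, 1])  # observed journal outcomes (hypothetical)

rho, _ = spearmanr(predicted_tier, realized_tier)

# Permutation null: how often does a shuffled outcome vector match rho by chance?
null = [spearmanr(predicted_tier, rng.permutation(realized_tier))[0] for _ in range(10_000)]
p_value = float(np.mean(np.abs(null) >= abs(rho)))

print(f"Spearman rho = {rho:.2f}, permutation p = {p_value:.3f}")
# A rho indistinguishable from the permutation null would mean the claim fails.
```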

read the original abstract

Reinforcement-learned reasoning has powered recent AI leaps on verifiable tasks, including mathematics, code, and structure prediction. The harder bottleneck is evaluative judgment in low-verifiability domains, where no oracle anchors reward and the core question is which untested ideas deserve attention. We test whether institutional traces, the record of what fields published, where, and at which tier, can serve as a training signal for AI evaluators. Across eight social science disciplines (psychology, economics, communication, sociology, political science, management, business and finance, public administration), we built held-out four-tier research-pitch benchmarks and supervised-fine-tuned (SFT) LLMs on field-specific publication outcomes. The fine-tuned models cleared the 25 percent chance baseline and exceeded frontier-model performance by wide margins, with best single-model accuracy ranging from 55.0 percent in public administration to 85.5 percent in psychology. In management, evaluated against 48 expert gatekeepers, 174 junior researchers, and 11 frontier reasoning models, the best single fine-tuned model (Qwen3-4B) reached 59.2 percent, 17.6 percentage points above expert majority vote (41.6 percent, non-tied) and 28.1 percentage points above the frontier mean (31.1 percent). The fine-tuned models also showed calibrated confidence: confidence rose when predictions were correct and fell when wrong, mirroring how a skilled reviewer can say "I'm sure" versus "I'm guessing." Selective triage on this signal reached very high accuracy on the highest-confidence subsets in every field. Institutional traces, we conclude, encode a scalable training signal for the low-verifiability judgment on which science depends.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that LLMs can learn field-specific evaluative judgment ('scientific taste') by supervised fine-tuning on institutional publication traces (what gets published, where, and at what tier) across eight social sciences. It constructs held-out four-tier research-pitch benchmarks and reports that fine-tuned models exceed chance, frontier LLMs, and (in management) expert majority votes, with peak accuracy of 59.2% for Qwen3-4B versus 41.6% experts and 31.1% frontier mean; models also exhibit calibrated confidence and enable high-accuracy selective triage.

Significance. If the benchmarks prove free of leakage and the training signal isolates merit-based judgment rather than prestige or fashion, the result would supply a scalable, data-driven route to training AI evaluators for low-verifiability domains where no oracle exists. This could materially augment peer review and research triage in the social sciences.

major comments (3)
  1. [Benchmark construction] Benchmark construction (abstract and methods): the manuscript states that held-out four-tier research-pitch benchmarks were built but supplies no description of pitch generation, removal of author/institutional metadata, topic balancing, or explicit data-leakage controls between training publication records and the evaluation sets. Without these details the reported gains (e.g., 59.2% in management) cannot be confidently attributed to learned taste rather than replication of the same non-merit signals present in the training data.
  2. [Training-signal validity] Training-signal validity (introduction and evaluation sections): the central claim requires that publication outcomes predominantly encode genuine field-specific evaluative judgment. The paper does not report any controls or robustness checks for known confounders (author prestige, institutional affiliation, topic popularity). If these signals remain in the held-out pitches, the 17.6-point margin over expert majority vote could be explained by the model learning gatekeeping biases rather than superior judgment.
  3. [Expert comparison] Expert comparison (management evaluation): the 59.2% accuracy is contrasted with 48 expert gatekeepers at 41.6% (non-tied majority). The manuscript does not specify whether experts received identical pitch formats stripped of metadata, how ties were resolved, or the exact decision criterion, making it impossible to interpret the gap as evidence of superior scientific taste.
minor comments (2)
  1. [Abstract] The abstract asserts that 'selective triage on this signal reached very high accuracy on the highest-confidence subsets' but does not report the numerical accuracies or confidence thresholds for those subsets.
  2. [Introduction] The term 'institutional traces' is introduced without an early, explicit definition that distinguishes it from raw publication counts or metadata.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. These points identify important gaps in methodological transparency that we will address in revision. Below we respond to each major comment and indicate the changes we will make.

read point-by-point responses
  1. Referee: [Benchmark construction] Benchmark construction (abstract and methods): the manuscript states that held-out four-tier research-pitch benchmarks were built but supplies no description of pitch generation, removal of author/institutional metadata, topic balancing, or explicit data-leakage controls between training publication records and the evaluation sets. Without these details the reported gains (e.g., 59.2% in management) cannot be confidently attributed to learned taste rather than replication of the same non-merit signals present in the training data.

    Authors: We agree that the current manuscript provides insufficient detail on benchmark construction. In the revised version we will add a dedicated subsection in Methods that fully describes: (1) pitch generation from publication records (title + abstract rephrased into four-tier choice sets), (2) systematic stripping of all author names, affiliations, and other metadata, (3) topic balancing via stratified sampling across sub-disciplines using embedding-based clustering, and (4) leakage controls consisting of temporal hold-out, author disambiguation, and cosine-similarity thresholds on topic embeddings to ensure no overlap between training publication records and evaluation pitches (a schematic sketch of such a filter appears after these responses). These additions will allow readers to assess whether performance gains reflect learned evaluative judgment. revision: yes

  2. Referee: [Training-signal validity] Training-signal validity (introduction and evaluation sections): the central claim requires that publication outcomes predominantly encode genuine field-specific evaluative judgment. The paper does not report any controls or robustness checks for known confounders (author prestige, institutional affiliation, topic popularity). If these signals remain in the held-out pitches, the 17.6-point margin over expert majority vote could be explained by the model learning gatekeeping biases rather than superior judgment.

    Authors: We acknowledge that the manuscript does not include explicit robustness checks for prestige, affiliation, or topic-popularity confounds. While the held-out pitches are metadata-stripped and the fine-tuned models outperform both frontier LLMs and domain experts (who are themselves exposed to the same institutional signals), this does not fully isolate merit-based judgment. In revision we will add a new subsection discussing these potential confounds, report any available post-hoc stratification (e.g., by topic popularity proxies derived from citation counts), and explicitly note the limitation that residual prestige or fashion signals may remain. We will also frame the expert outperformance as suggestive rather than conclusive evidence. revision: partial

  3. Referee: [Expert comparison] Expert comparison (management evaluation): the 59.2% accuracy is contrasted with 48 expert gatekeepers at 41.6% (non-tied majority). The manuscript does not specify whether experts received identical pitch formats stripped of metadata, how ties were resolved, or the exact decision criterion, making it impossible to interpret the gap as evidence of superior scientific taste.

    Authors: We will revise the management evaluation section to supply the missing protocol details. The revised text will state that the 48 experts received exactly the same metadata-stripped four-tier pitch sets presented to the models, that non-tied majority was computed by discarding cases where no option received a strict majority, and that the decision criterion was to choose the single pitch most likely to merit publication attention at a top-tier venue in the field. These clarifications will make the 17.6-point gap directly interpretable. revision: yes
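The leakage controls named in the first response, a temporal hold-out plus an embedding-similarity cutoff between training records and evaluation pitches, can be sketched as follows. The embedding function, the cutoff year, and the 0.90 similarity threshold are illustrative assumptions, not the authors' protocol.

```python
# Sketch of similarity-based leakage filtering between training records and eval pitches.
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def filter_eval_pitches(train_items, eval_items, embed, cutoff_year=2023, sim_threshold=0.90):
    """Drop evaluation pitches that predate the temporal hold-out or sit too close
    to any training record in embedding space. `embed` maps text -> 1-D numpy vector."""
    train_vecs = [embed(item["text"]) for item in train_items]
    kept = []
    for item in eval_items:
        if item["year"] < cutoff_year:          # temporal hold-out: eval must postdate training
            continue
        v = embed(item["text"])
        if any(cosine(v, tv) >= sim_threshold for tv in train_vecs):
            continue                            # near-duplicate topic: treat as potential leakage
        kept.append(item)
    return kept
```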

Circularity Check

0 steps flagged

No circularity: held-out benchmarks and external expert votes keep claims independent

full rationale

The paper trains LLMs via supervised fine-tuning on field-specific publication outcomes as labels, then evaluates accuracy on separate held-out four-tier research-pitch benchmarks against independent expert majority votes and frontier-model baselines. No derivation step reduces a reported prediction to a quantity defined by the model's own fitted parameters, no self-citation supplies a load-bearing uniqueness theorem, and no ansatz or renaming is smuggled in. The central performance numbers (e.g., 59.2% vs. 41.6% expert vote) are measured on external data, satisfying the self-contained benchmark criterion.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the premise that historical publication decisions encode reliable signals of scientific merit that can be learned by models.

axioms (1)
  • domain assumption Publication outcomes reflect underlying field-specific scientific taste or quality.
    Invoked when using publication tier as the supervision signal for training evaluative judgment.

pith-pipeline@v0.9.0 · 5617 in / 1249 out tokens · 61305 ms · 2026-05-15T09:51:44.802857+00:00 · methodology

discussion (0)


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. The Ideation Bottleneck: Decomposing the Quality Gap Between AI-Generated and Human Economics Research

    econ.GN · 2026-04 · unverdicted · novelty 5.0

    The quality gap between AI and human economics research is driven primarily by inferior idea generation, which accounts for 71% of the difference.

Reference graph

Works this paper leans on

65 extracted references · 65 canonical work pages · cited by 1 Pith paper · 3 internal anchors

  1. [1]

    Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021)

  2. [2]

    Hubert, T. et al. Olympiad-level formal mathematical reasoning with reinforcement learning. Nature (2025)

  3. [3]

    OpenAI. OpenAI 2025 ICPC submissions. GitHub https://github.com/openai/openai-icpc-2025 (2025)

  4. [4]

    Si, C. et al. Can LLMs generate novel research ideas? A large-scale human study with 100+ NLP researchers. ICLR (2025)

  5. [5]

    Hao, Q. et al. Artificial intelligence tools expand scientists’ impact but contract science’s focus. Nature 649, 1237–1243 (2026)

  6. [6]

    Karpatne, A. et al. AI-enabled scientific revolution in the age of generative AI: second NSF workshop report. npj Artif. Intell. (2025)

  7. [7]

    Dell’Acqua, F. et al. Navigating the jagged technological frontier: Field experimental evidence of the effects of AI on knowledge worker productivity and quality. Harvard Business School Working Paper 24-013 (2023)

  8. [13]

    Callaham, M.L. & Tercier, J. The relationship of previous training and experience of journal peer reviewers to subsequent review quality. PLoS Med. 4, e40 (2007)

  9. [14]

    Black, N., van Rooyen, S., Godlee, F., Smith, R. & Evans, S. What makes a good reviewer and a good review for a general medical journal? JAMA 280, 231–233 (1998)

  10. [15]

    Callaham, M. & McCulloch, C. Longitudinal trends in the performance of scientific peer reviewers. Ann. Emerg. Med. 57, 141–148 (2011)

  11. [17]

    Boudreau, K.J. et al. Looking across and looking beyond the knowledge frontier. Manag. Sci. 62, 2765–2783 (2016)

  12. [18]

    Teplitskiy, M., Peng, H., Blasco, A. & Lakhani, K.R. Is novel research worth doing? Evidence from peer review at 49 journals. PNAS 119, e2118046119 (2022)

  13. [21]

    Nonaka, I. & Takeuchi, H. The Knowledge-Creating Company (Oxford Univ. Press, 1995)

  14. [22]

    Naddaf, M. More than half of researchers now use AI for peer review—often against guidance. Nature 649, 273–274 (2026)

  15. [23]

    Bergstrom, C.T. & Bak-Coleman, J. AI, peer review and the human activity of science. Nature (2025)

  16. [24]

    Russo Latona, G. et al. The AI review lottery: Widespread AI-assisted peer reviews boost paper scores and acceptance rates. arXiv 2405.02150 (2024)

  17. [25]

    Zhu, C. et al. When your reviewer is an LLM: Biases, divergence, and prompt injection risks in peer review. arXiv 2509.09912 (2025)

  18. [26]

    Shin, H. et al. Mind the blind spots: A focus-level evaluation framework for LLM reviews. Proc. EMNLP (2025)

  19. [27]

    Thelwall, M. Can ChatGPT evaluate research quality? J. Data Inf. Sci. 9, 1–21 (2024)

  20. [32]

    Tetlock, P.E. Expert Political Judgment (Princeton Univ. Press, 2005)

  21. [33]

    Gallo, S.A. et al. The influence of peer reviewer expertise on the evaluation of research funding applications. PLoS ONE 11, e0165147 (2016)

  22. [38]

    Hinton, G., Vinyals, O. & Dean, J. Distilling the knowledge in a neural network. arXiv 1503.02531 (2015)

  23. [39]

    Autor, D.H. Why are there still so many jobs? The history and future of workplace automation. J. Econ. Perspect. 29, 3–30 (2015)

  24. [42]

    Yu, Q. et al. DAPO: An open-source LLM reinforcement learning system at scale. NeurIPS (2025)

  25. [44]

    Gruber, M. Analyzing Academy of Management Journal operations with artificial intelligence (2006–2022). Acad. Manag. J. 68, 1–10 (2025)

  26. [45]

    Card, D. & DellaVigna, S. Nine facts about top journals in economics. J. Econ. Lit. 51, 144–161 (2013)

  27. [46]

    Yanagizawa-Drott, D., Awuah, K. et al. Project APE: Autonomous policy evaluation with AI-generated economics research papers. Social Catalyst Lab, University of Zurich https://ape.socialcatalystlab.org/ (2026)

  28. [47]

    Yamada, Y. et al. The AI Scientist-v2: Workshop-level automated scientific discovery via agentic tree search. arXiv 2504.08066 (2025)

  29. [48]

    Bourdieu, P. Distinction: A Social Critique of the Judgement of Taste (Harvard Univ. Press, 1984)

  30. [49]

    Polanyi, M. The Tacit Dimension (Doubleday, 1966)

  31. [50]

    Corley, K.G. & Gioia, D.A. Building theory about theory building. Acad. Manag. Rev. 36, 12–32 (2011)

  32. [51]

    Colquitt, J.A. & George, G. Publishing in AMJ—Part 1: Topic choice. Acad. Manag. J. 54, 432–435 (2011)

  33. [52]

    Bornmann, L., Mutz, R. & Daniel, H.-D. A reliability-generalization study of journal peer reviews. PLoS ONE 5, e14331 (2010)

  34. [53]

    Pier, E.L. et al. Low agreement among reviewers evaluating the same NIH grant applications. PNAS 115, 2952–2957 (2018)

  35. [54]

    Lamont, M. How Professors Think: Inside the Curious World of Academic Judgment (Harvard Univ. Press, 2009)

  36. [55]

    Siler, K., Lee, K. & Bero, L. Measuring the effectiveness of scientific gatekeeping. PNAS 112, 360–365 (2015)

  37. [56]

    Collins, H. Tacit and Explicit Knowledge (University of Chicago Press, 2010)

  38. [57]

    Christiano, P.F. et al. Deep reinforcement learning from human preferences. NeurIPS (2017)

  39. [58]

    Sharma, M. et al. Towards understanding sycophancy in language models. Proc. ICLR (2024)

  40. [59]

    Kahneman, D., Sibony, O. & Sunstein, C.R. Noise: A Flaw in Human Judgment (Little, Brown Spark, 2021)

  41. [60]

    Guo, D. et al. DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning. Nature 645, 633–638 (2025)

  42. [61]

    Wu, Y. et al. On the generalization of SFT: A reinforcement learning perspective with reward rectification. ICLR (2026)

  43. [62]

    Wilson, T.D. & Schooler, J.W. Thinking too much: Introspection can reduce the quality of preferences and decisions. J. Pers. Soc. Psychol. 60, 181–192 (1991)

  44. [63]

    Liu, R. et al. Mind your step (by step): Chain-of-thought can reduce performance on tasks where thinking makes humans worse. Proc. ICML (2025)

  45. [64]

    Sprague, Z. et al. To CoT or not to CoT? Chain-of-thought helps mainly on math and symbolic reasoning. ICLR (2025)

  46. [65]

    Shao, Z. et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv 2402.03300 (2024)

  47. [66]

    Wei, J. et al. Finetuned language models are zero-shot learners. ICLR (2022)

  48. [67]

    Yu, Q. et al. DAPO: An open-source LLM reinforcement learning system at scale. NeurIPS (2025)
