LLMs learn scientific taste from institutional traces across the social sciences
Pith reviewed 2026-05-15 09:51 UTC · model grok-4.3
The pith
Fine-tuned LLMs learn to predict social science publication tiers from past outcomes, outperforming experts and frontier models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Institutional traces, the record of which research was published at which tier, supply a usable training signal: supervised fine-tuning on them turns LLMs into field-specific evaluators whose accuracy exceeds both expert majority votes and current frontier models.
What carries the argument
Supervised fine-tuning of LLMs on four-tier research-pitch benchmarks whose labels come directly from observed publication outcomes in each discipline.
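A minimal sketch of how such training pairs could be assembled, assuming a generic JSONL chat format; the field names, prompt wording, and the unified tier labels are illustrative stand-ins, not the paper's actual pipeline.

```python
import json

# Unified four-tier labels used for illustration; the paper's analyses use
# exceptional / strong / fair / limited.
TIERS = ["exceptional", "strong", "fair", "limited"]

def to_sft_record(pitch: str, tier: str) -> dict:
    """Turn one (metadata-stripped pitch, observed publication tier) pair
    into a supervised fine-tuning example in a generic chat format."""
    assert tier in TIERS, f"unknown tier: {tier}"
    prompt = ("Rate this research pitch as one of "
              f"{', '.join(TIERS)}:\n\n{pitch}")
    return {"messages": [{"role": "user", "content": prompt},
                         {"role": "assistant", "content": tier}]}

# Hypothetical usage: stream (pitch, tier) pairs from publication records.
pairs = [("Does decentralized decision-making raise team performance "
          "under uncertainty? ...", "strong")]
with open("train.jsonl", "w") as f:
    for pitch, tier in pairs:
        f.write(json.dumps(to_sft_record(pitch, tier)) + "\n")
```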
If this is right
- Fine-tuned models exceed the 25 percent random baseline in all eight disciplines tested.
- In management, the best model reaches 59.2 percent accuracy, 17.6 points above the expert majority vote.
- Model confidence rises on correct predictions and falls on errors, producing calibrated scores.
- Restricting decisions to the highest-confidence subset yields very high accuracy in every field; a sketch of this triage rule follows the list.
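A minimal sketch of that confidence-based triage rule, under stated assumptions: per-item confidences and correctness indicators are available, and `coverage` picks the fraction of highest-confidence decisions to keep (all names hypothetical).

```python
import numpy as np

def selective_accuracy(confidence, correct, coverage=0.2):
    """Accuracy on the most-confident `coverage` fraction of decisions."""
    confidence = np.asarray(confidence, dtype=float)
    correct = np.asarray(correct, dtype=float)
    k = max(1, int(len(confidence) * coverage))
    top = np.argsort(-confidence)[:k]   # indices of the k most-confident items
    return float(correct[top].mean())

# Hypothetical usage: accuracy over all items vs. the top-confidence slice.
conf = [0.9, 0.4, 0.8, 0.3, 0.7]
hit  = [1,   0,   1,   0,   1]
print(np.mean(hit), selective_accuracy(conf, hit, coverage=0.4))  # 0.6 vs 1.0
```

If confidence is calibrated in the paper's sense, the second number should rise well above the first as coverage shrinks.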
Where Pith is reading between the lines
- The same publication-trace approach could be tested in natural-science fields that also keep tiered journal records.
- High-confidence triage might be combined with human review to reduce total reviewer hours while preserving quality.
- If the signal proves robust, it offers one concrete route to machine assistance in domains where reinforcement learning has no verifiable reward.
Load-bearing premise
Publication outcomes in the training data capture genuine field-specific judgments about idea quality rather than prestige, fashion, or gatekeeping effects.
What would settle it
Track a fresh batch of research pitches through actual journal submissions and rejections; if the fine-tuned model's tier predictions show no better-than-chance correlation with the real outcomes, the claim fails.
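One hedged way to operationalize that test: code the four tiers ordinally, compute a Spearman rank correlation between predicted and realized tiers on the fresh batch, and compare it against a permutation null; the data and sample size here are hypothetical.

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)

# Hypothetical fresh batch: predicted vs. realized tiers, coded 1 (limited)
# through 4 (exceptional).
pred = np.array([4, 2, 3, 1, 4, 2, 3, 3, 1, 2])
real = np.array([3, 2, 4, 1, 4, 1, 3, 2, 1, 2])

rho, _ = spearmanr(pred, real)

# Permutation null: shuffle realized tiers to estimate P(rho_null >= rho).
null = np.array([spearmanr(pred, rng.permutation(real))[0]
                 for _ in range(10_000)])
p = (np.sum(null >= rho) + 1) / (null.size + 1)
print(f"rho={rho:.2f}, permutation p={p:.4f}")  # the claim fails if p is large
```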
Original abstract
Reinforcement-learned reasoning has powered recent AI leaps on verifiable tasks, including mathematics, code, and structure prediction. The harder bottleneck is evaluative judgment in low-verifiability domains, where no oracle anchors reward and the core question is which untested ideas deserve attention. We test whether institutional traces, the record of what fields published, where, and at which tier, can serve as a training signal for AI evaluators. Across eight social science disciplines (psychology, economics, communication, sociology, political science, management, business and finance, public administration), we built held-out four-tier research-pitch benchmarks and supervised-fine-tuned (SFT) LLMs on field-specific publication outcomes. The fine-tuned models cleared the 25 percent chance baseline and exceeded frontier-model performance by wide margins, with best single-model accuracy ranging from 55.0 percent in public administration to 85.5 percent in psychology. In management, evaluated against 48 expert gatekeepers, 174 junior researchers, and 11 frontier reasoning models, the best single fine-tuned model (Qwen3-4B) reached 59.2 percent, 17.6 percentage points above expert majority vote (41.6 percent, non-tied) and 28.1 percentage points above the frontier mean (31.1 percent). The fine-tuned models also showed calibrated confidence: confidence rose when predictions were correct and fell when wrong, mirroring how a skilled reviewer can say "I'm sure" versus "I'm guessing." Selective triage on this signal reached very high accuracy on the highest-confidence subsets in every field. Institutional traces, we conclude, encode a scalable training signal for the low-verifiability judgment on which science depends.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that LLMs can learn field-specific evaluative judgment ('scientific taste') by supervised fine-tuning on institutional publication traces (what gets published, where, and at what tier) across eight social sciences. It constructs held-out four-tier research-pitch benchmarks and reports that fine-tuned models exceed chance, frontier LLMs, and (in management) expert majority votes, with peak accuracy of 59.2% for Qwen3-4B versus 41.6% experts and 31.1% frontier mean; models also exhibit calibrated confidence and enable high-accuracy selective triage.
Significance. If the benchmarks prove free of leakage and the training signal isolates merit-based judgment rather than prestige or fashion, the result would supply a scalable, data-driven route to training AI evaluators for low-verifiability domains where no oracle exists. This could materially augment peer review and research triage in the social sciences.
major comments (3)
- [Benchmark construction] Benchmark construction (abstract and methods): the manuscript states that held-out four-tier research-pitch benchmarks were built but supplies no description of pitch generation, removal of author/institutional metadata, topic balancing, or explicit data-leakage controls between training publication records and the evaluation sets. Without these details the reported gains (e.g., 59.2% in management) cannot be confidently attributed to learned taste rather than replication of the same non-merit signals present in the training data.
- [Training-signal validity] Training-signal validity (introduction and evaluation sections): the central claim requires that publication outcomes predominantly encode genuine field-specific evaluative judgment. The paper does not report any controls or robustness checks for known confounders (author prestige, institutional affiliation, topic popularity). If these signals remain in the held-out pitches, the 17.6-point margin over expert majority vote could be explained by the model learning gatekeeping biases rather than superior judgment.
- [Expert comparison] Expert comparison (management evaluation): the 59.2% accuracy is contrasted with 48 expert gatekeepers at 41.6% (non-tied majority). The manuscript does not specify whether experts received identical pitch formats stripped of metadata, how ties were resolved, or the exact decision criterion, making it impossible to interpret the gap as evidence of superior scientific taste.
minor comments (2)
- [Abstract] The abstract asserts that 'selective triage on this signal reached very high accuracy on the highest-confidence subsets' but does not report the numerical accuracies or confidence thresholds for those subsets.
- [Introduction] The term 'institutional traces' is introduced without an early, explicit definition that distinguishes it from raw publication counts or metadata.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments. These points identify important gaps in methodological transparency that we will address in revision. Below we respond to each major comment and indicate the changes we will make.
Point-by-point responses
Referee: [Benchmark construction] Benchmark construction (abstract and methods): the manuscript states that held-out four-tier research-pitch benchmarks were built but supplies no description of pitch generation, removal of author/institutional metadata, topic balancing, or explicit data-leakage controls between training publication records and the evaluation sets. Without these details the reported gains (e.g., 59.2% in management) cannot be confidently attributed to learned taste rather than replication of the same non-merit signals present in the training data.
Authors: We agree that the current manuscript provides insufficient detail on benchmark construction. In the revised version we will add a dedicated subsection in Methods that fully describes: (1) pitch generation from publication records (title + abstract rephrased into four-tier choice sets), (2) systematic stripping of all author names, affiliations, and other metadata, (3) topic balancing via stratified sampling across sub-disciplines using embedding-based clustering, and (4) leakage controls consisting of temporal hold-out, author disambiguation, and cosine-similarity thresholds on topic embeddings to ensure no overlap between training publication records and evaluation pitches. These additions will allow readers to assess whether performance gains reflect learned evaluative judgment. revision: yes
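A minimal sketch of the cosine-similarity leakage check described above, assuming pitch embeddings are already computed; the 0.9 threshold and all names are illustrative, not the authors' settings.

```python
import numpy as np

def leaky_eval_indices(train_emb, eval_emb, threshold=0.9):
    """Indices of evaluation pitches whose max cosine similarity to any
    training record exceeds `threshold`; these would be dropped."""
    a = train_emb / np.linalg.norm(train_emb, axis=1, keepdims=True)
    b = eval_emb / np.linalg.norm(eval_emb, axis=1, keepdims=True)
    sims = b @ a.T                     # (n_eval, n_train) cosine similarities
    return np.where(sims.max(axis=1) > threshold)[0]

# Hypothetical usage with random vectors standing in for pitch embeddings.
rng = np.random.default_rng(1)
train = rng.normal(size=(100, 64))
evals = np.vstack([rng.normal(size=(19, 64)), train[0]])  # plant one leak
print(leaky_eval_indices(train, evals))                   # -> [19]
```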
Referee: [Training-signal validity] Training-signal validity (introduction and evaluation sections): the central claim requires that publication outcomes predominantly encode genuine field-specific evaluative judgment. The paper does not report any controls or robustness checks for known confounders (author prestige, institutional affiliation, topic popularity). If these signals remain in the held-out pitches, the 17.6-point margin over expert majority vote could be explained by the model learning gatekeeping biases rather than superior judgment.
Authors: We acknowledge that the manuscript does not include explicit robustness checks for prestige, affiliation, or topic-popularity confounds. While the held-out pitches are metadata-stripped and the fine-tuned models outperform both frontier LLMs and domain experts (who are themselves exposed to the same institutional signals), this does not fully isolate merit-based judgment. In revision we will add a new subsection discussing these potential confounds, report any available post-hoc stratification (e.g., by topic popularity proxies derived from citation counts), and explicitly note the limitation that residual prestige or fashion signals may remain. We will also frame the expert outperformance as suggestive rather than conclusive evidence. revision: partial
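The post-hoc stratification the authors mention could look roughly like this sketch: bin held-out items by a topic-popularity proxy (citation counts here, purely hypothetical) and check whether accuracy is flat across bins rather than concentrated in popular topics.

```python
import numpy as np

def accuracy_by_popularity(correct, citations, n_bins=3):
    """Accuracy within popularity bins; a roughly flat profile suggests the
    model is not simply keying on topic popularity."""
    correct = np.asarray(correct, dtype=float)
    citations = np.asarray(citations, dtype=float)
    edges = np.quantile(citations, np.linspace(0, 1, n_bins + 1))
    bins = np.digitize(citations, edges[1:-1])   # bin index 0 .. n_bins-1
    return {b: float(correct[bins == b].mean()) if np.any(bins == b) else None
            for b in range(n_bins)}

# Hypothetical usage: per-item correctness and topic citation counts.
print(accuracy_by_popularity([1, 0, 1, 1, 0, 1], [3, 50, 8, 120, 40, 5]))
```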
Referee: [Expert comparison] Expert comparison (management evaluation): the 59.2% accuracy is contrasted with 48 expert gatekeepers at 41.6% (non-tied majority). The manuscript does not specify whether experts received identical pitch formats stripped of metadata, how ties were resolved, or the exact decision criterion, making it impossible to interpret the gap as evidence of superior scientific taste.
Authors: We will revise the management evaluation section to supply the missing protocol details. The revised text will state that the 48 experts received exactly the same metadata-stripped four-tier pitch sets presented to the models, that non-tied majority was computed by discarding cases where no option received a strict majority, and that the decision criterion was to choose the single pitch most likely to merit publication attention at a top-tier venue in the field. These clarifications will make the 17.6-point gap directly interpretable. revision: yes
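The non-tied majority rule described here is simple enough to pin down in a sketch (names hypothetical): a case counts only when one option draws a strict majority of the expert votes.

```python
from collections import Counter

def strict_majority(votes):
    """Return the option chosen by a strict majority of `votes`,
    or None when no option exceeds half the votes (case discarded)."""
    option, count = Counter(votes).most_common(1)[0]
    return option if count > len(votes) / 2 else None

# Hypothetical usage: per-case expert votes over four pitch options A-D.
cases = [["A", "A", "B"], ["A", "B", "C"], ["D", "D", "D"]]
print([strict_majority(v) for v in cases])
# -> ['A', None, 'D']; the middle case is discarded as tied
```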
Circularity Check
No circularity: held-out benchmarks and external expert votes keep claims independent
Full rationale
The paper trains LLMs via supervised fine-tuning on field-specific publication outcomes as labels, then evaluates accuracy on separate held-out four-tier research-pitch benchmarks against independent expert majority votes and frontier-model baselines. No derivation step reduces a reported prediction to a quantity defined by the model's own fitted parameters, no self-citation supplies a load-bearing uniqueness theorem, and no ansatz or renaming is smuggled in. The central performance numbers (e.g., 59.2% vs. 41.6% expert vote) are measured on external data, satisfying the self-contained benchmark criterion.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Publication outcomes reflect underlying field-specific scientific taste or quality.
Forward citations
Cited by 1 Pith paper
- The Ideation Bottleneck: Decomposing the Quality Gap Between AI-Generated and Human Economics Research. The quality gap between AI and human economics research is driven primarily by inferior idea generation, which accounts for 71% of the difference.
Reference graph
Works this paper leans on
- [1] Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021)
- [2] Hubert, T. et al. Olympiad-level formal mathematical reasoning with reinforcement learning. Nature (2025)
- [3] OpenAI. OpenAI 2025 ICPC submissions. GitHub https://github.com/openai/openai-icpc-2025 (2025)
- [4] Si, C. et al. Can LLMs generate novel research ideas? A large-scale human study with 100+ NLP researchers. ICLR (2025)
- [5] Hao, Q. et al. Artificial intelligence tools expand scientists’ impact but contract science’s focus. Nature 649, 1237–1243 (2026)
- [6] Karpatne, A. et al. AI-enabled scientific revolution in the age of generative AI: second NSF workshop report. npj Artif. Intell. (2025)
- [7] Dell’Acqua, F. et al. Navigating the jagged technological frontier: Field experimental evidence of the effects of AI on knowledge worker productivity and quality. Harvard Business School Working Paper 24-013 (2023)
- [13] Callaham, M.L. & Tercier, J. The relationship of previous training and experience of journal peer reviewers to subsequent review quality. PLoS Med. 4, e40 (2007)
- [14] Black, N., van Rooyen, S., Godlee, F., Smith, R. & Evans, S. What makes a good reviewer and a good review for a general medical journal? JAMA 280, 231–233 (1998)
- [15] Callaham, M. & McCulloch, C. Longitudinal trends in the performance of scientific peer reviewers. Ann. Emerg. Med. 57, 141–148 (2011)
- [17] Boudreau, K.J. et al. Looking across and looking beyond the knowledge frontier. Manag. Sci. 62, 2765–2783 (2016)
- [18] Teplitskiy, M., Peng, H., Blasco, A. & Lakhani, K.R. Is novel research worth doing? Evidence from peer review at 49 journals. PNAS 119, e2118046119 (2022)
- [21] Nonaka, I. & Takeuchi, H. The Knowledge-Creating Company (Oxford Univ. Press, 1995)
- [22] Naddaf, M. More than half of researchers now use AI for peer review—often against guidance. Nature 649, 273–274 (2026)
- [23] Bergstrom, C.T. & Bak-Coleman, J. AI, peer review and the human activity of science. Nature (2025)
- [24]
- [25]
- [26] Shin, H. et al. Mind the blind spots: A focus-level evaluation framework for LLM reviews. Proc. EMNLP (2025)
- [27] Thelwall, M. Can ChatGPT evaluate research quality? J. Data Inf. Sci. 9, 1–21 (2024)
- [32] Tetlock, P.E. Expert Political Judgment (Princeton Univ. Press, 2005)
- [33] Gallo, S.A. et al. The influence of peer reviewer expertise on the evaluation of research funding applications. PLoS ONE 11, e0165147 (2016)
- [38] Hinton, G., Vinyals, O. & Dean, J. Distilling the knowledge in a neural network. arXiv 1503.02531 (2015)
- [39] Autor, D.H. Why are there still so many jobs? The history and future of workplace automation. J. Econ. Perspect. 29, 3–30 (2015)
- [42] Yu, Q. et al. DAPO: An open-source LLM reinforcement learning system at scale. NeurIPS (2025)
- [44] Gruber, M. Analyzing Academy of Management Journal operations with artificial intelligence (2006–2022). Acad. Manag. J. 68, 1–10 (2025)
- [45] Card, D. & DellaVigna, S. Nine facts about top journals in economics. J. Econ. Lit. 51, 144–161 (2013)
- [46] Yanagizawa-Drott, D., Awuah, K. et al. Project APE: Autonomous policy evaluation with AI-generated economics research papers. Social Catalyst Lab, University of Zurich https://ape.socialcatalystlab.org/ (2026)
- [47] Yamada, Y. et al. The AI Scientist-v2: Workshop-level automated scientific discovery via agentic tree search. arXiv 2504.08066 (2025)
- [48] Bourdieu, P. Distinction: A Social Critique of the Judgement of Taste (Harvard Univ. Press, 1984)
- Dijksterhuis, A. et al. On making the right choice: The deliberation-without-attention effect. Science 311, 1005–1007 (2006)
- [49] Polanyi, M. The Tacit Dimension (Doubleday, 1966)
- [50] Corley, K.G. & Gioia, D.A. Building theory about theory building. Acad. Manag. Rev. 36, 12–32 (2011)
- [51] Colquitt, J.A. & George, G. Publishing in AMJ—Part 1: Topic choice. Acad. Manag. J. 54, 432–435 (2011)
- [52] Bornmann, L., Mutz, R. & Daniel, H.-D. A reliability-generalization study of journal peer reviews. PLoS ONE 5, e14331 (2010)
- [53] Pier, E.L. et al. Low agreement among reviewers evaluating the same NIH grant applications. PNAS 115, 2952–2957 (2018)
- [54] Lamont, M. How Professors Think: Inside the Curious World of Academic Judgment (Harvard Univ. Press, 2009)
- [55]
- [56] Collins, H. Tacit and Explicit Knowledge (University of Chicago Press, 2010)
- [57] Christiano, P.F. et al. Deep reinforcement learning from human preferences. NeurIPS (2017)
- [58] Sharma, M. et al. Towards understanding sycophancy in language models. Proc. ICLR (2024)
- [59] Kahneman, D., Sibony, O. & Sunstein, C.R. Noise: A Flaw in Human Judgment (Little, Brown Spark, 2021)
- [60] Guo, D. et al. DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning. Nature 645, 633–638 (2025)
- [61] Wu, Y. et al. On the generalization of SFT: A reinforcement learning perspective with reward rectification. ICLR (2026)
- [62] Wilson, T.D. & Schooler, J.W. Thinking too much: Introspection can reduce the quality of preferences and decisions. J. Pers. Soc. Psychol. 60, 181–192 (1991)
- [63] Liu, R. et al. Mind your step (by step): Chain-of-thought can reduce performance on tasks where thinking makes humans worse. Proc. ICML (2025)
- [64] Sprague, Z. et al. To CoT or not to CoT? Chain-of-thought helps mainly on math and symbolic reasoning. ICLR (2025)
- [65] Shao, Z. et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv 2402.03300 (2024)
- [66] Wei, J. et al. Finetuned language models are zero-shot learners. ICLR (2022)
- [67] Yu, Q. et al. DAPO: An open-source LLM reinforcement learning system at scale. NeurIPS (2025)
- Qu, Y. et al. POPE: Learning to reason on hard problems via privileged on-policy exploration. arXiv 2601.18779 (2026)
Methods excerpts
Pitch formats
- CORE_RQ_SHORT (40–60 words): distilled essential research question(s)
- RQ_WITH_CONTEXT (120–150 words): the research question with enough context for expert evaluation, including the phenomenon, gap, question, approach, and claimed contribution
- GAP_FOCUSED (100–130 words): what is known, what remains unknown, and how the study addresses it
- THEORY_AND_MODEL (100–130 words): theoretical framework, key variables and relationships, and theoretical contribution
- CONTRIBUTION_FOCUSED (80–100 words): theoretical, empirical/methodological, and practical contributions as claimed by the authors
The main benchmark uses the RQ_WITH_CONTEXT format. Extraction rules required focusing on the abstract, introduction, and theoretical development sections and using the authors’ exact terminology for key constructs. Research questions usually appear in the abstract and introduction; gaps and problematization mostly in the introduction and sometimes in the theoretical development; theory is introduced in the introduction and often elaborated in theory-development sections; contributions appear in the abstract and at the end of the introduction. Extractors were told to avoid adding their own theoretical connections, improving vague or weak language, creating persuasive hooks not in the original, inferring contributions not explicitly stated, or making gaps sound more compelling than presented.
Survey instrument
- Prior exposure: “Had you encountered this research idea or its source paper before?” Response options: Yes / No
- Quality rating: “Based on the evaluation criteria, how would you rate the quality of this research idea?” Response options: Top / Top- / Good / Fair (the human-facing shorthand, mapped deterministically to exceptional / strong / fair / limited in all analyses)
- Confidence: “How confident are you in your rating?” Response options on a 5-point Likert scale from 1 = “Not at all confident” to 5 = “Extremely confident”
- Domain familiarity: “How familiar are you with this research area?” Response options on a 5-point Likert scale from 1 = “Not at all familiar” to 5 = “Extremely familiar”
Median expert completion time was 923 seconds (~15.4 minutes) for 8 pitches.
Stated limitations
- Execution gap: a strong research idea may be published in a lower-tier journal due to poor execution, and a modest idea may reach a top-tier journal through exceptional methods and writing. The inputs strip execution information, so the model cannot account for this variance.
- Reviewer–manuscript fit: publication decisions depend partly on the match between reviewer expertise and the manuscript’s topic, which introduces stochastic variation unrelated to idea quality.
- Editorial discretion: editors exercise judgment that reflects strategic considerations (journal scope, topic balance, timeliness) beyond pure quality assessment.
- Tier boundary ambiguity: some journals sit at the boundary between adjacent tiers (“Fair” denotes the lowest human-facing tier and maps to the unified tier “Limited”). While the mapping is deterministic, the underlying quality distribution is continuous, creating inherent disagreement for articles near tier boundaries; observed accuracies should be interpreted relative to this ceiling, not against a 100% standard.