pith. machine review for the scientific record.

arxiv: 2603.16659 · v3 · submitted 2026-03-17 · 💻 cs.AI · econ.GN · q-fin.EC

Recognition: no theorem link

LLMs learn scientific taste from institutional traces across the social sciences

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 09:51 UTC · model grok-4.3

classification 💻 cs.AI · econ.GN · q-fin.EC
keywords LLM evaluation · scientific judgment · social sciences · fine-tuning · publication outcomes · research pitches · evaluative taste

The pith

Fine-tuned LLMs learn to predict social science publication tiers from past outcomes, outperforming experts and frontier models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper asks whether records of what social science fields have actually published can train AI models to judge which new research ideas deserve attention. Across psychology, economics, sociology and five other disciplines, the authors built benchmarks of research pitches labeled by the tier at which they appeared in journals. They then applied supervised fine-tuning to smaller LLMs on those labels and tested the models on held-out pitches. The resulting models beat random guessing in every field and, in management, reached 59.2 percent accuracy against the expert reviewers' 41.6 percent majority vote. This result matters because many scientific decisions rest on evaluative taste rather than verifiable correctness, and publication histories appear to encode enough signal to scale that taste to machines.
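To make that training signal concrete, here is a minimal sketch of how a publication record could be turned into a supervised example. This is not the authors' pipeline: the prompt wording, the four tier names, and the toy pitches are illustrative assumptions; only the overall shape (metadata-stripped pitch in, observed publication tier out) comes from the paper.

```python
# Illustrative sketch only: tier names, prompt wording, and example pitches
# are hypothetical placeholders, not the paper's protocol.
import json

TIER_LABELS = {1: "exceptional", 2: "strong", 3: "fair", 4: "limited"}  # assumed 4-tier scheme

def make_sft_example(pitch_text: str, journal_tier: int) -> dict:
    """Turn one metadata-stripped research pitch and its observed publication
    tier into a supervised fine-tuning example (prompt -> tier label)."""
    prompt = (
        "You are evaluating a social science research pitch.\n"
        f"Pitch: {pitch_text}\n"
        "Which tier best matches the venue where this idea was published? "
        "Answer with one of: exceptional, strong, fair, limited."
    )
    return {"prompt": prompt, "completion": TIER_LABELS[journal_tier]}

records = [
    # (pitch rephrased from title + abstract, observed journal tier) -- invented examples
    ("Does team cognitive diversity predict venture survival after funding shocks?", 1),
    ("A replication of anchoring effects in consumer price judgments.", 3),
]

with open("management_sft.jsonl", "w") as f:
    for pitch, tier in records:
        f.write(json.dumps(make_sft_example(pitch, tier)) + "\n")
```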

Core claim

Institutional traces consisting of which research pitches were published at which tier supply a usable training signal that lets supervised fine-tuning turn LLMs into field-specific evaluators whose accuracy exceeds both expert majority votes and current frontier models.

What carries the argument

Supervised fine-tuning of LLMs on four-tier research-pitch benchmarks whose labels come directly from observed publication outcomes in each discipline.
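A minimal sketch of what that fine-tuning step could look like, assuming the Hugging Face TRL SFTTrainer and the prompt/completion file from the sketch above. The model name follows the paper (Qwen3-4B); the hyperparameters, output path, and dataset format are illustrative assumptions rather than the authors' actual configuration.

```python
# Hedged sketch of supervised fine-tuning on tier-labeled pitches.
# Assumes the TRL library; hyperparameters are placeholders, not the paper's.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("json", data_files="management_sft.jsonl", split="train")

config = SFTConfig(
    output_dir="qwen3-4b-management-taste",  # hypothetical output directory
    num_train_epochs=3,                      # placeholder schedule
    per_device_train_batch_size=4,
    learning_rate=2e-5,
)

trainer = SFTTrainer(
    model="Qwen/Qwen3-4B",     # checkpoint loaded by name; the paper's best single model
    train_dataset=dataset,     # prompt/completion pairs from the previous sketch
    args=config,
)
trainer.train()
```

The supervision is a plain next-token objective on the tier label; nothing field-specific enters except the publication outcomes encoded in the labels.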

If this is right

  • Fine-tuned models exceed the 25 percent random baseline in all eight disciplines tested.
  • In management the best model reaches 59.2 percent accuracy, 17.6 points above expert majority vote.
  • Model confidence rises on correct predictions and falls on errors, producing calibrated scores.
  • Restricting decisions to the highest-confidence subset yields very high accuracy in every field (a minimal triage sketch follows this list).
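The triage mechanism in the last bullet can be sketched in a few lines: sort predictions by the model's stated confidence, keep the top slice, and score accuracy on that retained subset. The toy numbers below are invented; only the mechanism mirrors the paper's claim.

```python
# Selective triage sketch: accuracy on the most-confident fraction of predictions.
import numpy as np

def coverage_accuracy(confidences, predictions, labels, keep_fraction=0.25):
    """Accuracy on the top `keep_fraction` most-confident predictions."""
    confidences = np.asarray(confidences)
    predictions = np.asarray(predictions)
    labels = np.asarray(labels)
    k = max(1, int(len(confidences) * keep_fraction))
    top = np.argsort(-confidences)[:k]          # indices of the highest-confidence cases
    return float((predictions[top] == labels[top]).mean())

# Invented four-tier predictions (0-3) with per-example confidence.
conf = [0.92, 0.41, 0.88, 0.55, 0.97, 0.33, 0.76, 0.60]
pred = [0, 2, 1, 3, 0, 1, 2, 2]
true = [0, 1, 1, 3, 0, 3, 2, 0]

print(coverage_accuracy(conf, pred, true, keep_fraction=0.5))  # accuracy on the confident half
```

If calibration holds as the paper reports, accuracy on the retained slice should rise as the kept fraction shrinks.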

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same publication-trace approach could be tested in natural-science fields that also keep tiered journal records.
  • High-confidence triage might be combined with human review to reduce total reviewer hours while preserving quality.
  • If the signal proves robust, it offers one concrete route to machine assistance in domains where reinforcement learning has no verifiable reward.

Load-bearing premise

Publication outcomes in the training data capture genuine field-specific judgments about idea quality rather than prestige, fashion, or gatekeeping effects.

What would settle it

Track a fresh batch of research pitches through actual journal submissions and rejections; if the fine-tuned model's tier predictions show no better-than-chance correlation with the real outcomes, the claim fails.
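One way such a prospective test could be scored, sketched below: correlate the model's tier predictions for the fresh pitches with the tiers those pitches actually reach, and compare the correlation against a permutation null. The choice of Spearman correlation and the toy data are assumptions of this sketch, not something the paper prescribes.

```python
# Sketch of the falsification test: does predicted tier track realized tier better than chance?
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)

predicted_tier = np.array([1, 3, 2, 4, 1, 2, 3, 4, 2, 1])  # model output (hypothetical)
realized_tier  = np.array([1, 4, 2, 3, 2, 2, 3, 4, 1, 1])  # observed journal outcomes (hypothetical)

rho, _ = spearmanr(predicted_tier, realized_tier)

# Permutation null: how often does a shuffled outcome vector match rho by chance?
null = [spearmanr(predicted_tier, rng.permutation(realized_tier))[0] for _ in range(10_000)]
p_value = float(np.mean(np.abs(null) >= abs(rho)))

print(f"Spearman rho = {rho:.2f}, permutation p = {p_value:.3f}")
# A rho indistinguishable from the permutation null would mean the claim fails.
```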

read the original abstract

Reinforcement-learned reasoning has powered recent AI leaps on verifiable tasks, including mathematics, code, and structure prediction. The harder bottleneck is evaluative judgment in low-verifiability domains, where no oracle anchors reward and the core question is which untested ideas deserve attention. We test whether institutional traces, the record of what fields published, where, and at which tier, can serve as a training signal for AI evaluators. Across eight social science disciplines (psychology, economics, communication, sociology, political science, management, business and finance, public administration), we built held-out four-tier research-pitch benchmarks and supervised-fine-tuned (SFT) LLMs on field-specific publication outcomes. The fine-tuned models cleared the 25 percent chance baseline and exceeded frontier-model performance by wide margins, with best single-model accuracy ranging from 55.0 percent in public administration to 85.5 percent in psychology. In management, evaluated against 48 expert gatekeepers, 174 junior researchers, and 11 frontier reasoning models, the best single fine-tuned model (Qwen3-4B) reached 59.2 percent, 17.6 percentage points above expert majority vote (41.6 percent, non-tied) and 28.1 percentage points above the frontier mean (31.1 percent). The fine-tuned models also showed calibrated confidence: confidence rose when predictions were correct and fell when wrong, mirroring how a skilled reviewer can say "I'm sure" versus "I'm guessing." Selective triage on this signal reached very high accuracy on the highest-confidence subsets in every field. Institutional traces, we conclude, encode a scalable training signal for the low-verifiability judgment on which science depends.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that LLMs can learn field-specific evaluative judgment ('scientific taste') by supervised fine-tuning on institutional publication traces (what gets published, where, and at what tier) across eight social sciences. It constructs held-out four-tier research-pitch benchmarks and reports that fine-tuned models exceed chance, frontier LLMs, and (in management) expert majority votes, with peak accuracy of 59.2% for Qwen3-4B versus 41.6% experts and 31.1% frontier mean; models also exhibit calibrated confidence and enable high-accuracy selective triage.

Significance. If the benchmarks prove free of leakage and the training signal isolates merit-based judgment rather than prestige or fashion, the result would supply a scalable, data-driven route to training AI evaluators for low-verifiability domains where no oracle exists. This could materially augment peer review and research triage in the social sciences.

major comments (3)
  1. [Benchmark construction] Benchmark construction (abstract and methods): the manuscript states that held-out four-tier research-pitch benchmarks were built but supplies no description of pitch generation, removal of author/institutional metadata, topic balancing, or explicit data-leakage controls between training publication records and the evaluation sets. Without these details the reported gains (e.g., 59.2% in management) cannot be confidently attributed to learned taste rather than replication of the same non-merit signals present in the training data.
  2. [Training-signal validity] Training-signal validity (introduction and evaluation sections): the central claim requires that publication outcomes predominantly encode genuine field-specific evaluative judgment. The paper does not report any controls or robustness checks for known confounders (author prestige, institutional affiliation, topic popularity). If these signals remain in the held-out pitches, the 17.6-point margin over expert majority vote could be explained by the model learning gatekeeping biases rather than superior judgment.
  3. [Expert comparison] Expert comparison (management evaluation): the 59.2% accuracy is contrasted with 48 expert gatekeepers at 41.6% (non-tied majority). The manuscript does not specify whether experts received identical pitch formats stripped of metadata, how ties were resolved, or the exact decision criterion, making it impossible to interpret the gap as evidence of superior scientific taste.
minor comments (2)
  1. [Abstract] The abstract asserts that 'selective triage on this signal reached very high accuracy on the highest-confidence subsets' but does not report the numerical accuracies or confidence thresholds for those subsets.
  2. [Introduction] The term 'institutional traces' is introduced without an early, explicit definition that distinguishes it from raw publication counts or metadata.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. These points identify important gaps in methodological transparency that we will address in revision. Below we respond to each major comment and indicate the changes we will make.

read point-by-point responses
  1. Referee: [Benchmark construction] Benchmark construction (abstract and methods): the manuscript states that held-out four-tier research-pitch benchmarks were built but supplies no description of pitch generation, removal of author/institutional metadata, topic balancing, or explicit data-leakage controls between training publication records and the evaluation sets. Without these details the reported gains (e.g., 59.2% in management) cannot be confidently attributed to learned taste rather than replication of the same non-merit signals present in the training data.

    Authors: We agree that the current manuscript provides insufficient detail on benchmark construction. In the revised version we will add a dedicated subsection in Methods that fully describes: (1) pitch generation from publication records (title + abstract rephrased into four-tier choice sets), (2) systematic stripping of all author names, affiliations, and other metadata, (3) topic balancing via stratified sampling across sub-disciplines using embedding-based clustering, and (4) leakage controls consisting of temporal hold-out, author disambiguation, and cosine-similarity thresholds on topic embeddings to ensure no overlap between training publication records and evaluation pitches (a schematic sketch of such a filter appears after these responses). These additions will allow readers to assess whether performance gains reflect learned evaluative judgment. revision: yes

  2. Referee: [Training-signal validity] Training-signal validity (introduction and evaluation sections): the central claim requires that publication outcomes predominantly encode genuine field-specific evaluative judgment. The paper does not report any controls or robustness checks for known confounders (author prestige, institutional affiliation, topic popularity). If these signals remain in the held-out pitches, the 17.6-point margin over expert majority vote could be explained by the model learning gatekeeping biases rather than superior judgment.

    Authors: We acknowledge that the manuscript does not include explicit robustness checks for prestige, affiliation, or topic-popularity confounds. While the held-out pitches are metadata-stripped and the fine-tuned models outperform both frontier LLMs and domain experts (who are themselves exposed to the same institutional signals), this does not fully isolate merit-based judgment. In revision we will add a new subsection discussing these potential confounds, report any available post-hoc stratification (e.g., by topic popularity proxies derived from citation counts), and explicitly note the limitation that residual prestige or fashion signals may remain. We will also frame the expert outperformance as suggestive rather than conclusive evidence. revision: partial

  3. Referee: [Expert comparison] Expert comparison (management evaluation): the 59.2% accuracy is contrasted with 48 expert gatekeepers at 41.6% (non-tied majority). The manuscript does not specify whether experts received identical pitch formats stripped of metadata, how ties were resolved, or the exact decision criterion, making it impossible to interpret the gap as evidence of superior scientific taste.

    Authors: We will revise the management evaluation section to supply the missing protocol details. The revised text will state that the 48 experts received exactly the same metadata-stripped four-tier pitch sets presented to the models, that non-tied majority was computed by discarding cases where no option received a strict majority, and that the decision criterion was to choose the single pitch most likely to merit publication attention at a top-tier venue in the field. These clarifications will make the 17.6-point gap directly interpretable. revision: yes
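The leakage controls named in the first response, a temporal hold-out plus an embedding-similarity cutoff between training records and evaluation pitches, can be sketched as follows. The embedding function, the cutoff year, and the 0.90 similarity threshold are illustrative assumptions, not the authors' protocol.

```python
# Sketch of similarity-based leakage filtering between training records and eval pitches.
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def filter_eval_pitches(train_items, eval_items, embed, cutoff_year=2023, sim_threshold=0.90):
    """Drop evaluation pitches that predate the temporal hold-out or sit too close
    to any training record in embedding space. `embed` maps text -> 1-D numpy vector."""
    train_vecs = [embed(item["text"]) for item in train_items]
    kept = []
    for item in eval_items:
        if item["year"] < cutoff_year:          # temporal hold-out: eval must postdate training
            continue
        v = embed(item["text"])
        if any(cosine(v, tv) >= sim_threshold for tv in train_vecs):
            continue                            # near-duplicate topic: treat as potential leakage
        kept.append(item)
    return kept
```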

Circularity Check

0 steps flagged

No circularity: held-out benchmarks and external expert votes keep claims independent

full rationale

The paper trains LLMs via supervised fine-tuning on field-specific publication outcomes as labels, then evaluates accuracy on separate held-out four-tier research-pitch benchmarks against independent expert majority votes and frontier-model baselines. No derivation step reduces a reported prediction to a quantity defined by the model's own fitted parameters, no self-citation supplies a load-bearing uniqueness theorem, and no ansatz or renaming is smuggled in. The central performance numbers (e.g., 59.2% vs. 41.6% expert vote) are measured on external data, satisfying the self-contained benchmark criterion.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the premise that historical publication decisions encode reliable signals of scientific merit that can be learned by models.

axioms (1)
  • domain assumption Publication outcomes reflect underlying field-specific scientific taste or quality.
    Invoked when using publication tier as the supervision signal for training evaluative judgment.

pith-pipeline@v0.9.0 · 5617 in / 1249 out tokens · 61305 ms · 2026-05-15T09:51:44.802857+00:00 · methodology

discussion (0)


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. The Ideation Bottleneck: Decomposing the Quality Gap Between AI-Generated and Human Economics Research

    econ.GN · 2026-04 · unverdicted · novelty 5.0

    The quality gap between AI and human economics research is driven primarily by inferior idea generation, which accounts for 71% of the difference.

Reference graph

Works this paper leans on

65 extracted references · 65 canonical work pages · cited by 1 Pith paper · 3 internal anchors

  1. [1]

    Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021)

  2. [2]

    Hubert, T. et al. Olympiad-level formal mathematical reasoning with reinforcement learning. Nature (2025)

  3. [3]

    OpenAI. OpenAI 2025 ICPC submissions. GitHub https://github.com/openai/openai-icpc-2025 (2025)

  4. [4]

    Si, C. et al. Can LLMs generate novel research ideas? A large-scale human study with 100+ NLP researchers. ICLR (2025)

  5. [5]

    Hao, Q. et al. Artificial intelligence tools expand scientists’ impact but contract science’s focus. Nature 649, 1237–1243 (2026)

  6. [6]

    Karpatne, A. et al. AI-enabled scientific revolution in the age of generative AI: second NSF workshop report. npj Artif. Intell. (2025)

  7. [7]

    Dell’Acqua, F. et al. Navigating the jagged technological frontier: Field experimental evidence of the effects of AI on knowledge worker productivity and quality. Harvard Business School Working Paper 24-013 (2023)

  8. [13]

    Callaham, M.L. & Tercier, J. The relationship of previous training and experience of journal peer reviewers to subsequent review quality. PLoS Med. 4, e40 (2007)

  9. [14]

    Black, N., van Rooyen, S., Godlee, F., Smith, R. & Evans, S. What makes a good reviewer and a good review for a general medical journal? JAMA 280, 231–233 (1998)

  10. [15]

    Callaham, M. & McCulloch, C. Longitudinal trends in the performance of scientific peer reviewers. Ann. Emerg. Med. 57, 141–148 (2011)

  11. [17]

    Boudreau, K.J. et al. Looking across and looking beyond the knowledge frontier. Manag. Sci. 62, 2765–2783 (2016)

  12. [18]

    Teplitskiy, M., Peng, H., Blasco, A. & Lakhani, K.R. Is novel research worth doing? Evidence from peer review at 49 journals. PNAS 119, e2118046119 (2022)

  13. [21]

    Nonaka, I. & Takeuchi, H. The Knowledge-Creating Company (Oxford Univ. Press, 1995)

  14. [22]

    Naddaf, M. More than half of researchers now use AI for peer review—often against guidance. Nature 649, 273–274 (2026)

  15. [23]

    Bergstrom, C.T. & Bak-Coleman, J. AI, peer review and the human activity of science. Nature (2025)

  16. [24]

    Russo Latona, G. et al. The AI review lottery: Widespread AI-assisted peer reviews boost paper scores and acceptance rates. arXiv 2405.02150 (2024)

  17. [25]

    Zhu, C. et al. When your reviewer is an LLM: Biases, divergence, and prompt injection risks in peer review. arXiv 2509.09912 (2025)

  18. [26]

    Shin, H. et al. Mind the blind spots: A focus-level evaluation framework for LLM reviews. Proc. EMNLP (2025)

  19. [27]

    Thelwall, M. Can ChatGPT evaluate research quality? J. Data Inf. Sci. 9, 1–21 (2024)

  20. [32]

    Tetlock, P.E. Expert Political Judgment (Princeton Univ. Press, 2005)

  21. [33]

    Gallo, S.A. et al. The influence of peer reviewer expertise on the evaluation of research funding applications. PLoS ONE 11, e0165147 (2016)

  22. [38]

    Hinton, G., Vinyals, O. & Dean, J. Distilling the knowledge in a neural network. arXiv 1503.02531 (2015)

  23. [39]

    Autor, D.H. Why are there still so many jobs? The history and future of workplace automation. J. Econ. Perspect. 29, 3–30 (2015)

  24. [42]

    Yu, Q. et al. DAPO: An open-source LLM reinforcement learning system at scale. NeurIPS (2025)

  25. [44]

    Gruber, M. Analyzing Academy of Management Journal operations with artificial intelligence (2006–2022). Acad. Manag. J. 68, 1–10 (2025)

  26. [45]

    Card, D. & DellaVigna, S. Nine facts about top journals in economics. J. Econ. Lit. 51, 144–161 (2013)

  27. [46]

    Yanagizawa-Drott, D., Awuah, K. et al. Project APE: Autonomous policy evaluation with AI-generated economics research papers. Social Catalyst Lab, University of Zurich https://ape.socialcatalystlab.org/ (2026)

  28. [47]

    Yamada, Y. et al. The AI Scientist-v2: Workshop-level automated scientific discovery via agentic tree search. arXiv 2504.08066 (2025)

  29. [48]

    Bourdieu, P. Distinction: A Social Critique of the Judgement of Taste (Harvard Univ. Press, 1984)

  30. [49]

    Polanyi, M. The Tacit Dimension (Doubleday, 1966)

  31. [50]

    Corley, K.G. & Gioia, D.A. Building theory about theory building. Acad. Manag. Rev. 36, 12–32 (2011)

  32. [51]

    Colquitt, J.A. & George, G. Publishing in AMJ—Part 1: Topic choice. Acad. Manag. J. 54, 432–435 (2011)

  33. [52]

    Bornmann, L., Mutz, R. & Daniel, H.-D. A reliability-generalization study of journal peer reviews. PLoS ONE 5, e14331 (2010)

  34. [53]

    Pier, E.L. et al. Low agreement among reviewers evaluating the same NIH grant applications. PNAS 115, 2952–2957 (2018)

  35. [54]

    Lamont, M. How Professors Think: Inside the Curious World of Academic Judgment (Harvard Univ. Press, 2009)

  36. [55]

    Siler, K., Lee, K. & Bero, L. Measuring the effectiveness of scientific gatekeeping. PNAS 112, 360–365 (2015)

  37. [56]

    Collins, H. Tacit and Explicit Knowledge (University of Chicago Press, 2010)

  38. [57]

    Christiano, P.F. et al. Deep reinforcement learning from human preferences. NeurIPS (2017)

  39. [58]

    Sharma, M. et al. Towards understanding sycophancy in language models. Proc. ICLR (2024)

  40. [59]

    Kahneman, D., Sibony, O. & Sunstein, C.R. Noise: A Flaw in Human Judgment (Little, Brown Spark, 2021)

  41. [60]

    Guo, D. et al. DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning. Nature 645, 633–638 (2025)

  42. [61]

    Wu, Y. et al. On the generalization of SFT: A reinforcement learning perspective with reward rectification. ICLR (2026)

  43. [62]

    Wilson, T.D. & Schooler, J.W. Thinking too much: Introspection can reduce the quality of preferences and decisions. J. Pers. Soc. Psychol. 60, 181–192 (1991)

  44. [63]

    Liu, R. et al. Mind your step (by step): Chain-of-thought can reduce performance on tasks where thinking makes humans worse. Proc. ICML (2025)

  45. [64]

    Sprague, Z. et al. To CoT or not to CoT? Chain-of-thought helps mainly on math and symbolic reasoning. ICLR (2025)

  46. [65]

    Shao, Z. et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv 2402.03300 (2024)

  47. [66]

    Wei, J. et al. Finetuned language models are zero-shot learners. ICLR (2022)

  48. [67]

    Yu, Q. et al. DAPO: An open-source LLM reinforcement learning system at scale. NeurIPS (2025)
