arxiv: 2605.07855 · v2 · pith:KPTUWT5Pnew · submitted 2026-05-08 · 📊 stat.AP

Jagged AI in Scientific Peer Review: Evidence from POMP Data Analysis

Pith reviewed 2026-05-20 22:47 UTC · model grok-4.3

classification 📊 stat.AP
keywords: AI in peer review · jagged AI capabilities · POMP models · partially observed Markov processes · scientific peer review · time series analysis · mechanistic modeling

The pith

AI peer reviewers catch technical errors in POMP analyses but fail on interpretive checks

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper tests AI tools on reviewing student projects involving partially observed Markov process models, which fit dynamic mechanisms to time series data. It shows AI agents are effective at detecting technical implementation errors and flawed inference methods that human reviewers missed. At the same time, these AI reviewers do not reach human levels when evaluating how well interpretations fit the scientific context, how coherent the narrative is, or how the model fits domain knowledge. This uneven performance profile remained similar no matter how the AI was instructed, suggesting the limitation is built into the base models. Readers should care because it indicates AI can help with parts of the review process but is not ready to take over the full task of ensuring scientific quality.

Core claim

AI reviewers exhibited a jagged capability profile, proficiently catching human-overlooked technical errors and invalid inference methodology, while failing to match human standards in checking interpretive errors, narrative coherence, and domain-informed model critique. The jaggedness was similar for all agents, consistent with it being primarily a property of the underlying AI model rather than the specific instructions. Skill file configuration shifted which weaknesses agents emphasized, without removing the jaggedness.

What carries the argument

Comparison of human peer reviews to AI agents implemented via Claude Code skill files, applied to 72 anonymized POMP student projects

If this is right

  • AI can supplement human review by identifying technical and methodological errors in POMP analyses.
  • Different instruction configurations change which specific weaknesses are shown but do not remove the overall jagged profile.
  • The jagged capability is likely inherent to the AI model rather than dependent on review instructions.
  • Effective peer review of mechanistic time-series models requires strengths in both technical validation and domain-informed interpretation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Combining AI for technical error detection with human reviewers for interpretive and coherence checks could create more efficient hybrid review processes.
  • The jaggedness observed here may apply to AI use in other areas of scientific evaluation beyond POMP modeling.
  • Further experiments could test whether newer AI models or training on peer review data reduce the identified weaknesses.
  • If the student projects differ from professional submissions, the results might understate or overstate AI performance in real journal reviews.

Load-bearing premise

The 72 student POMP projects from one graduate course and their human reviews adequately represent the standards and difficulties of scientific peer review for mechanistic time-series models.

What would settle it

Finding a different performance pattern, such as AI matching or exceeding humans across all categories, when testing on a broader set of professional scientific papers using POMP or similar models.

Original abstract

Despite their growing use in academic writing and statistical analysis, the performance of artificial intelligence (AI) tools in scientific peer review remains a largely unexplored area. A key challenge is jagged AI, a phenomenon where AI exhibits strong ability spikes in some domains while remaining deficient in others. To study this jaggedness in a practical data science context, we considered the task of reviewing partially observed Markov process (POMP) data analyses. POMP models, also known as state-space models or hidden Markov models, are used to fit mechanistic dynamic models to time series data in diverse applications including disease transmission, ecological dynamics, and financial risk assessment. High-quality peer review in this area entails assessment of scientific context, identification of errors in implementing complex algorithms, and decisions concerning methodological best practices. We studied 72 POMP projects from four semesters of a University of Michigan graduate time series course for which the project reports, the source code, and student peer reviews are anonymized and open-access. We compared the human reviews with four AI reviewing agents, using Claude Code with differing instructions implemented as skill files. We found that AI reviewers exhibited a jagged capability profile, proficiently catching human-overlooked technical errors and invalid inference methodology, while failing to match human standards in checking interpretive errors, narrative coherence, and domain-informed model critique. The jaggedness was found to be similar for all agents, consistent with it being primarily a property of the underlying AI model rather than the specific instructions. Skill file configuration shifted which weaknesses agents emphasized, without removing the jaggedness.
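
For readers unfamiliar with POMP models, the following is a minimal sketch of the kind of computation such an analysis involves: simulating a latent Markov process with noisy observations, then estimating the log-likelihood with a bootstrap particle filter. The toy linear-Gaussian model, the parameter names (`a`, `sigma`, `tau`), and the function names are illustrative assumptions; none of them is taken from the paper or from the student projects.

```python
import math
import random


def simulate(n_steps, a=0.8, sigma=1.0, tau=0.5, rng=None):
    """Simulate a toy POMP: latent AR(1) state observed with Gaussian noise."""
    rng = rng or random.Random(0)
    x, states, obs = 0.0, [], []
    for _ in range(n_steps):
        x = a * x + rng.gauss(0.0, sigma)    # latent process step (unobserved)
        states.append(x)
        obs.append(x + rng.gauss(0.0, tau))  # noisy measurement of the state
    return states, obs


def normal_logpdf(y, mean, sd):
    """Log density of a normal distribution with the given mean and sd."""
    return -0.5 * math.log(2 * math.pi * sd * sd) - 0.5 * ((y - mean) / sd) ** 2


def particle_filter_loglik(obs, a, sigma, tau, n_particles=500, rng=None):
    """Bootstrap particle filter estimate of the log-likelihood (toy model only)."""
    rng = rng or random.Random(1)
    particles = [0.0] * n_particles
    loglik = 0.0
    for y in obs:
        # Propagate each particle through the latent process model.
        particles = [a * x + rng.gauss(0.0, sigma) for x in particles]
        # Weight each particle by the measurement density of the observation.
        logw = [normal_logpdf(y, x, tau) for x in particles]
        m = max(logw)                        # subtract the max for stability
        w = [math.exp(lw - m) for lw in logw]
        loglik += m + math.log(sum(w) / n_particles)
        # Multinomial resampling in proportion to the weights.
        particles = rng.choices(particles, weights=w, k=n_particles)
    return loglik


if __name__ == "__main__":
    _, observations = simulate(50)
    print(particle_filter_loglik(observations, a=0.8, sigma=1.0, tau=0.5))
    print(particle_filter_loglik(observations, a=-0.8, sigma=1.0, tau=0.5))
```

Comparing the estimate at the simulating parameters with the estimate at a deliberately wrong value of `a` gives a feel for the likelihood-based inference that a reviewer of such a project would need to check.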

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript reports an empirical comparison of four AI reviewing agents (based on Claude with varying skill-file instructions) against human peer reviews for 72 anonymized POMP data-analysis projects drawn from four semesters of a University of Michigan graduate time-series course. The central finding is that the AI agents display a jagged capability profile: they outperform humans at detecting technical errors and invalid inference methodology that human reviewers overlooked, yet underperform on interpretive errors, narrative coherence, and domain-informed model critique. The authors conclude that this jaggedness is largely invariant across skill-file configurations and therefore intrinsic to the underlying model.

Significance. If the reported pattern holds under more rigorous validation, the work supplies concrete, domain-specific evidence on where current large-language-model reviewers add value versus where they remain unreliable in mechanistic time-series modeling. The open-access release of the anonymized projects, code, and human reviews is a clear strength that supports reproducibility and future meta-analyses. The observation that jaggedness persists across instruction variants is also useful for guiding prompt-engineering research.

major comments (2)
  1. [Data and Methods] The manuscript treats the 72 student POMP projects and their associated peer reviews as a proxy for the challenges of professional scientific peer review, yet provides no expert-rater comparison (or even a rubric-based scoring) of student versus journal-level reviews on the same error categories. Because the central claim concerns relevance to 'scientific peer review,' this untested mapping is load-bearing; if student assignments are narrower or evaluated against lower domain-expertise thresholds, both the technical-error successes and the interpretive shortfalls may not generalize.
  2. [Methods and Results] No statistical tests, effect-size measures, or inter-rater reliability statistics (e.g., Cohen’s kappa or intraclass correlation for human reviewers) are described for the comparison of AI versus human error-detection rates. Without these, it is impossible to determine whether the reported jagged profile exceeds what would be expected from sampling variability or from the particular error-classification rubric employed.
minor comments (2)
  1. [Abstract] The term 'jagged AI' is introduced without a concise operational definition; a single-sentence gloss on first use would improve accessibility for readers outside the AI-for-science literature.
  2. The manuscript would benefit from an explicit table or appendix listing the exact error categories, their operational definitions, and example excerpts from the POMP reports.
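
The inter-rater reliability statistic requested in major comment 2 can be illustrated with a short sketch of Cohen's kappa for two raters. The function name and the example labels are hypothetical and are not data from the manuscript; the sketch assumes each rater assigns one categorical label per item (here 1 = error flagged, 0 = not flagged).

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters who each label the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed proportion of items on which the raters agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance from each rater's marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return float("nan")  # undefined when both raters use one identical label
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    # Hypothetical labels from two human reviewers on ten items.
    reviewer_1 = [1, 1, 0, 0, 1, 0, 1, 0, 1, 1]
    reviewer_2 = [1, 0, 0, 0, 1, 0, 1, 1, 1, 1]
    print(round(cohens_kappa(reviewer_1, reviewer_2), 3))  # prints 0.583
```

Kappa corrects raw agreement (0.8 in this made-up example) for the agreement expected by chance (0.52), which is why it is a stricter measure than percent agreement.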

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments, which highlight important considerations for the generalizability and statistical rigor of our study. We address each major comment below and describe the planned revisions.

Point-by-point responses
  1. Referee: [Data and Methods] The manuscript treats the 72 student POMP projects and their associated peer reviews as a proxy for the challenges of professional scientific peer review, yet provides no expert-rater comparison (or even a rubric-based scoring) of student versus journal-level reviews on the same error categories. Because the central claim concerns relevance to 'scientific peer review,' this untested mapping is load-bearing; if student assignments are narrower or evaluated against lower domain-expertise thresholds, both the technical-error successes and the interpretive shortfalls may not generalize.

    Authors: We agree that student projects from a graduate course constitute a proxy rather than a direct sample of professional journal peer review, and that this distinction affects the strength of claims about broader scientific peer review. The POMP analyses involve authentic mechanistic modeling tasks drawn from real student work in a specialized time-series course, and the error taxonomy aligns with criteria used in statistical peer review. Nevertheless, we acknowledge the absence of a direct expert-rater comparison as a limitation on generalizability. In the revised manuscript we will expand the Discussion and Limitations sections to explicitly note that student reviews may operate under different expertise thresholds and time constraints than journal reviews, while emphasizing that the observed jagged profile still supplies domain-specific evidence relevant to AI assistance in data-analysis review. We will also outline directions for future studies that incorporate professional reviews.

    revision: partial

  2. Referee: [Methods and Results] No statistical tests, effect-size measures, or inter-rater reliability statistics (e.g., Cohen’s kappa or intraclass correlation for human reviewers) are described for the comparison of AI versus human error-detection rates. Without these, it is impossible to determine whether the reported jagged profile exceeds what would be expected from sampling variability or from the particular error-classification rubric employed.

    Authors: We appreciate this observation regarding the lack of formal statistical support. The original manuscript emphasized descriptive comparisons of detection rates across error categories. In the revision we will add paired statistical tests (e.g., McNemar’s test for differences in error-detection proportions between AI and human reviewers), report effect sizes such as odds ratios with confidence intervals, and compute inter-rater reliability metrics (Cohen’s kappa and intraclass correlation) for the human reviews where multiple independent reviews per project exist. These additions will allow readers to assess whether the jagged capability pattern is statistically distinguishable from sampling variability and rubric-specific effects.

    revision: yes
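
The paired analysis promised in this response could be sketched as follows: an exact McNemar test on the discordant pairs, plus a matched-pairs odds ratio with a Wald confidence interval. This is a minimal illustration under stated assumptions, not the authors' analysis. The function names are hypothetical, and the pairing unit (one AI verdict and one human verdict per error instance) is an assumption about how the data might be arranged.

```python
from math import comb, exp, log, sqrt


def discordant_counts(ai_hits, human_hits):
    """Count items detected by only one of the two reviewer types."""
    if len(ai_hits) != len(human_hits):
        raise ValueError("paired inputs must have equal length")
    ai_only = sum(bool(a) and not h for a, h in zip(ai_hits, human_hits))
    human_only = sum(bool(h) and not a for a, h in zip(ai_hits, human_hits))
    return ai_only, human_only


def mcnemar_exact(ai_only, human_only):
    """Two-sided exact McNemar p-value from the two discordant counts."""
    n = ai_only + human_only
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(ai_only, human_only)
    # Under the null, each discordant pair is equally likely to go either way.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def paired_odds_ratio(ai_only, human_only, z=1.96):
    """Matched-pairs odds ratio with an approximate 95% Wald interval."""
    if ai_only == 0 or human_only == 0:
        raise ValueError("odds ratio is undefined when a discordant count is 0")
    ratio = ai_only / human_only
    se = sqrt(1 / ai_only + 1 / human_only)  # standard error of log(ratio)
    return ratio, exp(log(ratio) - z * se), exp(log(ratio) + z * se)


if __name__ == "__main__":
    # Hypothetical paired detections: 9 AI-only, 2 human-only, 5 both, 4 neither.
    ai = [True] * 9 + [False] * 2 + [True] * 5 + [False] * 4
    human = [False] * 9 + [True] * 2 + [True] * 5 + [False] * 4
    b, c = discordant_counts(ai, human)
    print(b, c, mcnemar_exact(b, c), paired_odds_ratio(b, c))
```

Only the discordant pairs carry information about which reviewer type detects more, so items caught by both or by neither drop out of both statistics.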

Circularity Check

0 steps flagged

Empirical comparison exhibits no circularity

The manuscript is an empirical study that directly compares outputs from four AI reviewing agents against human peer reviews on 72 anonymized student POMP projects. No derivations, equations, fitted parameters, or predictions are present that could reduce to inputs by construction. The jagged-capability observations rest on external human review data rather than any self-referential definition or self-citation chain. The choice of skill-file instructions is an experimental design decision, not a load-bearing circular step. The analysis is therefore self-contained against the provided human benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the assumption that student course projects and their peer reviews form a valid testbed for scientific peer review standards. No free parameters are described. No new entities are postulated beyond the descriptive term 'jagged AI'.

axioms (1)
  • domain assumption: Student POMP projects and associated human reviews from one graduate course are representative of broader scientific peer review challenges in mechanistic modeling.
    This premise is required to generalize the observed jagged profile beyond the specific course setting.
invented entities (1)
  • jagged AI (no independent evidence)
    purpose: Descriptive label for the observed pattern of uneven AI capabilities across review dimensions.
    The term is introduced to name the empirical pattern; it does not carry independent predictive content.

pith-pipeline@v0.9.0 · 5816 in / 1322 out tokens · 41105 ms · 2026-05-20T22:47:42.981540+00:00 · methodology
