pith. machine review for the scientific record.

arxiv: 2601.12910 · v3 · submitted 2026-01-19 · 💻 cs.CL · cs.AI

Recognition: no theorem link

SciCoQA: Quality Assurance for Scientific Paper--Code Alignment

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 13:17 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords SciCoQA · paper-code discrepancies · LLM evaluation · reproducibility · discrepancy detection · synthetic data · benchmark dataset · taxonomy

The pith

Large language models detect fewer than half of real discrepancies between scientific papers and their code.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds SciCoQA, a dataset of 635 paper-code discrepancies: 92 drawn from real GitHub issues and reproducibility reports, plus 543 generated synthetically. It tests 22 models on the task of spotting these mismatches and finds that even the strongest ones identify only 46.7 percent of the authentic cases. The work also supplies a taxonomy of discrepancy categories and pinpoints where models fail most often, such as when papers omit key details or when the combined paper-and-code input is long. This matters because mismatches damage reproducibility, and automated agents are already producing scientific output faster than humans can review it manually. The synthetic pipeline is designed to extend testing into fields like physics and quantitative biology.
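
Operationally, the task each model faces is easy to state: given the paper text and the repository, say whether the two disagree and where. A minimal sketch of that loop, assuming a generic chat-completion callable and an illustrative YAML verdict format (neither is the paper's exact harness):

```python
# Hypothetical sketch of the detection loop. `call_model` stands in for
# whatever chat-completion API a harness would use, and the YAML verdict
# schema below is illustrative, not SciCoQA's exact format.
import yaml  # pip install pyyaml

PROMPT = """You are checking a scientific paper against its code.

## Paper
{paper}

## Code
{code}

Does the code's implementation match the paper's description?
Answer in YAML:
has_discrepancy: <yes or no>
explanation: <one short paragraph>
"""

def detect_discrepancy(paper: str, code: str, call_model) -> dict:
    """Query the model once and parse its YAML verdict."""
    raw = call_model(PROMPT.format(paper=paper, code=code))
    verdict = yaml.safe_load(raw) or {}
    flag = str(verdict.get("has_discrepancy", "no")).strip().lower()
    return {"has_discrepancy": flag == "yes",
            "explanation": verdict.get("explanation", "")}
```

Recall on the benchmark is then the fraction of labeled discrepancies that such verdicts surface; the hard part, per the paper's analysis, is that the relevant signal sits inside long paper-plus-repository contexts.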

Core claim

SciCoQA supplies a benchmark of 635 paper-code discrepancy instances, 92 of them real, to measure how well language models can verify alignment between a paper and its code. On the real subset the best models detect only 46.7 percent of the discrepancies. The accompanying taxonomy classifies mismatch types, and the analysis shows consistent weaknesses on omitted details, long inputs, and papers outside a model's training distribution.

What carries the argument

SciCoQA dataset of paper-code discrepancies, built from real GitHub issues plus a synthetic generation pipeline, together with a taxonomy of discrepancy types and categories.

Load-bearing premise

The 92 real discrepancies collected from GitHub issues and reproducibility papers represent the broader space of mismatches that appear in practice.

What would settle it

Gather and label a much larger collection of verified real-world paper-code discrepancies from new sources and test whether top models still detect only around 47 percent of them.
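
Sample size matters for that test: with only 92 cases, "around 47 percent" is a wide target. A quick Wilson-interval sketch (mapping the reported 46.7 percent onto hypothetical counts) makes the point:

```python
# Sketch: Wilson 95% score interval for a detection rate, to gauge how
# much a larger labeled collection would tighten the headline number.
# The counts below are hypothetical (43/92 ~ 46.7%).
import math

def wilson_interval(k: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for k detections out of n cases."""
    p = k / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - half, center + half

print(wilson_interval(43, 92))    # roughly (0.37, 0.57) on 92 cases
print(wilson_interval(234, 500))  # roughly (0.42, 0.51) on 500 cases
```

A replication on a few hundred independently sourced cases could distinguish "about 47 percent" from, say, 35 or 60 percent; the current real subset cannot.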

Figures

Figures reproduced from arXiv: 2601.12910 by Iryna Gurevych, Tim Baumgärtner.

Figure 1: Example from SciCoQA, showing a specific model implementation in the paper and its implementation in the code (simplified for readability). The paper's description and the code's implementation mismatch, creating a paper-code discrepancy.
Figure 2: Overview of the data collection process of …
Figure 3: Analysis of the real (blue), synthetic (orange), and combined (green) discrepancy data. The y-axis …
Figure 4: Results of the top 8 best-performing models (sorted by average recall on the real and synthetic data) on …
Figure 5: Performance of top 8 models when given paper and code, and only the code, split by data origin: Real (R), Synthetic (S), and combined (R+S).
Figure 6: Quantitative analysis of synthetic code modifications. We show the distribution of number of changed …
Figure 7: Comparisons between the two ground truth descriptions generated by Gemini 3.1 Pro and GPT-5.
Figure 8: Distribution of programming languages in …
Figure 9: Correlation between model recall on the synthetic …
original abstract

Discrepancies between scientific papers and their code undermine reproducibility, a concern that grows as automated research agents scale scientific output beyond human review capacity. Whether LLMs can reliably detect such discrepancies has not been systematically measured. To this end, we present SciCoQA, a dataset of 635 paper-code discrepancies (92 real, 543 synthetic) for this cross-modal verification task. Across 22 evaluated models, even the best-performing LLMs, Gemini 3.1 Pro and GPT-5 Mini, detect only 46.7% of real-world discrepancies, revealing a critical gap in automated scientific quality assurance. We construct SciCoQA from GitHub issues and reproducibility papers, and propose a synthetic generation pipeline to scale beyond AI to Physics, Quantitative Biology, and other computational sciences. We further introduce a taxonomy of discrepancy types and categories to characterize the occurring mismatches. Our analysis shows that models particularly struggle with omitted paper details, long-context inputs, and papers outside their pre-training corpus.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces SciCoQA, a dataset of 635 paper-code discrepancies (92 real cases collected from GitHub issues and reproducibility papers, plus 543 synthetic examples) for the task of detecting mismatches between scientific papers and their code. It evaluates 22 LLMs on this task, reports that even the strongest models (Gemini 3.1 Pro and GPT-5 Mini) detect only 46.7% of the real discrepancies, and supplies a taxonomy of discrepancy types together with an analysis of model weaknesses on omitted details, long-context inputs, and out-of-corpus papers.

Significance. If the 92 real cases prove representative, the work provides a concrete empirical measurement of a critical limitation in current LLMs for automated scientific quality assurance, an issue that will grow in importance as automated research agents scale up scientific output. The combination of real and synthetic data plus the taxonomy offers a reusable resource and diagnostic framework that could guide future model development in cross-modal verification.

major comments (2)
  1. [Abstract and Dataset Construction] The headline result that top models detect only 46.7% of real-world discrepancies rests on the assumption that the 92 GitHub-sourced cases are representative of the broader space of paper-code mismatches. Collection exclusively from reported issues and reproducibility papers selects for mismatches that humans have already noticed and documented; no sampling frame, coverage statistics, or comparison against a random draw from arXiv+GitHub pairs is described, so the measured gap cannot be extrapolated to the general population the abstract claims to address.
  2. [Evaluation and Results] The 46.7% detection figure for real discrepancies is presented without reported inter-annotator agreement on the labeling of the 92 real cases, without error bars or confidence intervals on the per-model metrics, and without explicit validation that the 543 synthetic examples were checked for distributional match to the real discrepancies.
minor comments (2)
  1. [Taxonomy section] The taxonomy of discrepancy types is introduced but would benefit from one or two concrete examples per category placed in the main text rather than only in the appendix.
  2. [Results table] The table or figure presenting the 22-model results should include the exact prompt template and context length used for each model to allow direct replication.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our work. We address each major point below and will revise the manuscript accordingly to improve clarity and rigor.

point-by-point responses
  1. Referee: [Abstract and Dataset Construction] The headline result that top models detect only 46.7% of real-world discrepancies rests on the assumption that the 92 GitHub-sourced cases are representative of the broader space of paper-code mismatches. Collection exclusively from reported issues and reproducibility papers selects for mismatches that humans have already noticed and documented; no sampling frame, coverage statistics, or comparison against a random draw from arXiv+GitHub pairs is described, so the measured gap cannot be extrapolated to the general population the abstract claims to address.

    Authors: We agree that the 92 real cases, drawn from GitHub issues and reproducibility papers, constitute a convenience sample of documented discrepancies rather than a statistically representative draw from all arXiv+GitHub pairs. This selection is deliberate: our benchmark targets mismatches that have already surfaced as practical problems in the community. The abstract's phrasing of 'real-world discrepancies' is therefore imprecise and will be revised to 'reported real-world discrepancies.' We will also add an explicit limitations subsection discussing selection bias, the absence of a formal sampling frame, and the complementary role of the 543 synthetic examples in covering a wider range of discrepancy types. These changes preserve the core empirical observation that current LLMs fail on a substantial fraction of documented cases while avoiding over-extrapolation. revision: yes

  2. Referee: [Evaluation and Results] The 46.7% detection figure for real discrepancies is presented without reported inter-annotator agreement on the labeling of the 92 real cases, without error bars or confidence intervals on the per-model metrics, and without explicit validation that the 543 synthetic examples were checked for distributional match to the real discrepancies.

    Authors: We acknowledge these reporting gaps. The 92 real cases were labeled by the authors using the introduced taxonomy, with conflicts resolved by consensus; we will compute and report inter-annotator agreement (e.g., Cohen's kappa) in the revision. Bootstrap confidence intervals will be added to all per-model metrics. For the synthetic examples, we performed manual spot-checks against the real discrepancy taxonomy during generation, but we will augment this with quantitative validation (type-frequency histograms and embedding-based distributional similarity) and include the results. All requested statistics and validation details will appear in the revised manuscript. revision: yes
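
Both statistics promised here are cheap to compute once the labels exist. A sketch with placeholder arrays (none of these values are SciCoQA's actual labels):

```python
# Sketch of the promised statistics: Cohen's kappa between two annotators'
# taxonomy labels, and a bootstrap 95% CI for detection recall on the 92
# real cases. All arrays are synthetic placeholders.
import numpy as np
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(0)
cats = ["Loss", "Algorithm", "Training", "Evaluation"]

# Two annotators labeling 92 cases; annotator B agrees ~80% of the time.
ann_a = rng.choice(cats, size=92)
ann_b = np.where(rng.random(92) < 0.8, ann_a, rng.choice(cats, size=92))
print("Cohen's kappa:", round(cohen_kappa_score(ann_a, ann_b), 3))

# Bootstrap CI for recall: resample per-case hits with replacement.
hits = rng.random(92) < 0.467  # True = model found the discrepancy
boots = [rng.choice(hits, size=hits.size, replace=True).mean()
         for _ in range(10_000)]
lo, hi = np.percentile(boots, [2.5, 97.5])
print(f"recall ~ {hits.mean():.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```

Even on simulated data the bootstrap interval on 92 cases spans roughly ten points either way, which is worth keeping in mind when reading the 46.7% headline.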

Circularity Check

0 steps flagged

No circularity: empirical dataset construction and benchmarking

full rationale

The paper constructs SciCoQA from external GitHub issues and reproducibility papers (92 real cases) plus a synthetic generation pipeline, then reports empirical LLM detection rates (46.7% on real cases) across 22 models. No equations, derivations, fitted parameters renamed as predictions, or self-referential claims appear. The central result is a direct measurement on collected data rather than any reduction to inputs by construction. Self-citations, if present, are not load-bearing for the reported detection gap.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The work rests on the assumption that discrepancies between papers and code can be reliably identified and categorized by humans, and that synthetic generation can usefully approximate real mismatches without introducing artifacts that inflate or deflate model performance.

axioms (2)
  • domain assumption Human annotators can consistently identify and label paper-code discrepancies from GitHub issues and reproducibility reports.
    The 92 real examples depend on this labeling step; no agreement statistics are provided in the abstract.
  • domain assumption Synthetic discrepancies generated by the pipeline are sufficiently similar to real ones for model evaluation purposes.
    The majority of the dataset (543 examples) is synthetic; the abstract does not detail validation against real distributions. A frequency-level check is sketched below.
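
One cheap probe of this second axiom is to compare taxonomy-category frequencies between the real and synthetic subsets. A sketch using the paper's discrepancy categories with hypothetical counts:

```python
# Sketch: chi-square test of homogeneity between real and synthetic
# taxonomy-category frequencies. Counts are hypothetical placeholders
# that sum to the reported subset sizes (92 real, 543 synthetic).
from scipy.stats import chi2_contingency

categories = ["Loss", "Algorithm", "Training", "Evaluation", "Model", "Data", "Other"]
real_counts      = [12, 25, 15, 18, 10,  8,  4]    # 92 total
synthetic_counts = [70, 150, 90, 105, 60, 45, 23]  # 543 total

for name, r, s in zip(categories, real_counts, synthetic_counts):
    print(f"{name:<11} real {r/92:.2f}  synthetic {s/543:.2f}")

chi2, p, dof, _ = chi2_contingency([real_counts, synthetic_counts])
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p:.3f}")
# A small p-value would flag a frequency mismatch between the two
# subsets (evidence against the axiom); a large one is weak support.
```

This only checks category frequencies, not the harder question of whether synthetic edits are as subtle as real ones; that is what the promised embedding-based comparison would address.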

pith-pipeline@v0.9.0 · 5469 in / 1277 out tokens · 30866 ms · 2026-05-16T13:17:08.804283+00:00 · methodology

discussion (0)


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Do Papers Tell the Whole Story? A Benchmark and Framework for Uncovering Hidden Implementation Gaps in Bioinformatics

    cs.LG 2026-03 unverdicted novelty 8.0

    BioCon is the first benchmark dataset and cross-modal framework for detecting inconsistencies between methodological descriptions in bioinformatics papers and their code implementations.

Reference graph

Works this paper leans on

26 extracted references · 26 canonical work pages · cited by 1 Pith paper · 1 internal anchor
