Neurodata Without Boredom: Benchmarking Agentic AI for Data Reuse
Pith reviewed 2026-05-15 04:48 UTC · model grok-4.3
The pith
General-purpose AI coding agents handle isolated steps of neuroscience data reformatting but rarely complete error-free end-to-end pipelines.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
General-purpose coding agents commonly used by scientists performed well on each sub-task but rarely strung together a fully error-free end-to-end solution when reformatting data from eight diverse neuroscience papers for decoder training. Agents-as-judges are unreliable at catching errors, especially without ground-truth references, so interactive, human-in-the-loop coding remains necessary.
What carries the argument
The end-to-end reformatting benchmark, which supplies papers, code, and raw data files to agents and requires them to produce clean inputs for training a decoder from neural activity to task or behavioral variables.
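The downstream task can be pictured with a minimal sketch: a ridge-regression decoder trained on binned spike counts, where near-zero decoding performance would flag a reformatting failure. The data and variable names here are illustrative assumptions; the paper does not specify this particular decoder.

```python
# Hypothetical sketch of the benchmark's downstream task: decode a behavioral
# variable from binned neural activity. `spike_counts` and `behavior` are
# synthetic stand-ins for the reformatted files agents must produce.
import numpy as np

rng = np.random.default_rng(0)

# Fake reformatted data: trials x neurons spike counts, one scalar per trial.
n_trials, n_neurons = 200, 50
weights_true = rng.normal(size=n_neurons)
spike_counts = rng.poisson(lam=3.0, size=(n_trials, n_neurons)).astype(float)
behavior = spike_counts @ weights_true + rng.normal(scale=0.5, size=n_trials)

# Ridge regression in closed form: w = (X^T X + lambda I)^{-1} X^T y
X = spike_counts - spike_counts.mean(axis=0)
y = behavior - behavior.mean()
lam = 1.0
w = np.linalg.solve(X.T @ X + lam * np.eye(n_neurons), X.T @ y)

# R^2 on the training data; a value near zero would suggest a formatting
# problem (e.g., misaligned trials) rather than a hard decoding problem.
pred = X @ w
r2 = 1.0 - np.sum((y - pred) ** 2) / np.sum(y ** 2)
print(f"decoder R^2: {r2:.3f}")
```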
If this is right
- Data-sharing practices can be updated to include explicit metadata and examples that reduce the specific error types agents currently make.
- Human oversight remains essential because agent self-evaluation does not reliably detect pipeline failures.
- Success on sub-tasks does not guarantee success on chained workflows, so future agent designs should target end-to-end consistency.
- Common flexible data formats still require external documentation that agents can exploit only when it is explicitly present.
Where Pith is reading between the lines
- If chaining reliability improves, large-scale integration of existing neural datasets could become routine without new manual curation for each reuse case.
- The same benchmarking approach could be applied to other fields that store raw experimental data in heterogeneous formats, such as genomics or materials science.
- Dataset properties that currently trigger agent errors, such as unusual file structures or missing variable descriptions, could be used to prioritize which legacy data to re-document first.
Load-bearing premise
That the eight chosen papers and their formats are representative of typical neuroscience data-reuse obstacles, and that performance on the decoder-training task is a good stand-in for broader reuse utility.
What would settle it
A follow-up test set of new neuroscience papers in which the same general-purpose agents produce complete, error-free reformatted files on the first attempt in at least 80 percent of cases without human edits.
Original abstract
Neuroscience data are highly fragmented across labs, formats, and experimental paradigms, and reuse often requires substantial manual effort. A persistent roadblock to data reuse and integration is the need to decipher bespoke and diverse data formatting choices. Common data formats have been proposed in response, but the field continues to struggle with a fundamental tension: formats flexible enough to accommodate diverse experiments are rarely descriptive enough to be self-explanatory, and sufficiently descriptive formats demand detailed documentation and curation effort that few labs can sustain. Agentic AI is a natural candidate to solve this problem: LLMs read code and text faster and with sustained attention to the low-level details humans tend to skim over. To measure how well agentic AI performs on this task, we selected eight recent papers studying large-scale mouse neural population recordings that shared both data and code, spanning diverse recording modalities, behavioral paradigms, and dataset formats (e.g., NWB, specialized APIs, and general-purpose Python or MATLAB files). We provided agents with the data, code, and paper, and prompted them to load, understand, and reformat the data for a common downstream task: training a decoder from neural activity to task or behavioral variables. General-purpose coding agents commonly used by scientists performed well on each sub-task, but rarely strung together a fully error-free end-to-end solution. We characterize the types of mistakes agents made and the dataset properties that elicited them, and propose data-sharing best practices for the agentic-AI era. We further find that agents-as-judges are unreliable at catching errors, especially without ground-truth references, so interactive, human-in-the-loop coding remains necessary.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. This manuscript benchmarks general-purpose coding agents on neuroscience data reuse. From eight recent papers on large-scale mouse neural recordings that publicly share data and code, agents receive the paper, data, and code and are prompted to load, understand, and reformat the data for a downstream decoder-training task (neural activity to behavioral variables). The central empirical claims are that agents handle individual sub-tasks competently but rarely produce fully error-free end-to-end solutions, that common failure modes can be characterized, that agents-as-judges are unreliable at error detection without ground truth, and that specific data-sharing best practices would help.
Significance. If the empirical observations hold under more rigorous evaluation, the work supplies concrete, field-specific evidence of current agentic-AI limitations for data-reuse workflows in neuroscience. The characterization of failure modes and the call for human-in-the-loop practices could usefully inform both data-sharing standards and future agent design. The study also highlights a practical tension between flexible data formats and machine readability that is widely recognized but rarely quantified in this domain.
Major comments (3)
- [Methods] Methods (dataset selection): the eight papers were chosen precisely because they already share both data and code; this selection criterion favors unusually clean, documented cases and does not represent the typical bespoke, poorly documented, or inaccessible formats that dominate neuroscience data reuse. The general claims about agent performance therefore rest on an unrepresentative sample.
- [Results] Results / Evaluation: the manuscript reports only qualitative outcomes and states that agents 'rarely strung together a fully error-free end-to-end solution' without defining 'error-free,' without reporting success rates, error frequencies, or inter-agent variability, and without error bars or statistical measures. This absence of quantitative metrics makes the central empirical claims impossible to assess rigorously.
- [Methods] Proxy task: decoder-training reformatting is presented as a representative reuse scenario, yet the paper does not demonstrate that success on this task correlates with performance on harder reuse problems (cross-dataset integration, missing metadata, multi-modal alignment). The stress-test concern that the chosen task underestimates real-world difficulty is therefore unaddressed.
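The quantification the referee asks for could start with a confidence interval on the end-to-end success rate. A minimal sketch with a Wilson score interval follows; the counts are made up for illustration, not numbers from the paper.

```python
# Wilson score interval for a binomial proportion -- a standard way to put
# error bars on a small-n success rate like "error-free end-to-end runs".
import math

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a binomial proportion."""
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# e.g., 2 error-free end-to-end runs out of 24 agent attempts (hypothetical)
lo, hi = wilson_interval(2, 24)
print(f"success rate 95% CI: [{lo:.2f}, {hi:.2f}]")
```

Unlike the naive normal approximation, the Wilson interval stays inside [0, 1] and behaves sensibly when successes are near zero, which is exactly the regime the manuscript describes.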
Minor comments (2)
- [Methods] Specify the exact agent implementations, LLM back-ends, versions, and prompting strategies used; reproducibility of the benchmark requires these details.
- [Abstract] The abstract claims agents 'performed well on each sub-task' but provides neither a breakdown of the sub-tasks nor any illustrative examples or failure cases.
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which have helped us strengthen the manuscript. We address each major point below and have made revisions where appropriate to improve clarity, rigor, and transparency.
Point-by-point responses
-
Referee: [Methods] Methods (dataset selection): the eight papers were chosen precisely because they already share both data and code; this selection criterion favors unusually clean, documented cases and does not represent the typical bespoke, poorly documented, or inaccessible formats that dominate neuroscience data reuse. The general claims about agent performance therefore rest on an unrepresentative sample.
Authors: We intentionally selected papers that share both data and code to isolate the agents' performance on data interpretation and reformatting, rather than confounding the evaluation with data-access barriers. This establishes a baseline on relatively well-curated cases. We agree the sample is not representative of typical neuroscience datasets. In the revision we have added an explicit limitations section acknowledging this selection bias and noting that agent performance would likely degrade further on poorly documented data; we also outline how future benchmarks could incorporate more challenging cases. revision: partial
-
Referee: [Results] Results / Evaluation: the manuscript reports only qualitative outcomes and states that agents 'rarely strung together a fully error-free end-to-end solution' without defining 'error-free,' without reporting success rates, error frequencies, or inter-agent variability, and without error bars or statistical measures. This absence of quantitative metrics makes the central empirical claims impossible to assess rigorously.
Authors: We acknowledge that the original presentation relied primarily on qualitative characterization. For the revision we have added quantitative metrics: we now define 'error-free' as code that executes without runtime errors and produces output in the exact format required for the downstream decoder; we report per-agent success rates for end-to-end completion, frequencies of each error category, and inter-agent variability with appropriate error bars and statistical summaries. revision: yes
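The revised 'error-free' criterion could be operationalized with a small validation harness like the following sketch. The required keys and shapes here are assumptions for illustration, not the benchmark's actual specification.

```python
# Minimal format-validation sketch: the converted output must load and match
# the exact structure the downstream decoder expects. Keys/shapes assumed.
import numpy as np

def validate_converted(data: dict) -> list[str]:
    """Return a list of format errors; an empty list means the check passed."""
    errors = []
    for key in ("neural", "behavior"):
        if key not in data:
            errors.append(f"missing key: {key}")
    if errors:
        return errors
    neural = np.asarray(data["neural"])
    behavior = np.asarray(data["behavior"])
    if neural.ndim != 2:
        errors.append(f"neural must be 2-D (trials x neurons), got {neural.ndim}-D")
    if neural.shape[0] != behavior.shape[0]:
        errors.append("trial count mismatch between neural and behavior")
    if np.isnan(neural).any():
        errors.append("NaNs in neural data")
    return errors

good = {"neural": np.zeros((10, 5)), "behavior": np.zeros(10)}
bad = {"neural": np.zeros((10, 5)), "behavior": np.zeros(7)}
print(validate_converted(good))  # []
print(validate_converted(bad))
```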
-
Referee: [Methods] Proxy task: decoder-training reformatting is presented as a representative reuse scenario, yet the paper does not demonstrate that success on this task correlates with performance on harder reuse problems (cross-dataset integration, missing metadata, multi-modal alignment). The stress-test concern that the chosen task underestimates real-world difficulty is therefore unaddressed.
Authors: The decoder-training reformatting task was selected because it is a frequent, concrete reuse goal in the field. We agree it does not automatically generalize to harder scenarios. The revised manuscript now includes a dedicated limitations paragraph discussing this proxy-task choice, provides illustrative examples of how the observed failure modes would compound on cross-dataset integration or missing-metadata cases, and states that the current results should be viewed as a lower bound on difficulty. revision: partial
Circularity Check
No circularity: purely empirical benchmarking on external datasets
Full rationale
The paper reports direct empirical observations from running general-purpose coding agents on data and code from eight independently published neuroscience papers. No mathematical derivations, fitted parameters, predictions derived from inputs, or self-citation chains appear in the described methodology or results. Central claims rest on observable agent behavior (sub-task success vs. end-to-end failures, judge unreliability) measured against external ground-truth data formats, with no reduction of outputs to the paper's own definitions or prior self-citations. The eight-paper selection is an explicit sampling choice whose representativeness is a separate generalizability concern, not a circularity issue.
Axiom & Free-Parameter Ledger
Appendix: agent prompt (excerpts)
The benchmark prompt walks agents through a fixed, checkpointed protocol:
- Project context: "You are a computational neuroscientist. Your goal is to load and reformat data from a neuroscience paper into a specified structure suitable for downstream analysis." Processing must match what is described in the reference paper and code, and agents are assessed on this consistency.
- Setup (Step 0): create `CONVERSION_NOTES.md` as the very first action, before exploring any code or data, and document every decision in it while working. Verify that `python3` runs and that key packages (numpy, torch) import, list the contents of the working directory, and do NOT look at any files outside it. Checkpoint: confirm `CONVERSION_NOTES.md` exists (e.g., via `ls`) before proceeding to Step 1.
- Built-in validation: the supplied scripts report errors and warnings to stdout when checking that the data structure is correct, dimensions are consistent, and values are sensible, and they verify that the decoder can successfully train and predict. Poor decoder performance indicates data-formatting issues that need investigation.
- Consistency requirements: processing and formatting must match the reference paper and code with respect to loading of data, temporal alignment of the different time-series streams (neural, inputs, outputs), processing of the neural, input, and output streams, and curation choices.
- Sample conversion: run `python -u convert_data.py sample_data.pkl --sample --show-processing 2>&1 | tee conversion_sample_out.txt`. Verify the sample data structure matches the specification, dimensions are consistent across trials/sessions, no data is missed during conversion, and metadata accurately describes the data.
- Efficiency: note timing information, estimate the full-conversion runtime (accounting for differences in the lengths and numbers of trials/sessions), and if the estimate exceeds 15 minutes, rewrite the code to be more efficient.
- Sample verification and training: run `python -u train_decoder.py sample_data.pkl --verify-only 2>&1 | tee verification_sample_out.txt`; confirm no errors are reported, attempt to address any warnings, and check input ranges and output value distributions against expectations from the reference texts. Done when `sample_data.pkl` and the verification output both pass inspection. Then run `python -u train_decoder.py sample_data.pkl 2>&1 | tee train_decoder_sample_out.txt` and verify the decoder trains, loss decreases over epochs, and accuracy is above chance for EVERY output; if accuracy is low, check for conversion bugs or reconsider the output representation.
- Full conversion and validation (Step 9): review the time estimate from Step 7 and optimize bottlenecks first if it is long (e.g., over 15 minutes). Run `python -u convert_data.py converted_data.pkl --full 2>&1 | tee conversion_full_out.txt`, updating the runtime estimate as it runs; if it exceeds 1.5x the previous estimate, kill the process, optimize bottlenecks, and repeat. Then run `python -u train_decoder.py converted_data.pkl --verify-only 2>&1 | tee verification_full_out.txt`, check that no data was lost, spot-check a few sessions for integrity, check dataset statistics against the reference texts, and revise the conversion script until ALL statistics are consistent.
- Critical review (Step 10): document each iteration (what was found, what was fixed, what the re-check showed) and write a report of every check to `CONVERSION_NOTES.md` to convince the user the conversion code works. Do not proceed until all issues are resolved.
- Full decoder training (Step 11): run `python -u train_decoder.py converted_data.pkl --plot-samples 2>&1 | tee train_decoder_full_out.txt` (use `--cpu` if the GPU lacks sufficient RAM), wait for complete execution, verify the loss decreases over epochs, and check accuracy. High accuracy indicates the conversion was done correctly; accuracy near chance signals possible issues in temporal alignment, choice of data streams, or filtering. Compare accuracy to expectations from the paper and document results in `CONVERSION_NOTES.md` with a table of accuracies.
- Critical review 2 (Step 12): verify output values by loading raw data and checking 3 specific trials, plot neural and output signals for a single trial to confirm they are synchronized, check that the output has enough variation (not 99% one class), confirm the filtering steps for neural activity streams were followed, and check that processing matches the reference code and paper. Iteration protocol: if ANY check reveals an issue, fix it, re-run the affected steps, and iterate until no issues remain.
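The accuracy-vs-chance check in the protocol above can be sketched as a label-shuffle test: compare decoder accuracy to the distribution of accuracies obtained on permuted labels. The counts and decoder accuracy below are illustrative assumptions, not values from the paper.

```python
# Label-shuffle sketch of an "above chance" check for a classification output.
import numpy as np

rng = np.random.default_rng(1)

def chance_level(labels: np.ndarray, n_shuffles: int = 1000) -> float:
    """95th percentile of accuracy obtained by shuffling the labels."""
    accs = []
    for _ in range(n_shuffles):
        shuffled = rng.permutation(labels)
        accs.append(np.mean(shuffled == labels))
    return float(np.percentile(accs, 95))

labels = rng.integers(0, 2, size=200)  # hypothetical binary task labels
decoder_acc = 0.74                     # hypothetical decoder accuracy
threshold = chance_level(labels)
above_chance = decoder_acc > threshold
print(f"chance (95th pct of shuffles): {threshold:.2f}, above chance: {above_chance}")
```

A decoder whose accuracy falls inside the shuffle distribution is indistinguishable from chance, which in this benchmark would point to a conversion bug rather than a hard task.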