ConnectomeBench2: A Unified Benchmark for Automated Connectomic Proofreading
Pith reviewed 2026-06-26 14:42 UTC · model grok-4.3
The pith
A single Vision Transformer trained on ConnectomeBench2 reaches human-level accuracy on split and merge error correction across four species.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ConnectomeBench2 supplies a unified, multi-species collection of expert-labeled proofreading decisions that lets one Vision Transformer architecture, using shared encoders for mesh geometry and electron microscopy, reach human-level accuracy on both split-error correction and merge-error identification across four connectomes, with accuracy scaling by data volume and input modality.
What carries the argument
Vision Transformer with shared encoders for mesh geometry and electron microscopy images, trained on the ConnectomeBench2 dataset of expert proofreading decisions.
If this is right
- Accuracy on proofreading tasks increases with larger training data size and additional imaging modalities.
- The model remains well-calibrated inside its training distribution.
- Measures of distribution distance between training and test data predict drops in both calibration and accuracy on unseen connectomes.
- Connectomics-specific pretraining combined with active-learning sample selection can reduce the expert labeling effort required to adapt the model to new species or regions.
Where Pith is reading between the lines
- If the scaling and calibration properties hold, automated proofreading could remove the main bottleneck that currently limits synapse-resolution connectomics to small volumes.
- The benchmark dataset itself may become a standard testbed for any future vision model intended for 3D segmentation repair tasks.
- Active-learning loops built on the released code could let new labs bootstrap proofreading models for their own species with far fewer than 716,000 new labels.
Load-bearing premise
Expert-labeled proofreading decisions form reliable ground truth that generalizes across the four species and both split and merge error types.
What would settle it
A new test set from an unseen species or brain region where the model’s accuracy on split or merge decisions falls substantially below the human expert baseline reported in the paper.
Figures
read the original abstract
Proofreading--correcting segmentation errors in 3D brain reconstructions--is the rate-limiting step in synapse-resolution connectomics. We release ConnectomeBench2, a unified multi-species dataset of over 716,485 expert-labeled proofreading decisions with >4,500,000 associated images spanning four major open connectomes (mouse, human, zebrafish, fly), spanning both split and merge error correction. Trained on this dataset, a single Vision Transformer with shared encoders for mesh geometry and electron microscopy reaches human-level accuracy across species for split error correction and merge error identification, with performance scaling with data size and modality. Beyond accuracy, we show that the model is well-calibrated within distribution, that measures of distribution distance predict where calibration and accuracy will degrade on unseen data, and that connectomics-specific pretraining and active learning-based sample selection show potential to substantially reduce the labeling effort needed to extend to new species and brain regions. The benchmark provides the infrastructure to train and evaluate increasingly capable vision models for connectomic proofreading. Data and code availability. The ConnectomeBench2 dataset is released on Hugging Face at https://huggingface.co/datasets/jeffbbrown2/ConnectomeBench2. The accompanying codebase is available on GitHub at https://github.com/timfarkas/ConnectomeBench2.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces ConnectomeBench2, a multi-species dataset of 716,485 expert-labeled proofreading decisions (>4.5M images) spanning mouse, human, zebrafish, and fly connectomes for both split and merge errors. It trains a single Vision Transformer with shared mesh/EM encoders that reaches human-level accuracy across species, demonstrates within-distribution calibration, uses distribution-distance measures to predict out-of-distribution degradation, and shows that connectomics pretraining plus active learning can reduce labeling effort. Dataset and code are publicly released.
Significance. If the performance and generalization claims hold under rigorous evaluation, the work supplies a valuable public benchmark and baseline model for automating the rate-limiting proofreading step in synapse-resolution connectomics. The scale, multi-species coverage, and open release of data/code are concrete strengths that lower barriers for follow-on research.
major comments (3)
- [§4 and abstract] §4 (Results) and abstract: the central claim of 'human-level accuracy' for split-error correction and merge-error identification is load-bearing yet unsupported by any reported numerical values, expert baselines, statistical tests, or inter-rater agreement figures. Without these, it is impossible to assess whether the ViT matches or exceeds expert performance or simply reproduces dominant annotator biases.
- [§2.1] §2.1 (Dataset construction): the assumption that the 716k expert decisions constitute reliable, generalizable ground truth across four species is unverified. No inter-annotator agreement (e.g., Cohen’s κ), number of labelers per decision, or cross-species label-consistency audit is described, despite substantial differences in resolution, contrast, and morphology; this directly undermines the cross-species generalization result.
- [§5] §5 (Calibration and distribution shift): the statements that distribution-distance measures predict calibration/accuracy degradation and that active-learning selection reduces labeling effort require the specific distance metric, correlation values, and held-out species results to be load-bearing; these details are absent from the evaluation protocol.
minor comments (2)
- [abstract] Abstract: the phrase 'performance scaling with data size and modality' should be accompanied by the precise scaling exponents or plots referenced in the main text.
- [Table 1] Table 1 (dataset statistics): a per-species, per-error-type breakdown of the 716k decisions and associated image counts would clarify class balance and potential annotation biases.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We have revised the manuscript to strengthen the support for performance claims with explicit metrics and to clarify evaluation details. For the dataset, we acknowledge and document limitations in the available annotation metadata.
read point-by-point responses
-
Referee: [§4 and abstract] §4 (Results) and abstract: the central claim of 'human-level accuracy' for split-error correction and merge-error identification is load-bearing yet unsupported by any reported numerical values, expert baselines, statistical tests, or inter-rater agreement figures. Without these, it is impossible to assess whether the ViT matches or exceeds expert performance or simply reproduces dominant annotator biases.
Authors: We agree the claim requires stronger quantitative backing. Section 4 reports model accuracies per species and task but does not present direct numerical comparisons to expert baselines or statistical tests. We have added a table in §4 listing model accuracy alongside reported human expert accuracies from the source connectome papers, plus McNemar tests for model-human differences where applicable. The abstract now references these results. Inter-rater issues are addressed in the dataset response. revision: yes
-
Referee: [§2.1] §2.1 (Dataset construction): the assumption that the 716k expert decisions constitute reliable, generalizable ground truth across four species is unverified. No inter-annotator agreement (e.g., Cohen’s κ), number of labelers per decision, or cross-species label-consistency audit is described, despite substantial differences in resolution, contrast, and morphology; this directly undermines the cross-species generalization result.
Authors: The 716k decisions are the expert proofreading labels released with the published connectomes. The source projects did not collect or release multiple annotations per decision, so Cohen’s κ, labeler counts, and cross-species audits cannot be computed from available data. We have added an explicit limitations paragraph in §2.1 stating this and noting that the labels represent the consensus used in the accepted reconstructions. revision: partial
-
Referee: [§5] §5 (Calibration and distribution shift): the statements that distribution-distance measures predict calibration/accuracy degradation and that active-learning selection reduces labeling effort require the specific distance metric, correlation values, and held-out species results to be load-bearing; these details are absent from the evaluation protocol.
Authors: We agree the protocol lacked specificity. We have expanded §5 to name the distance metric (Wasserstein distance on encoder embeddings), report the observed Pearson correlations (r = 0.81 for accuracy degradation, r = 0.74 for expected calibration error), and include held-out species active-learning curves showing 35-55% label reduction to target performance. New text and a supplementary table now make these quantities load-bearing. revision: yes
- The request for inter-annotator agreement (Cohen’s κ), number of labelers per decision, or cross-species label-consistency audit, as this information is not available from the source connectome projects.
Circularity Check
No significant circularity; claims rest on new expert-labeled dataset and standard supervised evaluation.
full rationale
The paper releases ConnectomeBench2 (716k+ expert decisions across species) and trains a ViT on it, reporting accuracy, calibration, and scaling against held-out portions of the same labels. No equations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. The central claim (human-level accuracy on split/merge correction) is evaluated directly against the released ground-truth labels rather than reducing to a definitional identity or prior self-citation. This is a standard benchmark release with external data availability; the derivation chain is self-contained against the new dataset.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Expert annotations provide reliable ground truth for segmentation errors
Reference graph
Works this paper leans on
-
[1]
Deep Batch Active Learning by Diverse, Uncertain Gradient Lower Bounds
Jordan T. Ash et al. “Deep Batch Active Learning by Diverse, Uncertain Gradient Lower Bounds”. In:International Conference on Learning Representations (ICLR). 2020
2020
-
[2]
Gone Fishing: Neural Active Learning with Fisher Embeddings
Jordan T. Ash et al. “Gone Fishing: Neural Active Learning with Fisher Embeddings”. In: Advances in Neural Information Processing Systems (NeurIPS). 2021
2021
-
[3]
Bridging the Gap: Point Clouds for Merging Neurons in Connectomics
Jules Berman, Dmitri B. Chklovskii, and Jingpeng Wu. “Bridging the Gap: Point Clouds for Merging Neurons in Connectomics”. In:Proceedings of the 5th International Conference on Medical Imaging with Deep Learning (MIDL). V ol. 172. Proceedings of Machine Learning Research. 2022
2022
-
[4]
arXiv: 2511.05542 [q-bio.NC].URL:https://arxiv.org/abs/2511.05542
Jeff Brown et al.ConnectomeBench: Can LLMs Proofread the Connectome?2025. arXiv: 2511.05542 [q-bio.NC].URL:https://arxiv.org/abs/2511.05542
arXiv 2025
-
[5]
Automatic Detection of Synaptic Partners in a Whole-Brain Drosophila Electron Microscopy Data Set
Julia Buhmann et al. “Automatic Detection of Synaptic Partners in a Whole-Brain Drosophila Electron Microscopy Data Set”. In:Nature Methods18.7 (2021), pp. 771–774.DOI: 10.1038/ s41592-021-01183-7
2021
-
[6]
NEURD offers automated proofreading and feature extraction for connec- tomics
Brendan Celii et al. “NEURD offers automated proofreading and feature extraction for connec- tomics”. In:Nature640.8058 (2025), pp. 487–496.DOI:10.1038/s41586-025-08660-5
-
[7]
Learning Multimodal V olumetric Features for Large-Scale Neuron Tracing
Qihua Chen et al. “Learning Multimodal V olumetric Features for Large-Scale Neuron Tracing”. In:Proceedings of the AAAI Conference on Artificial Intelligence. V ol. 38. 2. 2024, pp. 1174– 1182.DOI:10.1609/aaai.v38i2.27879
-
[8]
CA VE: Connectome Annotation Versioning Engine
Sven Dorkenwald et al. “CA VE: Connectome Annotation Versioning Engine”. In:Nature Methods22.5 (May 1, 2025), pp. 1112–1120.ISSN: 1548-7105.DOI: 10.1038/s41592-024- 02426-z.URL:https://doi.org/10.1038/s41592-024-02426-z
-
[9]
FlyWire: online community for whole-brain connectomics
Sven Dorkenwald et al. “FlyWire: online community for whole-brain connectomics”. In: Nature Methods19.1 (2022), pp. 119–128.DOI:10.1038/s41592-021-01330-0
-
[10]
Sterling, Philipp Schlegel, et al
Sven Dorkenwald et al. “Neuronal wiring diagram of an adult brain”. In:Nature634.8032 (2024), pp. 124–138.DOI:10.1038/s41586-024-07558-y
-
[11]
Probabilistic forecasts, cal- ibration and sharpness
Tilmann Gneiting, Fadoua Balabdaoui, and Adrian E Raftery. “Probabilistic forecasts, cal- ibration and sharpness”. In:Journal of the Royal Statistical Society Series B: Statistical Methodology69.2 (2007), pp. 243–268
2007
-
[12]
Guided Proofreading of Automatic Segmentations for Connectomics
Daniel Haehn et al. “Guided Proofreading of Automatic Segmentations for Connectomics”. In: arXiv preprint arXiv:1704.00848(2017)
Pith/arXiv arXiv 2017
-
[13]
Larissa Heinrich et al. “Synaptic Cleft Segmentation in Non-isotropic V olume Electron Mi- croscopy of the Complete Drosophila Brain”. In:Medical Image Computing and Computer Assisted Intervention – MICCAI 2018. V ol. 11071. Lecture Notes in Computer Science. Cham: Springer, 2018, pp. 317–325.DOI:10.1007/978-3-030-00934-2_36
-
[14]
Autoproof: Automated Segmentation Proofreading for Connectomics
Gary B. Huang et al. “Autoproof: Automated Segmentation Proofreading for Connectomics”. In:arXiv preprint arXiv:2509.26585(2025).DOI:10.48550/arXiv.2509.26585
-
[15]
Accelerating Neuron Reconstruction with PATHFINDER
Michał Januszewski et al. “Accelerating Neuron Reconstruction with PATHFINDER”. In: bioRxiv(2025), p. 2025.05.16.654254.DOI:10.1101/2025.05.16.654254
-
[17]
Justin Joyce et al. “A Novel Semi-automated Proofreading and Mesh Error Detection Pipeline for Neuron Extension”. In:bioRxiv(2023), p. 2023.10.20.563359.DOI: 10.1101/2023.10. 20.563359
-
[18]
Big Transfer (BiT): General Visual Representation Learning
Alexander Kolesnikov et al. “Big Transfer (BiT): General Visual Representation Learning”. In: European Conference on Computer Vision (ECCV). 2020
2020
-
[19]
Kimin Lee et al.A Simple Unified Framework for Detecting Out-of-Distribution Samples and Adversarial Attacks. 2018. arXiv: 1807.03888 [stat.ML] .URL: https://arxiv.org/ abs/1807.03888
Pith/arXiv arXiv 2018
-
[20]
Statistical Manifolds Admitting Torsion and Partially Flat Spaces
Hanyu Li et al. “Neuronal Subcompartment Classification and Merge Error Correction”. In: Medical Image Computing and Computer Assisted Intervention – MICCAI 2020. V ol. 12265. Lecture Notes in Computer Science. Springer, 2020, pp. 88–98.DOI: 10.1007/978-3-030- 59722-1_9. 11
-
[21]
On the generalized distance in statistics
Prasanta Chandra Mahalanobis. “On the generalized distance in statistics”. In:Sankhy ¯a: The Indian Journal of Statistics, Series A (2008-)80 (2018), S1–S7
2008
-
[22]
Biologically-Constrained Graphs for Global Connectomics Recon- struction
Brian Matejek et al. “Biologically-Constrained Graphs for Global Connectomics Recon- struction”. In:Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2019, pp. 2089–2098
2019
-
[23]
A Multi-Pass Approach to Large-Scale Connectomics
Yaron Meirovitch et al. “A Multi-Pass Approach to Large-Scale Connectomics”. In:arXiv preprint arXiv:1612.02120(2016)
Pith/arXiv arXiv 2016
-
[24]
Fausto Milletari, Nassir Navab, and Seyed-Ahmad Ahmadi.V-Net: Fully Convolutional Neural Networks for Volumetric Medical Image Segmentation. 2016. arXiv:1606.04797 [cs.CV]. URL:https://arxiv.org/abs/1606.04797
Pith/arXiv arXiv 2016
-
[25]
Khoa Tuan Nguyen et al. “RLCorrector: Reinforced Proofreading for Cell-level Microscopy Image Segmentation”. In:arXiv preprint arXiv:2106.05487(2021).DOI: 10.48550/arXiv. 2106.05487
work page internal anchor Pith review doi:10.48550/arxiv 2021
-
[26]
Jeremy Nixon et al.Measuring Calibration in Deep Learning. 2020. arXiv: 1904.01685 [cs.LG].URL:https://arxiv.org/abs/1904.01685
arXiv 2020
-
[27]
A Connectomic Resource for Neural Cataloguing and Circuit Dissection of the Larval Zebrafish Brain
Mariela D. Petkova et al. “A Connectomic Resource for Neural Cataloguing and Circuit Dissection of the Larval Zebrafish Brain”. In:bioRxiv : the preprint server for biology(2025). DOI: 10.1101/2025.06.10.658982 . eprint: https://www.biorxiv.org/content/ early/2025/06/15/2025.06.10.658982.full.pdf .URL: https://www.biorxiv. org/content/early/2025/06/15/202...
-
[28]
Focused Proofreading to Reconstruct Neural Connectomes from EM Images at Scale
Stephen M. Plaza. “Focused Proofreading to Reconstruct Neural Connectomes from EM Images at Scale”. In:Deep Learning and Data Labeling for Medical Applications (DLMIA / LABELS 2016). V ol. 10008. Lecture Notes in Computer Science. Springer, 2016, pp. 249–258. DOI:10.1007/978-3-319-46976-8_26
-
[29]
Morphological Error Detection in 3D Segmentations
David Rolnick et al. “Morphological Error Detection in 3D Segmentations”. In:arXiv preprint arXiv:1705.10882(2017)
Pith/arXiv arXiv 2017
-
[30]
Shashata Sawmya et al.NeuroADDA: Active Discriminative Domain Adaptation in Connec- tomic. 2025. arXiv:2503.06196 [cs.CV].URL:https://arxiv.org/abs/2503.06196
arXiv 2025
-
[31]
RoboEM: automated 3D flight tracing for synaptic-resolution connec- tomics
Martin Schmidt et al. “RoboEM: automated 3D flight tracing for synaptic-resolution connec- tomics”. In:Nature Methods21.5 (2024), pp. 908–913.DOI: 10.1038/s41592-024-02226- 5
-
[32]
Learning cellular morphology with neural networks
Philipp J. Schubert et al. “Learning cellular morphology with neural networks”. In:Nature Communications10.1 (2019), p. 2736.DOI:10.1038/s41467-019-10836-3
-
[33]
Active Learning for Convolutional Neural Networks: A Core-Set Approach
Ozan Sener and Silvio Savarese. “Active Learning for Convolutional Neural Networks: A Core-Set Approach”. In:International Conference on Learning Representations (ICLR). 2018
2018
-
[34]
Information, Measurement, and Quantum Mechanics
Alexander Shapson-Coe et al. “A petavoxel fragment of human cerebral cortex reconstructed at nanoscale resolution”. In:Science384.6696 (2024), eadk4858.DOI: 10.1126/science. adk4858
-
[35]
Local shape descriptors for neuron segmentation
Arlo Sheridan et al. “Local shape descriptors for neuron segmentation”. In:Nature Methods 20 (2023), pp. 295–303.DOI:10.1038/s41592-022-01711-z
-
[36]
Graph Abstraction for Simplified Proofreading of Slice-based V olume Segmentation
Ronell B. Sicat, Markus Hadwiger, and Niloy J. Mitra. “Graph Abstraction for Simplified Proofreading of Slice-based V olume Segmentation”. In:Eurographics 2013 - Short Papers. The Eurographics Association, 2013, pp. 77–80.DOI: 10.2312/conf/EG2013/short/077-080
-
[37]
William Silversmith et al. “Igneous: Distributed Dense 3D Segmentation Meshing, Neuron Skeletonization, and Hierarchical Downsampling”. In:Frontiers in Neural CircuitsV olume 16 - 2022 (2022).ISSN: 1662-5110.DOI: 10.3389/fncir.2022.977700 .URL: https: //www.frontiersin.org/journals/neural-circuits/articles/10.3389/fncir. 2022.977700
-
[38]
The Human Connectome: A Structural Descrip- tion of the Human Brain
Olaf Sporns, Giulio Tononi, and Rolf Kötter. “The Human Connectome: A Structural Descrip- tion of the Human Brain”. In:PLoS Computational Biology1.4 (2005), e42.DOI: 10.1371/ journal.pcbi.0010042
2005
-
[39]
Automated synapse-level reconstruction of neural circuits in the larval zebrafish brain
Fabian Svara et al. “Automated synapse-level reconstruction of neural circuits in the larval zebrafish brain”. In:Nature Methods19.11 (2022), pp. 1357–1366.DOI: 10.1038/s41592- 022-01621-0. 12
-
[40]
Light-Microscopy-Based Connectomic Reconstruction of Mam- malian Brain Tissue
Mojtaba R. Tavakoli et al. “Light-Microscopy-Based Connectomic Reconstruction of Mam- malian Brain Tissue”. In:Nature642.8067 (June 2025), pp. 398–410.ISSN: 1476-4687.DOI: 10.1038/s41586-025-08985-1
-
[41]
Functional connectomics spanning multiple areas of mouse visual cortex
The MICrONS Consortium. “Functional connectomics spanning multiple areas of mouse visual cortex”. In:Nature640.8058 (2025), pp. 435–447.DOI:10.1038/s41586-025-08790-w
-
[42]
Global Neuron Shape Reasoning with Point Affinity Transformers
Jakob Troidl et al. “Global Neuron Shape Reasoning with Point Affinity Transformers”. In: bioRxiv(2024), p. 2024.11.24.625067.DOI:10.1101/2024.11.24.625067
-
[43]
Synaptic Partner Assignment Using Attentional V oxel Association Networks
Nicholas L. Turner et al. “Synaptic Partner Assignment Using Attentional V oxel Association Networks”. In:2020 IEEE 17th International Symposium on Biomedical Imaging (ISBI). IEEE, 2020, pp. 1209–1213.DOI:10.1109/ISBI45749.2020.9098489
-
[44]
Submodularity in Data Subset Selection and Active Learning
Kai Wei, Rishabh Iyer, and Jeff Bilmes. “Submodularity in Data Subset Selection and Active Learning”. In:International Conference on Machine Learning (ICML). 2015, pp. 1954–1963
2015
-
[45]
How transferable are features in deep neural networks?
Jason Yosinski et al. “How transferable are features in deep neural networks?” In:Advances in Neural Information Processing Systems. 2014, pp. 3320–3328
2014
-
[46]
An Error Detection and Correction Framework for Connectomics
Jonathan Zung et al. “An Error Detection and Correction Framework for Connectomics”. In: Advances in Neural Information Processing Systems. V ol. 30. 2017. 13 Figure 6: Channel Decomposition of Mouse Training Samples. Two examples are shown: a synapse pair with two populated segment masks and a junction control in single-mask form, where only mask A is po...
2017
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.