arxiv: 2606.17475 · v1 · pith:DPIIBCG2new · submitted 2026-06-16 · 💻 cs.CV

StereoFactory: A Unified Merging Framework for Robust Stereo Matching

Pith reviewed 2026-06-27 01:58 UTC · model grok-4.3

classification 💻 cs.CV
keywords stereo matching · model merging · genetic algorithm · CMA-ES · evolutionary optimization · foundation models · task vectors · computer vision

The pith

StereoFactory merges stereo checkpoints with genetic subset search and CMA-ES routing, reaching the best four-benchmark average at a small fraction of joint-retraining time.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes that post-hoc model merging can overcome the scalability issues of foundation models in stereo matching by adaptively combining knowledge from multiple checkpoints. It introduces a two-stage process where a genetic algorithm first selects which models to include and CMA-ES then optimizes how to route their contributions at the module level. This approach achieves lower average errors across four benchmarks while using only a small fraction of the time required for joint retraining. A sympathetic reader would care because it offers a way to incorporate new data without expensive retraining. The results show consistent improvements over baselines that either use all models or greedy selection.
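As a rough illustration of what routing contributions at the module level could look like, the sketch below merges task vectors (fine-tuned weights minus base weights) using one weight per checkpoint per module. The module names, the toy weights, and the helper `merge_routed` are hypothetical; this summary does not give the paper's actual parameterization.

```python
from typing import Dict, List

Weights = Dict[str, List[float]]  # module name -> flat parameter list


def task_vector(finetuned: Weights, base: Weights) -> Weights:
    """Task vector = fine-tuned weights minus base weights, per module."""
    return {m: [f - b for f, b in zip(finetuned[m], base[m])] for m in base}


def merge_routed(base: Weights,
                 task_vectors: List[Weights],
                 routing: List[Dict[str, float]]) -> Weights:
    """Add each task vector to the base, scaled by its per-module weight."""
    merged = {m: list(p) for m, p in base.items()}
    for tv, weights in zip(task_vectors, routing):
        for module, delta in tv.items():
            w = weights.get(module, 0.0)  # missing entry means "not routed"
            merged[module] = [p + w * d for p, d in zip(merged[module], delta)]
    return merged


# Toy two-module model with two fine-tuned checkpoints (all values made up).
base = {"encoder": [1.0, 1.0], "refiner": [0.0, 0.0]}
ckpt_a = {"encoder": [2.0, 1.0], "refiner": [0.0, 4.0]}
ckpt_b = {"encoder": [1.0, 3.0], "refiner": [2.0, 0.0]}
tvs = [task_vector(ckpt_a, base), task_vector(ckpt_b, base)]

# Checkpoint A feeds only the encoder; checkpoint B feeds both modules.
routing = [{"encoder": 1.0, "refiner": 0.0},
           {"encoder": 0.5, "refiner": 1.0}]
merged = merge_routed(base, tvs, routing)
print(merged)
```

Setting every routing weight to the same value recovers uniform task-vector averaging, which is the kind of baseline the module-level scheme is compared against.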

Core claim

StereoFactory is a coarse-to-fine evolutionary framework that first uses a genetic algorithm to search the combinatorial space of model subsets and then applies CMA-ES to optimize architecture-adaptive routing over the selected task vectors, with optional module-level scaling. It achieves the best four-benchmark average under the same checkpoint pool while requiring only 2.7–3.7% of joint-retraining wall-clock time.
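The subset-selection stage can be pictured with a small genetic algorithm over binary inclusion masks. This is a minimal sketch under stated assumptions: `GAIN`, `CONFLICT`, and `proxy_error` are hypothetical stand-ins for a real validation error, and the operators (tournament selection, uniform crossover, bit-flip mutation, elitism) are common textbook choices rather than the paper's documented settings.

```python
import random
from typing import List

GAIN = [0.9, 0.1, 0.7, 0.2, 0.6]       # hypothetical per-checkpoint benefit
CONFLICT = {(0, 1): 0.8, (2, 3): 0.9}  # hypothetical pairwise interference


def proxy_error(mask: List[int]) -> float:
    """Toy stand-in for the merged model's validation error (lower is better)."""
    err = 5.0 - sum(g for g, bit in zip(GAIN, mask) if bit)
    err += sum(p for (i, j), p in CONFLICT.items() if mask[i] and mask[j])
    return err


def genetic_subset_search(n: int, pop_size: int = 12, generations: int = 30,
                          mutation_rate: float = 0.1, seed: int = 0) -> List[int]:
    rng = random.Random(seed)
    # Start from the "use every checkpoint" mask plus random masks.
    pop = [[1] * n] + [[rng.randint(0, 1) for _ in range(n)]
                       for _ in range(pop_size - 1)]
    for _ in range(generations):
        pop.sort(key=proxy_error)
        next_pop = [pop[0][:]]  # elitism: the best mask always survives
        while len(next_pop) < pop_size:
            # Tournament selection of two parents.
            p1 = min(rng.sample(pop, 3), key=proxy_error)
            p2 = min(rng.sample(pop, 3), key=proxy_error)
            # Uniform crossover, then bit-flip mutation.
            child = [rng.choice(pair) for pair in zip(p1, p2)]
            child = [1 - b if rng.random() < mutation_rate else b for b in child]
            next_pop.append(child)
        pop = next_pop
    return min(pop, key=proxy_error)


best_mask = genetic_subset_search(len(GAIN))
print(best_mask, proxy_error(best_mask))
```

Because the all-ones mask is in the initial population and the best individual is carried forward, the returned subset is never worse on the proxy than merging everything. That says nothing about generalization, which is the concern raised under the load-bearing premise.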

What carries the argument

The two-stage evolutionary process of genetic-algorithm subset selection followed by CMA-ES optimization of module-level routing over task vectors.
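For the second stage, the sketch below optimizes a vector of routing weights with a plain (μ+λ) evolution strategy. This is deliberately not CMA-ES: it uses isotropic Gaussian noise and a fixed step-size decay, whereas CMA-ES adapts a full covariance matrix and step size. The loss function and the target weights are hypothetical placeholders for a validation error over routed merges.

```python
import random
from typing import Callable, List


def simple_es(loss: Callable[[List[float]], float], x0: List[float],
              sigma: float = 0.3, lam: int = 16, mu: int = 4,
              iterations: int = 60, seed: int = 0) -> List[float]:
    """Minimal (mu+lambda) evolution strategy; a simplified stand-in for CMA-ES."""
    rng = random.Random(seed)
    parents = [list(x0)]
    for _ in range(iterations):
        # Recombine: coordinate-wise mean of the current parents.
        mean = [sum(col) / len(parents) for col in zip(*parents)]
        # Sample lambda offspring around the mean with isotropic Gaussian noise.
        offspring = [[m + sigma * rng.gauss(0.0, 1.0) for m in mean]
                     for _ in range(lam)]
        # Plus-selection: keep the mu best among parents and offspring.
        parents = sorted(parents + offspring, key=loss)[:mu]
        sigma *= 0.97  # crude decay; CMA-ES adapts the step size instead
    return parents[0]


# Hypothetical objective: distance to an unknown "ideal" routing vector.
TARGET = [1.0, 0.0, 0.5, 1.0]


def toy_loss(w: List[float]) -> float:
    return sum((wi - ti) ** 2 for wi, ti in zip(w, TARGET))


x0 = [0.5, 0.5, 0.5, 0.5]  # start from uniform routing weights
best_w = simple_es(toy_loss, x0)
print(toy_loss(x0), toy_loss(best_w))
```

In a real pipeline each loss evaluation would build a merged model and score it on validation data, so the evaluation budget (`lam * iterations`) is what drives the wall-clock cost of the search.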

Load-bearing premise

The genetic algorithm search for model subsets and the subsequent CMA-ES optimization of module-level routing will generalize beyond the specific validation data used during the search without introducing selection bias or overfitting.

What would settle it

Running the merged models on a new stereo matching benchmark withheld from the search process and finding that the error reduction disappears or reverses relative to the strongest baseline would falsify the central performance claim.
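One way to phrase this test in code is a leave-one-benchmark-out loop, sketched below. The benchmark labels and error values are placeholders, not results from the paper, and `gain_survives` is a hypothetical helper.

```python
from typing import Dict, List, Tuple


def leave_one_benchmark_out(benchmarks: List[str]) -> List[Tuple[List[str], str]]:
    """Return (search_set, held_out) pairs; the held-out benchmark is never searched on."""
    return [([b for b in benchmarks if b != held], held) for held in benchmarks]


def gain_survives(merged_err: Dict[str, float],
                  baseline_err: Dict[str, float],
                  held_out: str) -> bool:
    """True if the merged model still beats the baseline on the unseen benchmark."""
    return merged_err[held_out] < baseline_err[held_out]


benchmarks = ["bench_A", "bench_B", "bench_C", "bench_D"]  # placeholder names
splits = leave_one_benchmark_out(benchmarks)

# Placeholder errors for illustration only; NOT numbers from the paper.
merged_err = {"bench_A": 2.0, "bench_B": 3.1, "bench_C": 4.0, "bench_D": 2.5}
baseline_err = {"bench_A": 2.4, "bench_B": 3.0, "bench_C": 4.6, "bench_D": 2.9}

for search_set, held in splits:
    # In a real run, the GA and CMA-ES searches would use only `search_set`.
    print(held, gain_survives(merged_err, baseline_err, held))
```

If the gain fails on most held-out splits, the central claim would be in trouble; if it holds on all of them, the overfitting concern loses much of its force.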

Figures

Figures reproduced from arXiv: 2606.17475 by Mang Ye, Pinhan Fu, Qin Zou, Ruilin Wang, Wenke Huang, Xianda Guo.

Figure 1: Motivation of StereoFactory. Lower is better for all metrics.
Figure 2: Overview of the StereoFactory framework. Stage 1 searches for the optimal subset of fine-tuned checkpoints via evolutionary search
Figure 3: Qualitative comparison across NMRF-Stereo and FoundationStereo. We compare uniform averaging (Mean) with StereoFactory (Ours) on representative samples from four target benchmarks. StereoFactory produces cleaner structures and fewer localized artifacts across different architectural families.
Figure 4: Selection statistics of the FoundationStereo training dataset. Across validation settings where the FoundationStereo training dataset is in the candidate pool, we compare its selection frequency and average fusion weight against the strongest competing source. The dataset is frequently selected with non-trivial weights, indicating that it provides a useful but not universally dominant source of transferabl…
Figure 5: Population-size ablation for Stage 1 evolutionary search.
Original abstract

Stereo matching has advanced through foundation models trained on large-scale datasets, yet this paradigm suffers from a scalability bottleneck: incorporating new data requires costly joint retraining. Model merging offers a scalable post-hoc alternative by integrating knowledge from specialized models after source checkpoints are available. However, existing merging methods typically retain all available models or rely on greedy inclusion, which can preserve harmful task-vector interference. We propose StereoFactory, a coarse-to-fine evolutionary framework for adaptive model merging. Stage 1 employs a genetic algorithm to search the combinatorial space of model subsets, determining which models should participate. Stage 2 addresses module-level knowledge specialization (different functional modules exhibit distinct preferences for knowledge sources) through CMA-ES optimization of architecture-adaptive routing over the selected task vectors, with optional module-level scaling. Experiments across two architectures and four benchmarks demonstrate that StereoFactory consistently achieves the best four-benchmark average under the same checkpoint pool, reducing the average error from 3.80 to 3.30 on NMRF and from 2.88 to 2.19 on FoundationStereo relative to the strongest controlled baseline. The post-hoc search requires only 2.7–3.7% of the corresponding joint-retraining wall-clock time. Analysis reveals that knowledge contributions are inherently module-specific, and selected subsets can transfer across architectures with minimal degradation. Code will be publicly released upon acceptance at: https://github.com/XiandaGuo/StereoFactory.
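To put the headline numbers in relative terms, the short calculation below converts the reported error drops into percentage reductions and the wall-clock fraction into an implied speedup. It uses only the figures quoted in the abstract; the function name is illustrative.

```python
def relative_reduction(before: float, after: float) -> float:
    """Percentage drop from `before` to `after`."""
    return (before - after) / before * 100.0


nmrf = relative_reduction(3.80, 3.30)              # NMRF average error
foundation = relative_reduction(2.88, 2.19)        # FoundationStereo average error
speedup_low, speedup_high = 100.0 / 3.7, 100.0 / 2.7  # from 2.7-3.7% of retraining time

print(f"NMRF: {nmrf:.1f}% lower error")
print(f"FoundationStereo: {foundation:.1f}% lower error")
print(f"Search is roughly {speedup_low:.0f}x to {speedup_high:.0f}x faster than joint retraining")
```

The reductions come to about 13.2% and 24.0%, and the time fraction corresponds to roughly a 27x to 37x speedup over joint retraining.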

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper presents StereoFactory, a coarse-to-fine evolutionary model-merging framework for stereo matching. Stage 1 applies a genetic algorithm to search over subsets of available checkpoints; Stage 2 applies CMA-ES to optimize module-level routing (and optional scaling) over the selected task vectors. The central empirical claim is that the resulting merged models achieve the best four-benchmark average under a fixed checkpoint pool, reducing average error from 3.80 to 3.30 (NMRF) and from 2.88 to 2.19 (FoundationStereo) relative to the strongest controlled baseline, while requiring only 2.7–3.7 % of the wall-clock time of joint retraining. Additional claims include module-specific knowledge contributions and limited degradation when subsets transfer across architectures.

Significance. If the reported gains survive proper controls for search overfitting, the work supplies a practical post-hoc route to incorporating new stereo data without full retraining of foundation models. The planned public code release and the explicit analysis of module-level specialization are concrete strengths that would aid reproducibility and further research.

major comments (2)
  1. [Experiments] Experiments section: the manuscript provides no description of held-out validation splits used during the genetic-algorithm subset search or the subsequent CMA-ES routing optimization. Because the four reported benchmarks supply both the checkpoint pool and the final metrics, the absence of such splits leaves open the possibility that the discovered configurations exploit benchmark-specific statistics rather than generalizing.
  2. [Abstract and Experiments] Abstract and Experiments: the quantitative claims (e.g., error reductions to 3.30 and 2.19) are presented without reporting the number of independent search runs, standard deviations, or statistical significance tests, and without explicit controls for post-hoc selection bias in the evolutionary procedure. These omissions directly affect the load-bearing claim that StereoFactory “consistently achieves the best four-benchmark average.”
minor comments (1)
  1. [Abstract] The abstract states that code “will be publicly released upon acceptance” yet already supplies a GitHub link; the repository should include the precise search seeds, validation splits (if any), and configuration files used to produce the reported numbers.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on experimental validation and statistical reporting. We address each major comment below and will incorporate revisions to strengthen these aspects of the manuscript.

Point-by-point responses
  1. Referee: [Experiments] Experiments section: the manuscript provides no description of held-out validation splits used during the genetic-algorithm subset search or the subsequent CMA-ES routing optimization. Because the four reported benchmarks supply both the checkpoint pool and the final metrics, the absence of such splits leaves open the possibility that the discovered configurations exploit benchmark-specific statistics rather than generalizing.

    Authors: We agree this is a valid concern: the manuscript does not describe held-out splits, and direct optimization on the reported benchmarks could in principle allow exploitation of benchmark-specific statistics. In the original experiments the searches operated on the standard benchmark evaluation protocols to identify merging configurations. For the revision we will add an explicit held-out validation protocol (reserving one benchmark or a data subset exclusively for search and CMA-ES, with final metrics reported on the remaining benchmarks) and update the Experiments section accordingly to demonstrate generalization. revision: yes

  2. Referee: [Abstract and Experiments] Abstract and Experiments: the quantitative claims (e.g., error reductions to 3.30 and 2.19) are presented without reporting the number of independent search runs, standard deviations, or statistical significance tests, and without explicit controls for post-hoc selection bias in the evolutionary procedure. These omissions directly affect the load-bearing claim that StereoFactory “consistently achieves the best four-benchmark average.”

    Authors: We acknowledge that the current presentation lacks multiple-run statistics, standard deviations, significance tests, and explicit bias controls, which weakens the “consistently” claim. The reported numbers reflect single-run searches performed under computational constraints. In the revised manuscript we will rerun the genetic algorithm and CMA-ES for multiple independent trials (minimum 5 runs), report means and standard deviations in the tables, add statistical significance tests (e.g., paired t-tests against baselines), and include additional controls such as random-subset and greedy baselines to quantify post-hoc selection effects. The abstract and Experiments section will be updated to reflect these results. revision: yes
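The multi-run reporting promised in the second response could be computed along the lines below: a mean and sample standard deviation per method, plus a paired t statistic on per-run differences. The run values are placeholders, not results from the paper. Turning the statistic into a p-value needs the t distribution, which the Python standard library does not provide; a routine such as `scipy.stats.ttest_rel` would be one option.

```python
import math
import statistics
from typing import List, Tuple


def summarize(runs: List[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation over independent search runs."""
    return statistics.mean(runs), statistics.stdev(runs)


def paired_t_statistic(ours: List[float], baseline: List[float]) -> float:
    """t = mean(d) / (stdev(d) / sqrt(n)) for paired differences d = ours - baseline."""
    diffs = [o - b for o, b in zip(ours, baseline)]
    n = len(diffs)
    return statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))


# Placeholder average errors from 5 hypothetical seeds; NOT numbers from the paper.
ours_runs = [3.31, 3.28, 3.35, 3.30, 3.33]
baseline_runs = [3.80, 3.79, 3.82, 3.78, 3.81]

mean_ours, std_ours = summarize(ours_runs)
t_stat = paired_t_statistic(ours_runs, baseline_runs)
print(f"ours: {mean_ours:.3f} +/- {std_ours:.3f}, paired t = {t_stat:.2f} (df = {len(ours_runs) - 1})")
```

Pairing only makes sense if each run of the method and the baseline shares something, such as the seed or the validation split; otherwise an unpaired test is the more defensible choice.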

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper describes an empirical post-hoc optimization procedure (genetic algorithm for subset selection followed by CMA-ES for module routing) applied to existing model checkpoints. Performance numbers are reported as direct outcomes of this optimization on the evaluation benchmarks, but the derivation chain does not reduce any claimed result to its inputs by construction, nor does it rely on self-citations, ansatzes smuggled via prior work, or renaming of known results. The method is self-contained as a standard evolutionary search whose outputs are compared against controlled baselines; no load-bearing step equates the final metric to the search objective by definition.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review prevents identification of specific free parameters or axioms; the approach relies on standard evolutionary computation techniques whose hyperparameters would typically be tuned but are not detailed here.

Xianda Guo is currently a Ph.D. student at the School of Computer Science, Wuhan University. Before that, he received a master's degree from the Dalian University of Technology in 2019. His work is mainly in comp...