Recognition: unknown
Unified Multimodal Visual Tracking with Dual Mixture-of-Experts
Pith reviewed 2026-05-07 17:44 UTC · model grok-4.3
The pith
OneTrackerV2 uses a shared architecture and dual mixture-of-experts to unify RGB and RGB-plus-X tracking into a single end-to-end trained model.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
OneTrackerV2 introduces a Meta Merger that projects arbitrary modalities into a unified representation space for flexible fusion, paired with Dual Mixture-of-Experts where T-MoE handles spatio-temporal relations and M-MoE manages multi-modal knowledge to reduce feature conflicts. This shared architecture with unified parameters and single end-to-end training delivers state-of-the-art results across five RGB and RGB+X tracking tasks on twelve benchmarks, preserves performance after model compression, and maintains robustness when modalities are missing.
What carries the argument
Dual Mixture-of-Experts (T-MoE for spatio-temporal tracking relations and M-MoE for disentangling cross-modal dependencies) together with the Meta Merger that embeds inputs into a single space.
If this is right
- A single set of parameters suffices for RGB, RGB-D, RGB-T, RGB-E, and RGB-D-T tracking without per-task retraining or pretrained adapters.
- Inference speed remains high because no modality-specific branches or extra fusion modules are added at test time.
- Performance after quantization or pruning stays close to the uncompressed model.
- The network continues to produce usable tracks when one sensor fails or is unavailable.
Where Pith is reading between the lines
- Deployment in real environments could use one model instead of maintaining separate trackers for each sensor suite.
- Adding a new sensor type would require only retraining the merger and experts rather than redesigning the entire pipeline.
- The separation of temporal and modal factors may transfer to other multimodal video tasks such as action recognition or video prediction.
Load-bearing premise
The Meta Merger maps any input modality into the shared space without discarding information needed for accurate tracking, and the two experts separate factors so the model generalizes beyond the training modality combinations.
What would settle it
Measure whether accuracy falls sharply on a held-out modality combination, such as RGB plus both depth and event data, that was never seen during the single end-to-end training run.
Figures
read the original abstract
Multimodal visual object tracking can be divided into to several kinds of tasks (e.g. RGB and RGB+X tracking), based on the input modality. Existing methods often train separate models for each modality or rely on pretrained models to adapt to new modalities, which limits efficiency, scalability, and usability. Thus, we introduce OneTrackerV2, a unified multi-modal tracking framework that enables end-to-end training for any modality. We propose Meta Merger to embed multi-modal information into a unified space, allowing flexible modality fusion and robustness. We further introduce Dual Mixture-of-Experts (DMoE): T-MoE models spatio-temporal relations for tracking, while M-MoE embeds multi-modal knowledge, disentangling cross-modal dependencies and reducing feature conflicts. With a shared architecture, unified parameters, and a single end-to-end training, OneTrackerV2 achieves state-of-the-art performance across five RGB and RGB+X tracking tasks and 12 benchmarks, while maintaining high inference efficiency. Notably, even after model compression, OneTrackerV2 retains strong performance. Moreover, OneTrackerV2 demonstrates remarkable robustness under modality-missing scenarios.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces OneTrackerV2, a unified multimodal visual tracking framework. It proposes a Meta Merger module to embed multi-modal inputs (RGB and RGB+X) into a shared representation space and a Dual Mixture-of-Experts (DMoE) architecture with T-MoE handling spatio-temporal relations for tracking and M-MoE embedding multi-modal knowledge to disentangle cross-modal dependencies. Using a single shared architecture, unified parameters, and end-to-end training, the model claims state-of-the-art performance across five RGB and RGB+X tracking tasks on 12 benchmarks, while preserving high inference efficiency, robustness after model compression, and strong results under modality-missing conditions.
Significance. If the empirical results and generalization claims hold under rigorous validation, the work would offer a meaningful advance in scalable multimodal tracking by eliminating the need for modality-specific models or heavy reliance on pretrained adapters. The combination of unified training, efficiency, and explicit robustness to missing modalities could have practical impact in real-world deployment scenarios where input modalities vary or are incomplete.
major comments (3)
- [Abstract and §4] Abstract and §4 (Experiments): The central SOTA claim across five tasks and 12 benchmarks is stated without accompanying quantitative metrics, baseline comparisons, statistical significance tests, or ablation studies in the provided abstract; the experimental section must supply these details with error bars and per-benchmark breakdowns to substantiate the unified performance advantage.
- [§3.1] §3.1 (Meta Merger): The mechanism by which Meta Merger projects arbitrary modalities into a unified space without critical information loss is load-bearing for the robustness and unification claims, yet the manuscript provides no ablations on information preservation, reconstruction error, or zero-shot performance on unseen modality combinations beyond the reported benchmarks.
- [§3.2] §3.2 (Dual MoE): The assertion that T-MoE and M-MoE successfully disentangle spatio-temporal versus cross-modal factors in a manner that generalizes is central to the single-model advantage, but lacks targeted experiments, feature visualizations, or controlled tests demonstrating factor separation and generalization to novel modality mixes or real-world conditions.
minor comments (2)
- [Introduction] The five specific tracking tasks should be enumerated explicitly with their corresponding benchmarks in the introduction or experimental setup to improve readability and reproducibility.
- Figure captions and architecture diagrams would benefit from clearer labeling of data flow between Meta Merger and the two MoE modules to aid understanding of the unified pipeline.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript introducing OneTrackerV2. We address each major comment point by point below, providing clarifications from the full experimental results and committing to targeted revisions that strengthen the presentation without altering the core claims.
read point-by-point responses
-
Referee: [Abstract and §4] Abstract and §4 (Experiments): The central SOTA claim across five tasks and 12 benchmarks is stated without accompanying quantitative metrics, baseline comparisons, statistical significance tests, or ablation studies in the provided abstract; the experimental section must supply these details with error bars and per-benchmark breakdowns to substantiate the unified performance advantage.
Authors: The abstract serves as a high-level summary and conventionally omits specific numerical values. Section 4 of the manuscript already contains extensive quantitative tables reporting performance metrics, direct comparisons against baselines, and per-benchmark breakdowns across all 12 benchmarks for the five tasks, along with ablation studies on the proposed components. To further address the request for rigor, we will add error bars derived from multiple independent runs and statistical significance tests (e.g., paired t-tests) in the revised experimental section. revision: yes
-
Referee: [§3.1] §3.1 (Meta Merger): The mechanism by which Meta Merger projects arbitrary modalities into a unified space without critical information loss is load-bearing for the robustness and unification claims, yet the manuscript provides no ablations on information preservation, reconstruction error, or zero-shot performance on unseen modality combinations beyond the reported benchmarks.
Authors: The Meta Merger employs meta-embeddings and adaptive projection layers to map diverse modalities into a shared representation while preserving task-relevant features, as evidenced by the model's strong performance under modality-missing conditions. We acknowledge that explicit ablations quantifying information preservation (e.g., reconstruction error metrics) and zero-shot evaluation on entirely novel modality combinations would provide additional support. We will incorporate these targeted experiments and analyses into the revised manuscript. revision: yes
-
Referee: [§3.2] §3.2 (Dual MoE): The assertion that T-MoE and M-MoE successfully disentangle spatio-temporal versus cross-modal factors in a manner that generalizes is central to the single-model advantage, but lacks targeted experiments, feature visualizations, or controlled tests demonstrating factor separation and generalization to novel modality mixes or real-world conditions.
Authors: The Dual MoE design explicitly separates spatio-temporal modeling (T-MoE) from cross-modal knowledge embedding (M-MoE) to reduce feature conflicts, with generalization supported by unified training results across diverse benchmarks. To make the factor disentanglement more explicit, we will add feature visualizations (such as expert activation maps) and controlled ablation studies evaluating performance on novel modality combinations and simulated real-world perturbations in the revised version. revision: yes
Circularity Check
No circularity: empirical architecture proposal with benchmark results, no derivations or self-referential predictions.
full rationale
The paper introduces OneTrackerV2 via Meta Merger for modality embedding and Dual MoE (T-MoE/M-MoE) for factor disentanglement, then reports end-to-end training results on 12 benchmarks across five RGB/RGB+X tasks. No equations, first-principles derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the abstract or description. Performance claims are externally falsifiable on public benchmarks and do not reduce to the inputs by construction. This is the standard non-circular case for an empirical CV architecture paper.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Chen, X., Kang, B., Zhu, J., Wang, D., Peng, H., and Lu, H. Unified sequence-to-sequence learning for single- and multi-modal visual object tracking.arXiv preprint arXiv:2304.14394, 2023a. Chen, X., Peng, H., Wang, D., Lu, H., and Hu, H. Seqtrack: Sequence to sequence learning for visual object tracking. InProceedings of the IEEE/CVF conference on compute...
-
[2]
DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models
Dai, D., Deng, C., Zhao, C., Xu, R., Gao, H., Chen, D., Li, J., Zeng, W., Yu, X., Wu, Y ., et al. Deepseekmoe: Towards ultimate expert specialization in mixture-of-experts lan- guage models.arXiv preprint arXiv:2401.06066,
work page internal anchor Pith review arXiv
-
[3]
Feng, X., Zhang, D., Hu, S., Li, X., Wu, M., Zhang, J., Chen, X., and Huang, K. Cstrack: Enhancing rgb-x track- ing via compact spatiotemporal features.arXiv preprint arXiv:2505.19434,
-
[4]
Hong, L., Li, J., Zhou, X., Yan, S., Guo, P., Jiang, K., Chen, Z., Gao, S., Li, R., Sheng, X., et al. General compression framework for efficient transformer object tracking.arXiv preprint arXiv:2409.17564, 2024a. Hong, L., Yan, S., Zhang, R., Li, W., Zhou, X., Guo, P., Jiang, K., Chen, Y ., Li, J., Chen, Z., et al. Onetracker: Unifying visual object trac...
-
[5]
Decoupled Weight Decay Regularization
Loshchilov, I. and Hutter, F. Decoupled weight decay regu- larization.arXiv preprint arXiv:1711.05101,
work page internal anchor Pith review arXiv
-
[6]
SEATrack: Simple, Efficient, and Adaptive Multimodal Tracker
Su, J., Xue, Z., Zhang, S., Chen, K., Hu, W., and Zhang, Z. Seatrack: Simple, efficient, and adaptive multimodal tracker.arXiv preprint arXiv:2604.12502,
work page internal anchor Pith review Pith/arXiv arXiv
-
[7]
P., Van Gool, L., and Timofte, R
Tan, Y ., Wu, Z., Fu, Y ., Zhou, Z., Sun, G., Zamfi, E., Ma, C., Paudel, D. P., Van Gool, L., and Timofte, R. Xtrack: Multimodal training boosts rgb-x video object trackers. arXiv preprint arXiv:2405.17773,
-
[8]
Tan, Y ., Shao, J., Zamfir, E., Li, R., An, Z., Ma, C., Paudel, D., Van Gool, L., Timofte, R., and Wu, Z. What you have is what you track: Adaptive and robust multimodal tracking.arXiv preprint arXiv:2507.05899,
-
[9]
Visevent: Reliable object tracking via collaboration of frame and event flows.IEEE Transactions on Cybernetics, 54(3):1997–2010,
Wang, X., Li, J., Zhu, L., Zhang, Z., Chen, Z., Li, X., Wang, Y ., Tian, Y ., and Wu, F. Visevent: Reliable object tracking via collaboration of frame and event flows.IEEE Transactions on Cybernetics, 54(3):1997–2010,
1997
-
[10]
Yan, B., Peng, H., Fu, J., Wang, D., and Lu, H. Learning spatio-temporal transformer for visual tracking. InPro- ceedings of the IEEE/CVF international conference on computer vision, pp. 10448–10457, 2021a. Yan, S., Yang, J., K¨apyl¨a, J., Zheng, F., Leonardis, A., and K¨am¨ar¨ainen, J.-K. Depthtrack: Unveiling the power of rgbd tracking. InProceedings of...
work page internal anchor Pith review arXiv
-
[11]
Zhang, X., Tian, Y ., Huang, W., Ye, Q., Dai, Q., Xie, L., and Tian, Q. Hivit: Hierarchical vision trans- former meets masked image modeling.arXiv preprint arXiv:2205.14949,
-
[12]
Zhou, X., Li, J., Hong, L., Jiang, K., Guo, P., Ge, W., and Zhang, W. Detrack: In-model latent denoising learning for visual object tracking.arXiv preprint arXiv:2501.02467, 2025a. Zhou, X., Pan, T., Hong, L., Guo, P., Guo, H., Chen, Z., Jiang, K., and Zhang, W. Dynamic semantic-aware correlation modeling for uav tracking.arXiv preprint arXiv:2510.21351, ...
-
[13]
Generalization of Meta Merger We further validate the generalization potential of the Meta Merger by testing on anunseenmodality excluded from the training phase
12 Unified Multimodal Visual Tracking with Dual Mixture-of-Experts A. Generalization of Meta Merger We further validate the generalization potential of the Meta Merger by testing on anunseenmodality excluded from the training phase. Using the CMOTB dataset (Liu et al., 2024), which consists of Near-Infrared (NIR) signals, we observed thatOneTrackerV2achie...
2024
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.