arxiv: 2509.09530 · v2 · submitted 2025-09-11 · 💻 cs.CV

DualTrack: Sensorless 3D Ultrasound needs Local and Global Context

Paul F. R. Wilson , Matteo Ronchetti , R\"udiger G\"obl , Viktoria Markova , Sebastian Rosenzweig , Raphael Prevost , Parvin Mousavi , Oliver Zettinig This is my paper

Pith reviewed 2026-05-18 17:43 UTC · model grok-4.3

classification 💻 cs.CV

keywords sensorless 3D ultrasounddual encodertrajectory estimationlocal global featuresdeep learningmedical imagingprobe tracking3D reconstruction

0 comments p. Extension

The pith

DualTrack uses separate encoders for local motion and global anatomy to improve sensorless 3D ultrasound reconstruction accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes that sensorless 3D ultrasound, which reconstructs volumes from sequences of 2D images without tracking hardware, benefits from explicitly separating the extraction of local features like speckle for motion estimation and global features like organ shapes for overall context. Earlier methods either skipped global information or combined it in a single network, which hindered robust modeling of both short-term movements and long-term anatomical positioning. DualTrack introduces a dual-encoder design where one path uses dense convolutions over space and time for detailed local cues, while the other employs a standard image network with attention for high-level and long-range information. These are merged in a simple fusion step to output the 3D probe trajectory. If successful, this leads to more accurate and drift-free reconstructions, as shown by average errors under 5 mm on public benchmarks, making 3D ultrasound more accessible in clinical settings without expensive equipment.

Core claim

The authors establish that a decoupled dual-encoder architecture, with specialized local spatiotemporal processing and global backbone plus temporal attention, followed by lightweight fusion, enables better capture of complementary scales of information, resulting in state-of-the-art trajectory estimation and globally consistent 3D reconstructions from 2D ultrasound sequences.

What carries the argument

The DualTrack dual-encoder architecture that processes local fine-grained features via dense spatiotemporal convolutions and global high-level anatomical features via an image backbone with temporal attention layers, integrated by a lightweight fusion module for trajectory prediction.

If this is right

Reconstruction error averages below 5 mm on large public benchmarks.
Produces globally consistent volumes without significant drift over extended scans.
Outperforms previous sensorless methods that used single or coupled feature streams.
The approach supports plugging in different backbones, including foundation models, for the global encoder.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This design choice may help in other sequential imaging problems where both short-range details and long-range structure are important.
Testing the method on data from different ultrasound machines or body regions could reveal if the decoupling generalizes beyond the training distribution.
If the fusion step proves insufficient for very complex motions, adding more sophisticated integration could further reduce inconsistencies.

Load-bearing premise

The benefits of keeping local and global feature streams separate and fusing them lightly will outweigh the potential advantages of fully joint optimization across both scales.

What would settle it

Run DualTrack and a comparable single-encoder model on a held-out set of long clinical ultrasound sequences and check if the dual version maintains lower cumulative trajectory error and better volume consistency; absence of such improvement would falsify the core benefit of decoupling.

Figures

Figures reproduced from arXiv: 2509.09530 by Matteo Ronchetti, Oliver Zettinig, Parvin Mousavi, Paul F. R. Wilson, Raphael Prevost, R\"udiger G\"obl, Sebastian Rosenzweig, Viktoria Markova.

**Figure 1.** Figure 1: DualTrack enables sensorless 3D ultrasound by estimating the probe trajectory for a sequence of ultrasound images (Top Left). Specialized local and global encoders are designed to extract low- and high-level features–their outputs are combined using self- and cross-attention through time (Right). The encoders are pretrained at their respective timescales before being finetuned with the fusion module (Botto… view at source ↗

**Figure 2.** Figure 2: (a) Trajectories and 3D US images generated with DualTrack predictions vs. ground truth, for three sweeps in the test set with worst/median/best GPE, respectively. (b) Comparison of the out-of-plane displacement prediction for the “local only” module and the full DualTrack model. The local model fails to disambiguate the outof-plane direction and loses track after the first turn (see indicator); DualTrac… view at source ↗

**Figure 3.** Figure 3: (a) Distribution of tracking reconstruction errors across the 72 test scans for DualTrack and competing methods. Our method significantly (p ≤ 0.005) outperforms its leading competitor on all metrics. (b) DualTrack can be successfully used with a variety of image backbones, giving the flexibility to leverage transfer learning strategies. 3-4) likely stem from its dual encoder architecture, which explicitly… view at source ↗

read the original abstract

Three-dimensional ultrasound (US) offers many clinical advantages over conventional 2D imaging, yet its widespread adoption is limited by the cost and complexity of traditional 3D systems. Sensorless 3D US, which uses deep learning to estimate a 3D probe trajectory from a sequence of 2D US images, is a promising alternative. Local features, such as speckle patterns, can help predict frame-to-frame motion, while global features, such as coarse shapes and anatomical structures, can situate the scan relative to anatomy and help predict its general shape. In prior approaches, global features are either ignored or tightly coupled with local feature extraction, restricting the ability to robustly model these two complementary aspects. We propose DualTrack, a novel dual-encoder architecture that leverages decoupled local and global encoders specialized for their respective scales of feature extraction. The local encoder uses dense spatiotemporal convolutions to capture fine-grained features, while the global encoder utilizes an image backbone (e.g., a 2D CNN or foundation model) and temporal attention layers to embed high-level anatomical features and long-range dependencies. A lightweight fusion module then combines these features to estimate the trajectory. Experimental results on a large public benchmark show that DualTrack achieves state-of-the-art accuracy and globally consistent 3D reconstructions, outperforming previous methods and yielding an average reconstruction error below 5 mm.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

DualTrack's split into specialized local dense-conv and global backbone-plus-attention encoders is a reasonable architectural move for sensorless 3D US, but the lightweight fusion step still needs stronger checks for consistency on new data.

read the letter

The main thing to know about this paper is that it introduces a dual-encoder setup for sensorless 3D ultrasound reconstruction, with one path handling local dense spatiotemporal features via convolutions and another handling global context through a backbone and attention. They report beating previous methods with under 5 mm average error on a public dataset. This split is the actual novelty. Most prior work either left out the global anatomical cues or blended local and global features in one network, which limited how well each scale could be optimized. By decoupling them and using a simple fusion at the end, the design lets the local encoder focus on speckle-based motion and the global one on long-range structure. That seems like a practical improvement if the results hold. On the downside, the abstract doesn't show any ablations comparing this to a single encoder or testing the fusion module specifically. There's no mention of error bars or how the dataset was split, so it's tough to gauge how robust the gains are. The stress point is whether the lightweight fusion really keeps trajectories consistent across different patients or scan conditions without the joint training that coupled methods had. If the full paper has good out-of-distribution tests, that would address it; otherwise it's a gap. This kind of work is for folks doing medical imaging research, especially those trying to make 3D ultrasound more accessible without extra hardware. Someone looking for architecture ideas in trajectory estimation from image sequences would find it relevant. The core idea is solid enough and the benchmark claim is specific, so it should go to peer review rather than get desk rejected. Ask for the ablations and some failure case analysis in the reviews.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces DualTrack, a dual-encoder architecture for sensorless 3D ultrasound reconstruction from 2D image sequences. It decouples a local encoder (dense spatiotemporal convolutions for speckle-based frame-to-frame motion) from a global encoder (2D CNN or foundation model backbone plus temporal attention for anatomical context and long-range dependencies), then combines them with a lightweight fusion module to regress the 6-DoF probe trajectory. Experiments on a large public benchmark are reported to yield state-of-the-art accuracy with average reconstruction error below 5 mm and globally consistent 3D volumes, outperforming prior methods.

Significance. If the empirical results hold, the explicit separation of local and global feature streams addresses a clear limitation in earlier tightly coupled designs and could improve robustness for clinical sensorless 3D US. The timely use of foundation models in the global branch is a positive design choice that aligns with broader trends in computer vision.

major comments (3)

[Experiments] Experiments section: the central SOTA claim with sub-5 mm average error is presented without quantitative ablation results isolating the contribution of the decoupled local encoder, global encoder, or the lightweight fusion module. This makes it impossible to verify that the architectural decoupling (rather than other implementation details) drives the reported gains over prior coupled methods.
[Section 4] Section 4 (results): no error bars, standard deviations across sequences, or statistical significance tests are supplied for the reconstruction errors on the public benchmark. This weakens the claim of consistent outperformance and global consistency, particularly given the known variability of clinical ultrasound data.
[Method] Method (fusion module description): the paper provides no explicit analysis or out-of-distribution experiments showing that the lightweight fusion reliably reconciles the two decoupled streams without introducing drift when anatomical context changes. This is load-bearing for the global-consistency claim, as prior coupled architectures avoided this risk by construction.

minor comments (2)

[Abstract] Abstract: the phrase 'a large public benchmark' is used without naming the dataset or supplying basic statistics (number of sequences, patients, or acquisition parameters), which would help readers assess the scope of the evaluation.
[Method] Notation: the parameterization of the 6-DoF trajectory (e.g., whether rotation is represented as quaternions, Euler angles, or axis-angle) is not introduced until late in the method section; an early definition would improve readability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments identify key areas where additional analysis and statistical reporting can strengthen the presentation of DualTrack's contributions. We address each major comment below and outline the revisions we will incorporate.

read point-by-point responses

Referee: [Experiments] Experiments section: the central SOTA claim with sub-5 mm average error is presented without quantitative ablation results isolating the contribution of the decoupled local encoder, global encoder, or the lightweight fusion module. This makes it impossible to verify that the architectural decoupling (rather than other implementation details) drives the reported gains over prior coupled methods.

Authors: We agree that quantitative ablations are necessary to isolate the role of the decoupled encoders and fusion module. We have conducted these experiments on the public benchmark and will add the results to the revised Experiments section, including tables that report performance when each component is removed or replaced with a coupled baseline. revision: yes
Referee: [Section 4] Section 4 (results): no error bars, standard deviations across sequences, or statistical significance tests are supplied for the reconstruction errors on the public benchmark. This weakens the claim of consistent outperformance and global consistency, particularly given the known variability of clinical ultrasound data.

Authors: We acknowledge that the current results lack these statistical details. In the revision we will report per-sequence standard deviations, add error bars to all relevant plots and tables in Section 4, and include paired statistical significance tests against prior methods to support the claims of consistent outperformance. revision: yes
Referee: [Method] Method (fusion module description): the paper provides no explicit analysis or out-of-distribution experiments showing that the lightweight fusion reliably reconciles the two decoupled streams without introducing drift when anatomical context changes. This is load-bearing for the global-consistency claim, as prior coupled architectures avoided this risk by construction.

Authors: We recognize the value of explicit validation for the fusion module. While the benchmark results already demonstrate low drift through globally consistent volumes, we will expand the method description with a dedicated analysis subsection on fusion behavior and add experiments on anatomical-context subsets of the benchmark to further illustrate robustness without introducing new data collection. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical architecture validated on benchmark

full rationale

The paper presents DualTrack as a dual-encoder neural architecture for sensorless 3D ultrasound trajectory estimation. Claims of SOTA accuracy and sub-5mm average error rest on experimental results from a large public benchmark, not on any closed-form derivation, fitted parameter renamed as prediction, or self-citation chain. The abstract describes decoupled local (dense spatiotemporal convolutions) and global (CNN + temporal attention) encoders plus a lightweight fusion module, but performance is reported as measured outcome rather than forced by construction. No equations, uniqueness theorems, or ansatzes are invoked that reduce to the inputs. This is a standard empirical CV contribution with independent external validation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests primarily on the domain assumption that local and global features are complementary and benefit from separate extraction paths; no free parameters or invented entities are stated in the abstract.

axioms (1)

domain assumption Decoupled local and global feature extraction improves modeling of complementary aspects in ultrasound sequences over tightly coupled or single-stream approaches.
The abstract states that prior methods either ignore global features or couple them tightly, restricting robust modeling of both scales.

pith-pipeline@v0.9.0 · 5808 in / 1220 out tokens · 57542 ms · 2026-05-18T17:43:54.210484+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We propose DualTrack, a novel dual-encoder architecture that leverages decoupled local and global encoders specialized for their respective scales of feature extraction. The local encoder uses dense spatiotemporal convolutions... while the global encoder utilizes an image backbone... and temporal attention layers... A lightweight fusion module then combines these features
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

DualTrack achieves state-of-the-art accuracy and globally consistent 3D reconstructions... average reconstruction error below 5 mm

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

19 extracted references · 19 canonical work pages · 1 internal anchor

[1]

In: International conference on medical image computing and computer-assisted intervention

Farshad, A., Yeganeh, Y., Gehlbach, P., Navab, N.: Y-net: A spatiospectral dual- encoder network for medical image segmentation. In: International conference on medical image computing and computer-assisted intervention. pp. 582–592. Springer (2022)

work page 2022
[2]

Computers in Biology and Medicine150, 106197 (2022)

Fu, Z., Li, J., Hua, Z.: Deau-net: Attention networks based on dual encoder for medical image segmentation. Computers in Biology and Medicine150, 106197 (2022)

work page 2022
[3]

IEEE Transactions on Biomed- ical Engineering70(3), 970–979 (2023).https://doi.org/10.1109/TBME.2022

Guo, H., Chao, H., Xu, S., Wood, B.J., Wang, J., Yan, P.: Ultrasound volume re- construction from freehand scans without tracking. IEEE Transactions on Biomed- ical Engineering70(3), 970–979 (2023).https://doi.org/10.1109/TBME.2022. 3206596

work page doi:10.1109/tbme.2022 2023
[4]

In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2020: 23rd International Conference, Lima, Peru, October 4–8, 2020, Proceedings, Part III 23

Guo, H., Xu, S., Wood, B., Yan, P.: Sensorless freehand 3d ultrasound reconstruc- tion via deep contextual learning. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2020: 23rd International Conference, Lima, Peru, October 4–8, 2020, Proceedings, Part III 23. pp. 463–472. Springer (2020)

work page 2020
[5]

Medical & biological engineering & computing61(3), 661–671 (2023)

Hong, Z., Chen, M., Hu, W., Yan, S., Qu, A., Chen, L., Chen, J.: Dual encoder network with transformer-cnn for multi-organ segmentation. Medical & biological engineering & computing61(3), 661–671 (2023)

work page 2023
[6]

Medical Image Analysis 96, 103202 (2024)

Jiao,J.,Zhou,J.,Li,X.,Xia,M.,Huang,Y.,Huang,L.,Wang,N.,Zhang,X.,Zhou, S., Wang, Y., et al.: Usfm: A universal ultrasound foundation model generalized to tasks and organs towards label efficient image analysis. Medical Image Analysis 96, 103202 (2024)

work page 2024
[7]

In: International conference on machine learning

Li, J., Li, D., Savarese, S., Hoi, S.: Blip-2: Bootstrapping language-image pre- training with frozen image encoders and large language models. In: International conference on machine learning. pp. 19730–19742. PMLR (2023)

work page 2023
[8]

In: 27th International Con- ference on Medical Image Computing and Computer Assisted Intervention (MIC- CAI 2024)

Li, Q., Saeed, S.U., Barratt, D.C., Clarkson, M.J., Vercauteren, T., Hu, Y.: Track- erless 3d freehand ultrasound reconstruction challenge. In: 27th International Con- ference on Medical Image Computing and Computer Assisted Intervention (MIC- CAI 2024). Zenodo (2024).https://doi.org/10.5281/zenodo.10991501,https: //doi.org/10.5281/zenodo.10991501

work page doi:10.5281/zenodo.10991501 2024
[9]

In: 2023 IEEE 20th International Symposium on Biomedical Imaging (ISBI)

Li, Q., Shen, Z., Li, Q., Barratt, D.C., Dowrick, T., Clarkson, M.J., Vercauteren, T., Hu, Y.: Trackerless freehand ultrasound with sequence modelling and auxiliary transformation over past and future frames. In: 2023 IEEE 20th International Symposium on Biomedical Imaging (ISBI). pp. 1–5. IEEE (2023) 10 Authors Suppressed Due to Excessive Length

work page 2023
[10]

In: International Conference on Medical Image Computing and Computer-Assisted Intervention

Li, Q., Shen, Z., Yang, Q., Barratt, D.C., Clarkson, M.J., Vercauteren, T., Hu, Y.: Nonrigid reconstruction of freehand ultrasound without a tracker. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 689–699. Springer (2024)

work page 2024
[11]

Advances in neural information processing systems32(2019)

Lu,J.,Batra,D.,Parikh,D.,Lee,S.:Vilbert:Pretrainingtask-agnosticvisiolinguis- tic representations for vision-and-language tasks. Advances in neural information processing systems32(2019)

work page 2019
[12]

Luo, M., Yang, X., Huang, X., Huang, Y., Zou, Y., Hu, X., Ravikumar, N., Frangi, A.F., Ni, D.: Self context and shape prior for sensorless freehand 3d ultrasound reconstruction. In: Medical Image Computing and Computer Assisted Interven- tion – MICCAI 2021: 24th International Conference, Strasbourg, France, Septem- ber 27–October 1, 2021, Proceedings, Par...

work page doi:10.1007/978-3-030-87231-1_20 2021
[13]

In: International Conference on Medical Image Com- puting and Computer-Assisted Intervention

Luo, M., Yang, X., Wang, H., Du, L., Ni, D.: Deep motion network for freehand 3d ultrasound reconstruction. In: International Conference on Medical Image Com- puting and Computer-Assisted Intervention. pp. 290–299. Springer (2022)

work page 2022
[14]

Nature Communications15(1), 654 (2024)

Ma, J., He, Y., Li, F., Han, L., You, C., Wang, B.: Segment anything in medical images. Nature Communications15(1), 654 (2024)

work page 2024
[15]

In: 2022 IEEE 19th International Symposium on Biomedical Imaging (ISBI)

Ning, G., Liang, H., Zhou, L., Zhang, X., Liao, H.: Spatial position estimation method for 3d ultrasound reconstruction based on hybrid transfomers. In: 2022 IEEE 19th International Symposium on Biomedical Imaging (ISBI). pp. 1–5. IEEE (2022)

work page 2022
[16]

Medical image analysis48, 187–202 (2018)

Prevost, R., Salehi, M., Jagoda, S., Kumar, N., Sprung, J., Ladikos, A., Bauer, R., Zettinig, O., Wein, W.: 3d freehand ultrasound without external tracking using deep learning. Medical image analysis48, 187–202 (2018)

work page 2018
[17]

Advances in neural information pro- cessing systems30(2017)

Vaswani,A.,Shazeer,N.,Parmar,N.,Uszkoreit,J.,Jones,L.,Gomez,A.N.,Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information pro- cessing systems30(2017)

work page 2017
[18]

In: Medical image computing and computer assisted intervention–MICCAI 2021: 24th international conference, Strasbourg, France, September 27–October 1, 2021, proceedings, Part I 24

Zhang, Y., Liu, H., Hu, Q.: Transfuse: Fusing transformers and cnns for med- ical image segmentation. In: Medical image computing and computer assisted intervention–MICCAI 2021: 24th international conference, Strasbourg, France, September 27–October 1, 2021, proceedings, Part I 24. pp. 14–24. Springer (2021)

work page 2021
[19]

iBOT: Image BERT Pre-Training with Online Tokenizer

Zhou, J., Wei, C., Wang, H., Shen, W., Xie, C., Yuille, A., Kong, T.: ibot: Image bert pre-training with online tokenizer. arXiv preprint arXiv:2111.07832 (2021)

work page internal anchor Pith review Pith/arXiv arXiv 2021