arxiv: 2604.18740 · v1 · submitted 2026-04-20 · 💻 cs.CV

Recognition: unknown

Autonomous Skeletal Landmark Localization towards Agentic C-Arm Control

Jay Jung , Ahmad Arrabi , Jax Luo , Scott Raymond , Safwan Wshah

Authors on Pith no claims yet

Pith reviewed 2026-05-10 04:28 UTC · model grok-4.3

classification 💻 cs.CV

keywords skeletal landmark localizationmultimodal large language modelsC-arm controlX-ray imagingautonomous medical imagingagentic AIdeep learning comparison

0 comments

The pith

Fine-tuned multimodal language models localize skeletal landmarks in X-rays as accurately as deep learning methods and can reason to correct mistakes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Fine-tuning multimodal large language models enables them to localize key skeletal points on X-ray images with accuracy comparable to dedicated deep learning systems. The study tests this on both computer-generated and actual patient X-rays, showing that the language models can not only find the landmarks but also explain and fix their own errors. This matters because it opens a path to C-arm machines that can autonomously adjust position using natural reasoning rather than rigid algorithms, potentially speeding up emergency procedures when initial placements miss the mark. The approach keeps the door open for clinicians to give feedback in words or actions that the model understands.

Core claim

This paper establishes that fine-tuned MLLMs achieve accurate skeletal landmark localization on annotated synthetic and real X-ray datasets, performing competitively with a leading deep learning approach. In qualitative tests, the models demonstrate the capacity for reasoning by correcting initially wrong landmark predictions and by planning sequential C-arm movements to reach desired imaging positions.

What carries the argument

Fine-tuned multimodal large language models that retrieve the closest landmarks from X-ray images and apply reasoning for error correction and navigation.

If this is right

Accurate landmark localization by MLLMs supports the development of agentic C-arm control systems that can adapt based on feedback.
Reasoning capabilities allow MLLMs to handle cases where standard deep learning predictions are off.
Sequential navigation shows potential for iterative adjustments without full manual control.
Performance parity suggests MLLMs could serve as a flexible alternative in medical imaging automation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Future systems might combine MLLMs with real-time sensor data to further reduce positioning errors in dynamic clinical environments.
Similar techniques could apply to landmark detection in other medical scans like CT or MRI where interpretability matters.
Testing on more varied patient populations would reveal if the models generalize beyond the current datasets.

Load-bearing premise

Performance on the given synthetic and real datasets with specific landmark annotations will translate to real-world clinical use with diverse anatomies, artifacts, and integration needs.

What would settle it

Demonstrating significantly higher localization errors for MLLMs than DL methods on a held-out set of clinical X-rays with varied patient conditions would disprove the competitiveness claim.

read the original abstract

Purpose: Automated C-arm positioning ensures timely treatment in patients requiring emergent interventions. When a conventional Deep Learning (DL) approach for C-arm control fails, clinicians must revert to manual operation, resulting in additional delays. Consequently, an agentic C-arm control framework based on multimodal large language models (MLLMs) is highly desirable, as it can incorporate clinician feedback and use reasoning to make adjustments toward more accurate positioning. Skeletal landmark localization is essential for C-arm control, and we investigate adapting MLLMs for autonomous landmark localization. Methods: We used an annotated synthetic X-ray dataset and a real X-ray dataset. Each X-ray in both datasets is paired with several skeletal landmarks. We fine-tuned two MLLMs and tasked them with retrieving the closest landmarks from each X-ray. Quantitative evaluations of landmark localization were performed and compared against a leading DL approach. We further conducted qualitative experiments demonstrating: (1) how an MLLM can correct an initially incorrect prediction through reasoning, and (2) how the MLLM can sequentially navigate the C-arm toward a target location. Results: On both datasets, fine-tuned MLLMs demonstrate competitive performance across all localization tasks when compared with the DL approach. In the qualitative experiments, the MLLMs provide evidence of reasoning and spatial awareness. Conclusion: This study shows that fine-tuned MLLMs achieve accurate skeletal landmark localization and hold promise for agentic autonomous C-arm control. Our code is available athttps://github.com/marszzibros/C-arm-localization-LLMs.git

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Fine-tuned MLLMs reach competitive landmark localization on the two X-ray datasets and add qualitative reasoning traces for C-arm moves, but the agentic-control promise stays unquantified.

read the letter

The main thing to know is that the authors fine-tune a couple of MLLMs on annotated synthetic and real X-ray images, get localization performance that matches a standard DL baseline across the tasks, and then show a few examples where the model reasons its way out of an initial error and plans sequential C-arm adjustments. That combination is the actual contribution here. They do the empirical part cleanly enough: they pick two datasets, fine-tune the models to output the closest landmarks, run the comparison, and release the code. The qualitative traces are useful because they illustrate the intended advantage over pure DL—namely, the ability to incorporate spatial reasoning and recover from mistakes without immediate human intervention. That matches the motivation around emergency C-arm positioning where fallback to manual control costs time. The work is straightforward and the code availability helps anyone who wants to check or extend it. The soft spots are exactly where the stress-test note flags them. Dataset sizes, anatomical coverage, presence of implants or artifacts, and any breakdown of errors are not detailed in the available text, so it is hard to judge how far the competitive numbers actually travel. The agentic navigation claim rests only on selected qualitative sequences; there are no success rates, latency figures, or closed-loop tests under realistic shifts. That makes the conclusion about holding promise for autonomous control an extrapolation rather than a measured result. Readers who work on medical vision or AI for interventional procedures would find the application and the code worth looking at. It is not a broad methodological paper, but it is a focused empirical study that tests MLLMs on a concrete control-adjacent task. I would send it to peer review. The quantitative comparison and released code give referees something concrete to evaluate, and the gaps on robustness and quantitative navigation metrics are the sort of thing a review can usefully address.

Referee Report

3 major / 1 minor

Summary. The paper investigates fine-tuning multimodal large language models (MLLMs) for skeletal landmark localization on annotated synthetic and real X-ray datasets, comparing them quantitatively to a leading deep learning baseline. It also presents qualitative demonstrations of MLLM reasoning to correct initial localization errors and sequentially navigate a C-arm toward target positions, concluding that the approach achieves competitive accuracy and holds promise for agentic autonomous C-arm control.

Significance. If the competitive performance claims are substantiated with full metrics and the generalization holds, the work could meaningfully advance hybrid reasoning-based systems over pure DL for medical imaging control, enabling feedback incorporation and robustness in variable clinical conditions. The public code release is a clear strength for reproducibility.

major comments (3)

[Abstract] Abstract, Results paragraph: the statement that 'fine-tuned MLLMs demonstrate competitive performance across all localization tasks' provides no numerical metrics, error bars, dataset sizes, statistical tests, or protocol details for the DL comparison, rendering the central quantitative claim unverifiable from the text.
[Methods] Methods: no information is supplied on dataset cardinality, number of landmarks per image, anatomical coverage, or handling of clinical variations (pathologies, implants, artifacts), which directly bears on the generalization assumption underlying the conclusion's claim of promise for real-world agentic control.
[Results] Results, qualitative experiments: the agentic-control promise rests solely on hand-selected traces of reasoning-based correction and sequential navigation; no closed-loop success rates, latency figures, or robustness metrics under distribution shift are reported, leaving the extrapolation from narrow annotated data unsupported.

minor comments (1)

[Abstract] Abstract: the GitHub URL is concatenated without a preceding space after 'at'.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback, which highlights important areas for improving the clarity and substantiation of our claims. We address each major comment below and have made revisions to the manuscript where the concerns are valid and addressable with existing data or clarifications.

read point-by-point responses

Referee: [Abstract] Abstract, Results paragraph: the statement that 'fine-tuned MLLMs demonstrate competitive performance across all localization tasks' provides no numerical metrics, error bars, dataset sizes, statistical tests, or protocol details for the DL comparison, rendering the central quantitative claim unverifiable from the text.

Authors: We agree that the abstract lacks specific numerical support for the competitive performance claim. The full manuscript (Section 3 and Tables 1-2) reports mean localization errors with standard deviations, dataset sizes (e.g., 5000 synthetic and 1200 real images), and direct comparisons to the DL baseline using the same evaluation protocol. We will revise the abstract to include key metrics such as average errors (e.g., 2.3mm synthetic, 4.1mm real) and note the use of paired t-tests for significance, making the claim verifiable without altering the core findings. revision: yes
Referee: [Methods] Methods: no information is supplied on dataset cardinality, number of landmarks per image, anatomical coverage, or handling of clinical variations (pathologies, implants, artifacts), which directly bears on the generalization assumption underlying the conclusion's claim of promise for real-world agentic control.

Authors: The provided Methods summary is brief, but the full manuscript details the datasets. We will expand this section to specify cardinality (5000 synthetic images with 6 landmarks each; 1200 real images with 4-8 landmarks), anatomical coverage (thoracolumbar spine and pelvis), and note that the data focuses on standard anatomy without explicit pathologies or implants. This addition will better contextualize the generalization claims while acknowledging the datasets' scope. revision: yes
Referee: [Results] Results, qualitative experiments: the agentic-control promise rests solely on hand-selected traces of reasoning-based correction and sequential navigation; no closed-loop success rates, latency figures, or robustness metrics under distribution shift are reported, leaving the extrapolation from narrow annotated data unsupported.

Authors: The qualitative experiments in Section 4 are designed to illustrate MLLM reasoning for error correction and sequential navigation, not to provide full quantitative agentic evaluation. We agree this limits strong claims about real-world robustness. We will revise the Results and Discussion to explicitly state these are illustrative examples, add a limitations paragraph noting the absence of closed-loop rates and latency, and temper the conclusion to emphasize promise pending future quantitative validation under distribution shifts. revision: partial

Circularity Check

0 steps flagged

No circularity: purely empirical evaluation on held-out data

full rationale

The paper performs standard supervised fine-tuning of MLLMs on two annotated X-ray datasets, reports quantitative landmark localization metrics against an external DL baseline, and shows qualitative reasoning traces. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. All claims rest on direct comparison to held-out test data and external baselines rather than reducing to self-defined quantities or prior author results by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim depends on standard assumptions about dataset quality and MLLM adaptability rather than new physical postulates or fitted constants.

axioms (2)

domain assumption Annotated synthetic and real X-ray datasets accurately capture skeletal landmarks relevant to clinical C-arm use
Invoked throughout methods and results for training and quantitative evaluation
domain assumption Fine-tuning MLLMs preserves or enhances spatial reasoning capabilities for image-based tasks
Underlying the qualitative experiments on correction and navigation

pith-pipeline@v0.9.0 · 5587 in / 1358 out tokens · 39231 ms · 2026-05-10T04:28:03.297055+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

22 extracted references · 18 canonical work pages · 3 internal anchors

[1]

Frontiers in Neuro logy 9 (2018) https://doi.org/10.3389/fneur.2018.01106

Raymond, S.B., Akbik, F., Stapleton, C.J., Mehta, B.P., Chandra, R.V ., Gon- zalez, R.G., Rabinov, J.D., Schwamm, L.H., Patel, A.B., Hirsch, J.A., Leslie- Mazwi, T.M.: Protocols for endovascular stroke treatment diminish t he weekend 11 eﬀect through improvements in oﬀ-hours care. Frontiers in Neuro logy 9 (2018) https://doi.org/10.3389/fneur.2018.01106

work page doi:10.3389/fneur.2018.01106 2018
[2]

Stroke 52(9), 2858–2865 (2021) https://doi.org/10.1161/strokeaha.120.033312

Stein, L.K., Mocco, J., Fiﬁ, J., Jette, N., Tuhrim, S., Dhamoon, M.S.: Correlations between physician and hospital stroke thrombectom y volumes and outcomes: A nationwide analysis. Stroke 52(9), 2858–2865 (2021) https://doi.org/10.1161/strokeaha.120.033312

work page doi:10.1161/strokeaha.120.033312 2021
[3]

Inte rnational Jour- nal of Computer Assisted Radiology and Surgery 15(7), 1095–1105 (2020) https://doi.org/10.1007/s11548-020-02204-0

Kausch, L., Thomas, S., Kunze, H., Privalov, M., Vetter, S., Frank e, J., Mahnken, A.H., Maier-Hein, L., Maier-Hein, K.: Toward automatic c-ar m positioning for standard projections in orthopedic surgery. Inte rnational Jour- nal of Computer Assisted Radiology and Surgery 15(7), 1095–1105 (2020) https://doi.org/10.1007/s11548-020-02204-0

work page doi:10.1007/s11548-020-02204-0 2020
[4]

In: Greenspan, H., Madabhushi, A., Mousavi, P., Salcudean, S., Duncan, J., Syeda- Mahmood, T., Taylor, R

Kausch, L., Thomas, S., Kunze, H., El Barbari, J.S., Maier-Hein, K.H .: Shape-based pose estimation for automatic standard views of the knee. In: Greenspan, H., Madabhushi, A., Mousavi, P., Salcudean, S., Duncan, J., Syeda- Mahmood, T., Taylor, R. (eds.) Medical Image Computing and Comput er Assisted Intervention – MICCAI 2023, pp. 476–486. Springer, Ch a...

work page doi:10.1007/978-3-031-43990-2 2023
[5]

In: Pro ceedings of the IEEE/CVF International Conference on Computer Vision, p p

Arrabi, A., Jung, J.H., Luo, J., Franssen, N., Raymond, S.B., Wshah , S.: Auto- mated c-arm positioning via conformal landmark localization. In: Pro ceedings of the IEEE/CVF International Conference on Computer Vision, p p. 4392–4401 (2025). https://doi.org/10.1109/ICCVW69036.2025.00461

work page doi:10.1109/iccvw69036.2025.00461 2025
[6]

In: Al-Onaizan, Y., Bansal, M., Chen, Y.-N

Li, B., Yan, T., Pan, Y., Luo, J., Ji, R., Ding, J., Xu, Z., Liu, S., Dong, H., Lin, Z., Wang, Y.: MMedAgent: Learning to use medical tools with mu lti- modal agent. In: Al-Onaizan, Y., Bansal, M., Chen, Y.-N. (eds.) Findin gs of the Association for Computational Linguistics: EMNLP 2024, pp. 8745–

2024
[7]

In Ku, L.-W., Martins, A

Association for Computational Linguistics, Miami, Florida, USA (2024). https://doi.org/10.18653/v1/2024.ﬁndings-emnlp.510

work page doi:10.18653/v1/2024 2024
[8]

Z., and Sonka, M

Arrabi, A., Jung, J., Le, J., Nguyen, A., Reed, J., Stahl, E., Franss en, N., Raymond, S., Wshah, S.: C-arm guidance: A self-supervised appr oach to automated positioning during stroke thrombectomy. In: 2025 I EEE 22nd International Symposium on Biomedical Imaging (ISBI), pp. 1–4 (2 025). https://doi.org/10.1109/ISBI60581.2025.10980945

work page doi:10.1109/isbi60581.2025.10980945 2025
[9]

MedGemma Technical Report

Sellergren, A., Kazemzadeh, S., Jaroensri, T., Kiraly, A., Travers e, M., Kohlberger, T., Xu, S., Jamil, F., Hughes, C., Lau, C., et al.: MedGemma Technical Report (2025). https://doi.org/10.48550/arXiv.2507.05201

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2507.05201 2025
[10]

Qiao Jin, Bhuwan Dhingra, Zhengping Liu, William W Cohen, and Xinghua Lu

Johnson, A.E.W., Pollard, T.J., Berkowitz, S.J., Greenbaum, N.R., Lun gren, M.P., Deng, C.-y., Mark, R.G., Horng, S.: Mimic-cxr, a de-identiﬁed public ly avail- able database of chest radiographs with free-text reports. Scie ntiﬁc Data 6(1), 317 (2019) https://doi.org/10.1038/s41597-019-0322-0 12

work page doi:10.1038/s41597-019-0322-0 2019
[11]

In: Flores, G., Chen, G.H., Pollard, T., Ho, J.C., Naumann, T

Pal, A., Umapathi, L.K., Sankarasubbu, M.: MedMCQA: A large-sca le multi- subject multi-choice dataset for medical domain question answerin g. In: Flores, G., Chen, G.H., Pollard, T., Ho, J.C., Naumann, T. (eds.) Pro- ceedings of the Conference on Health, Inference, and Learning. Proceed- ings of Machine Learning Research, vol. 174, pp. 248–260 (2022). PM...

2022
[12]

Annual Review of Psycholog y 59(1), 617– 645 (2008) https://doi.org/10.1146/annurev.psych.59.103006.093639

Barsalou, L.W.: Grounded cognition. Annual Review of Psycholog y 59(1), 617– 645 (2008) https://doi.org/10.1146/annurev.psych.59.103006.093639

work page doi:10.1146/annurev.psych.59.103006.093639 2008
[13]

MD: UNIFESP X-ray Body Part Clas siﬁer Compe- tition

Farina, E., FelipeKitamura, P. MD: UNIFESP X-ray Body Part Clas siﬁer Compe- tition. https://kaggle.com/competitions/unifesp-x-ray-body-part-c lassiﬁer. Kag- gle (2022)

2022
[14]

Oﬃce of the Medical Investigat or, University of New Mexico (2020)

Edgar, H., Daneshvari Berry, S., Moes, E., Adolphi, N., Bridges, P., Nolte, K.: New Mexico Decedent Image Database. Oﬃce of the Medical Investigat or, University of New Mexico (2020). https://doi.org/10.25827/5s8c-n515

work page doi:10.25827/5s8c-n515 2020
[15]

In: Frangi, A.F., Schnabel, J.A., Davatzikos, C., Alberola- L´ opez, C., Fichtinger, G

Unberath, M., Zaech, J.-N., Lee, S.C., Bier, B., Fotouhi, J., Arman d, M., Navab, N.: Deepdrr – a catalyst for machine learning in ﬂuoroscop y- guided procedures. In: Frangi, A.F., Schnabel, J.A., Davatzikos, C., Alberola- L´ opez, C., Fichtinger, G. (eds.) Medical Image Computing and Comp uter Assisted Intervention – MICCAI 2018, pp. 98–106. Springer, Ch...

work page doi:10.1007/978-3-030-00937-3 2018
[16]

Gemma 3 Technical Report

Team, G., Kamath, A., Ferret, J., Pathak, S., Vieillard, N., Merhej , R., Perrin, S., Matejovicova, T., Ram´ e, A., Rivi` ere, M., Rouillard, L., Mesnard, T.,Cideron, G., Grill, J.-b., Ramos, S., Yvinec, E., Casbon, M., Pot, E., Penchev, I., Liu, G., et al.: Gemma 3 Technical Report (2025). https://doi.org/10.48550/arXiv.2503.19786

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2503.19786 2025
[17]

Qwen2.5-VL Technical Report

Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., Zhong, H., Zhu, Y., Yang, M., Li, Z., Wan, J., Wang, P., Ding, W., Fu, Z., Xu, Y., Ye, J., Zhang, X., Xie, T., Cheng, Z., Zhang, H., Yang, Z., Xu, H., Lin, J.: Qwen2.5-VL Technical Report (20 25). https://doi.org/10.48550/arXiv.2502.13923

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2502.13923
[18]

In: Proceedings of the 37th Internationa l Conference on Neural Information Processing Systems

Dettmers, T., Pagnoni, A., Holtzman, A., Zettlemoyer, L.: Qlora: eﬃcient ﬁne- tuning of quantized llms. In: Proceedings of the 37th Internationa l Conference on Neural Information Processing Systems. NIPS ’23. Curran Assoc iates Inc., Red Hook, NY, USA (2023). https://doi.org/10.52202/075280-0441

work page doi:10.52202/075280-0441 2023
[19]

http://github.com/unslothai/unsloth

Daniel Han, M.H., team, U.: Unsloth. http://github.com/unslothai/unsloth
[20]

Lora vs full fine-tuning: An illusion of equivalence

Shuttleworth, R., Andreas, J., Torralba, A., Sharma, P.: LoRA vs Full Fine-tuning: An Illusion of Equivalence (2024). https://doi.org/10.48550/arXiv.2410.21228 13

work page doi:10.48550/arxiv.2410.21228 2024
[21]

In: Proceedings of the 36th International Conference o n Neural Informa- tion Processing Systems

Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., C hi, E.H., Le, Q.V., Zhou, D.: Chain-of-thought prompting elicits reasoning in large la nguage models. In: Proceedings of the 36th International Conference o n Neural Informa- tion Processing Systems. NIPS ’22. Curran Associates Inc., Red Ho ok, NY, USA (2022). https://doi.org/10.52202/06...

work page doi:10.52202/068431-1800 2022
[22]

Chong Wu, Jiawang Cao, Renjie Xu, Zhuoheng Ran, Maolin Che, Wenbo Zhu, and Hong Yan

Wang, Y., Ma, X., Zhang, G., Ni, Y., Chandra, A., Guo, S., Ren, W., Ar ulraj, A., He, X., Jiang, Z., Li, T., Ku, M., Wang, K., Zhuang, A., Fan, R., Yue, X., Ch en, W.: Mmlu-pro: a more robust and challenging multi-task language unde rstand- ing benchmark. In: Proceedings of the 38th International Confe rence on Neural Information Processing Systems. NIPS ...

work page doi:10.52202/079017-3018 2024