Clutter-Robust Vision-Language-Action Models through Object-Centric and Geometry Grounding
Pith reviewed 2026-05-16 19:25 UTC · model grok-4.3
The pith
Explicit object-centric and geometry grounding disentangles perception from action in VLAs to improve robustness in cluttered robotic manipulation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
OBEYED-VLA augments pretrained VLAs with a perception module that first uses VLM-based object-centric grounding to select task-relevant regions across camera views and then applies geometric grounding to emphasize 3D structure over appearance. The resulting grounded observations are passed to a VLA policy that is fine-tuned exclusively on single-object demonstrations collected without clutter, yielding substantially better performance than baselines in four challenging real-world regimes.
What carries the argument
The OBEYED-VLA perception module, which combines a VLM-based object-centric grounding stage that selects task-relevant regions across views with a complementary geometric grounding stage that emphasizes 3D structure.
If this is right
- Policies become robust to distractor objects without requiring retraining on cluttered data.
- Absent-target rejection improves because grounding prevents action on irrelevant or missing objects.
- Background appearance changes have less effect since the module prioritizes object geometry over visual texture.
- Manipulation of unseen objects in clutter succeeds when the grounding stage identifies relevant regions from clean demonstrations.
Where Pith is reading between the lines
- The same disentanglement could reduce data needs for other VLA tasks such as mobile manipulation or long-horizon planning.
- Combining the grounding module with larger or more recent VLMs might further improve region selection accuracy in dense clutter.
- The approach suggests a path to hybrid systems where perception modules are swapped or upgraded independently of the action policy.
Load-bearing premise
The VLM-based object-centric grounding stage can reliably pick task-relevant object regions even when scenes contain distractors or clutter, and policies trained only on clean single-object data will transfer to multi-object cluttered conditions.
What would settle it
Running the full OBEYED-VLA pipeline on the UR10e tabletop with added distractors and absent targets and observing no reduction in over-grasping or background distraction compared to the ungrounded VLA baseline.
Figures
read the original abstract
Recent Vision-Language-Action (VLA) models have made impressive progress toward general-purpose robotic manipulation by post-training large Vision-Language Models (VLMs) for action prediction. Yet most VLAs entangle perception and control in a monolithic pipeline optimized purely for action, which can erode language-conditioned grounding. In our real-world tabletop tests, policies over-grasp when the target is absent, are distracted by clutter, and overfit to background appearance. To address these issues, we propose OBEYED-VLA (OBject-centric and gEometrY groundED VLA), a framework that explicitly disentangles perceptual grounding from action reasoning. Instead of operating directly on raw RGB, OBEYED-VLA augments VLAs with a perception module that grounds multi-view inputs into task-conditioned, object-centric, and geometry-aware observations. This module includes a VLM-based object-centric grounding stage that selects task-relevant object regions across camera views, along with a complementary geometric grounding stage that emphasizes the 3D structure of these objects over their appearance. The resulting grounded views are then fed to a pretrained VLA policy, which we fine-tune exclusively on single-object demonstrations collected without environmental clutter or non-target objects. On a real-world UR10e tabletop setup, OBEYED-VLA substantially improves robustness over strong VLA baselines across four challenging regimes and multiple difficulty levels: distractor objects, absent-target rejection, background appearance changes, and cluttered manipulation of unseen objects. Ablation studies confirm that both semantic grounding and geometry-aware grounding are critical to these gains. Overall, the results indicate that making perception an explicit, object-centric component is an effective way to strengthen and generalize VLA-based robotic manipulation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces OBEYED-VLA, a framework that augments Vision-Language-Action (VLA) models with an explicit perception module for VLM-based object-centric grounding across views and complementary geometric grounding to emphasize 3D structure. The grounded observations are fed to a pretrained VLA policy fine-tuned exclusively on single-object clean demonstrations. Real-robot experiments on a UR10e tabletop setup show improved robustness over baselines across four regimes (distractors, absent-target rejection, background changes, cluttered unseen objects), with ablations indicating both grounding stages are critical.
Significance. If the results hold, this work provides empirical evidence that explicit object-centric and geometry-aware perception can strengthen VLA robustness to clutter and distractors without requiring cluttered training data. The real-world validation across multiple difficulty levels and the ablation studies are positive aspects supporting the claim that disentangling perception from action improves generalization in robotic manipulation.
major comments (2)
- [Ablation studies] Ablation studies: The ablations demonstrate performance drops when semantic or geometry grounding is removed, but no quantitative metrics (precision, recall, or IoU) are reported for the VLM-based object-centric grounding stage's ability to select task-relevant regions versus ground-truth objects in the actual cluttered multi-object test scenes. This weakens causal attribution of gains to the proposed perception module, as residual errors could stem from imperfect grounding rather than policy limitations.
- [Results section] Results section: Reported improvements over strong VLA baselines lack error bars, trial counts, or statistical details across the four regimes and difficulty levels, making it hard to evaluate the reliability and magnitude of the claimed robustness gains.
minor comments (2)
- [Abstract] The abstract and results could specify the exact baseline VLA models and architectures used for comparison to allow direct replication.
- [Methods] Implementation details on how the geometric grounding stage processes multi-view inputs to emphasize 3D structure over appearance would improve clarity.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback. We address the major comments below and will incorporate the suggested revisions to strengthen the evaluation of the perception module and the statistical presentation of results.
read point-by-point responses
-
Referee: The ablations demonstrate performance drops when semantic or geometry grounding is removed, but no quantitative metrics (precision, recall, or IoU) are reported for the VLM-based object-centric grounding stage's ability to select task-relevant regions versus ground-truth objects in the actual cluttered multi-object test scenes. This weakens causal attribution of gains to the proposed perception module, as residual errors could stem from imperfect grounding rather than policy limitations.
Authors: We agree that direct quantitative evaluation of the grounding stage would strengthen causal attribution. In the revised manuscript we will add precision, recall, and IoU metrics for the VLM-based object-centric grounding by comparing selected regions against manually annotated ground-truth task-relevant objects in a representative subset of the cluttered test scenes. This addition will clarify the accuracy of the perception module and better link it to the observed robustness gains. revision: yes
-
Referee: Reported improvements over strong VLA baselines lack error bars, trial counts, or statistical details across the four regimes and difficulty levels, making it hard to evaluate the reliability and magnitude of the claimed robustness gains.
Authors: We thank the referee for this observation. The current manuscript reports average success rates but does not include detailed statistics. We will revise the results section to specify the number of trials conducted per regime and difficulty level (15 trials per condition), include error bars as standard deviations, and report statistical comparisons (e.g., p-values from paired t-tests) between OBEYED-VLA and the baselines. These changes will provide a more rigorous assessment of the improvements. revision: yes
Circularity Check
No circularity: empirical architecture validated by real-world experiments
full rationale
The paper proposes an empirical framework (OBEYED-VLA) that augments VLAs with explicit object-centric and geometry grounding modules, fine-tunes the policy exclusively on single-object clean demonstrations, and evaluates robustness on cluttered real-world tasks. No mathematical derivations, equations, fitted parameters, or self-citation chains appear in the abstract or described structure. The central claims rest on ablation studies and comparative performance metrics across regimes (distractors, absent targets, background changes, unseen objects), which are externally falsifiable via the reported real-world UR10e experiments rather than reducing to self-defined inputs or prior self-citations. This is a standard non-circular empirical contribution.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Pretrained VLMs can accurately identify and segment task-relevant objects from multi-view RGB inputs in cluttered scenes
- domain assumption Policies trained only on single-object demonstrations without clutter will generalize when given grounded observations in cluttered test environments
invented entities (1)
-
OBEYED-VLA perception module
no independent evidence
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
OBEYED-VLA augments VLAs with a perception module that grounds multi-view inputs into task-conditioned, object-centric, and geometry-aware observations... VLM-based object-centric grounding stage... geometric grounding stage that emphasizes the 3D structure
-
IndisputableMonolith/Foundation/AbsoluteFloorClosure.leanabsolute_floor_iff_bare_distinguishability unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
fine-tune exclusively on single-object demonstrations collected without environmental clutter
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
-
OA-WAM: Object-Addressable World Action Model for Robust Robot Manipulation
OA-WAM uses persistent address vectors and dynamic content vectors in object slots to enable addressable world-action prediction, improving robustness on manipulation benchmarks under scene changes.
Reference graph
Works this paper leans on
-
[1]
Octo: An open-source generalist robot policy,
Octo Model Team, D. Ghosh, H. Walke, K. Pertsch, K. Black, O. Mees, S. Dasari, J. Hejna, C. Xu, J. Luo, T. Kreiman, Y . Tan, L. Y . Chen, P. Sanketi, Q. Vuong, T. Xiao, D. Sadigh, C. Finn, and S. Levine, “Octo: An open-source generalist robot policy,” inProceedings of Robotics: Science and Systems, Delft, Netherlands, 2024
work page 2024
-
[2]
Vision-language foundation models as effective robot imitators,
X. Li, M. Liu, H. Zhang, C. Yu, J. Xu, H. Wu, C. Cheang, Y . Jing, W. Zhang, H. Liu, H. Li, and T. Kong, “Vision-language foundation models as effective robot imitators,” inThe Twelfth International Conference on Learning Representations, 2024. [Online]. Available: https://openreview.net/forum?id=lFYj0oibGR
work page 2024
-
[3]
OpenVLA: An open-source vision-language-action model,
M. J. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, E. P. Foster, P. R. Sanketi, Q. Vuong, T. Kollar, B. Burchfiel, R. Tedrake, D. Sadigh, S. Levine, P. Liang, and C. Finn, “OpenVLA: An open-source vision-language-action model,” in8th Annual Conference on Robot Learning, 2024. [Online]. Available: https://openreview.net/fo...
work page 2024
-
[4]
$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control
K. Black, N. Brown, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, L. Groom, K. Hausman, B. Ichteret al., “π 0: A vision-language-action flow model for general robot control,”arXiv preprint arXiv:2410.24164, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[5]
FAST: Efficient Action Tokenization for Vision-Language-Action Models
K. Pertsch, K. Stachowicz, B. Ichter, D. Driess, S. Nair, Q. Vuong, O. Mees, C. Finn, and S. Levine, “Fast: Efficient action tokenization for vision-language-action models,”arXiv preprint arXiv:2501.09747, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[6]
GR00T N1: An Open Foundation Model for Generalist Humanoid Robots
J. Bjorck, F. Casta ˜neda, N. Cherniadev, X. Da, R. Ding, L. Fan, Y . Fang, D. Fox, F. Hu, S. Huanget al., “Gr00t n1: An open foundation model for generalist humanoid robots,”arXiv preprint arXiv:2503.14734, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[7]
Bridgedata v2: A dataset for robot learning at scale,
H. R. Walke, K. Black, T. Z. Zhao, Q. Vuong, C. Zheng, P. Hansen- Estruch, A. W. He, V . Myers, M. J. Kim, M. Duet al., “Bridgedata v2: A dataset for robot learning at scale,” inConference on Robot Learning. PMLR, 2023, pp. 1723–1736. 13
work page 2023
-
[8]
Open x-embodiment: Robotic learning datasets and rt-x models: Open x-embodiment collab- oration 0,
A. O’Neill, A. Rehman, A. Maddukuri, A. Gupta, A. Padalkar, A. Lee, A. Pooley, A. Gupta, A. Mandlekar, A. Jainet al., “Open x-embodiment: Robotic learning datasets and rt-x models: Open x-embodiment collab- oration 0,” in2024 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2024, pp. 6892–6903
work page 2024
-
[9]
DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset
A. Khazatsky, K. Pertsch, S. Nair, A. Balakrishna, S. Dasari, S. Karam- cheti, S. Nasiriany, M. K. Srirama, L. Y . Chen, K. Elliset al., “Droid: A large-scale in-the-wild robot manipulation dataset,”arXiv preprint arXiv:2403.12945, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[10]
Robotic Control via Embodied Chain-of-Thought Reasoning
M. Zawalski, W. Chen, K. Pertsch, O. Mees, C. Finn, and S. Levine, “Robotic control via embodied chain-of-thought reasoning,”arXiv preprint arXiv:2407.08693, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[11]
Fast ecot: Efficient embodied chain-of-thought via thoughts reuse,
Z. Duan, Y . Zhang, S. Geng, G. Liu, J. Boedecker, and C. X. Lu, “Fast ecot: Efficient embodied chain-of-thought via thoughts reuse,”arXiv preprint arXiv:2506.07639, 2025
-
[12]
Cot-vla: Visual chain-of-thought reasoning for vision-language-action models,
Q. Zhao, Y . Lu, M. J. Kim, Z. Fu, Z. Zhang, Y . Wu, Z. Li, Q. Ma, S. Han, C. Finn, A. Handa, T.-Y . Lin, G. Wetzstein, M.-Y . Liu, and D. Xiang, “Cot-vla: Visual chain-of-thought reasoning for vision-language-action models,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2025, pp. 1702–1713
work page 2025
-
[13]
OpenAI, “Gpt-4 technical report,” 2024. [Online]. Available: https: //arxiv.org/abs/2303.08774
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[14]
J. Li, D. Li, S. Savarese, and S. Hoi, “Blip-2: bootstrapping language- image pre-training with frozen image encoders and large language models,” inICML, ser. ICML’23, 2023
work page 2023
-
[15]
S. Bai, K. Chen, X. Liu, J. Wang, W. Ge, S. Song, K. Dang, P. Wang, S. Wang, J. Tang, H. Zhong, Y . Zhu, M. Yang, Z. Li, J. Wan, P. Wang, W. Ding, Z. Fu, Y . Xu, J. Ye, X. Zhang, T. Xie, Z. Cheng, H. Zhang, Z. Yang, H. Xu, and J. Lin, “Qwen2.5-vl technical report,”arXiv preprint arXiv:2502.13923, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[16]
S. Bai, Y . Cai, R. Chen, K. Chen, X. Chen, Z. Cheng, L. Deng, W. Ding, C. Gao, C. Ge, W. Ge, Z. Guo, Q. Huang, J. Huang, F. Huang, B. Hui, S. Jiang, Z. Li, M. Li, M. Li, K. Li, Z. Lin, J. Lin, X. Liu, J. Liu, C. Liu, Y . Liu, D. Liu, S. Liu, D. Lu, R. Luo, C. Lv, R. Men, L. Meng, X. Ren, X. Ren, S. Song, Y . Sun, J. Tang, J. Tu, J. Wan, P. Wang, P. Wang,...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[17]
Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V
J. Yang, H. Zhang, F. Li, X. Zou, C. Li, and J. Gao, “Set-of-mark prompting unleashes extraordinary visual grounding in gpt-4v,”arXiv preprint arXiv:2310.11441, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[18]
PaliGemma: A versatile 3B VLM for transfer
L. Beyer, A. Steiner, A. S. Pinto, A. Kolesnikov, X. Wang, D. Salz, M. Neumann, I. Alabdulmohsin, M. Tschannen, E. Bugliarello et al., “Paligemma: A versatile 3b vlm for transfer,”arXiv preprint arXiv:2407.07726, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[19]
Prismatic vlms: Investigating the design space of visually- conditioned language models,
S. Karamcheti, S. Nair, A. Balakrishna, P. Liang, T. Kollar, and D. Sadigh, “Prismatic vlms: Investigating the design space of visually- conditioned language models,” inInternational Conference on Machine Learning (ICML), 2024
work page 2024
-
[20]
$\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization
P. Intelligence, K. Black, N. Brown, J. Darpinian, K. Dhabalia, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusaiet al., “π 0.5: a vision- language-action model with open-world generalization,”arXiv preprint arXiv:2504.16054, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[21]
Flow matching for generative modeling,
Y . Lipman, R. T. Q. Chen, H. Ben-Hamu, M. Nickel, and M. Le, “Flow matching for generative modeling,” inThe Eleventh International Conference on Learning Representations, 2023. [Online]. Available: https://openreview.net/forum?id=PqvMRDCJT9t
work page 2023
-
[22]
AHA: A Vision-Language-Model for Detecting and Reasoning Over Failures in Robotic Manipulation
J. Duan, W. Pumacay, N. Kumar, Y . R. Wang, S. Tian, W. Yuan, R. Krishna, D. Fox, A. Mandlekar, and Y . Guo, “Aha: A vision-language- model for detecting and reasoning over failures in robotic manipulation,” arXiv preprint arXiv:2410.00371, 2024
-
[23]
Failsafe: Reasoning and recovery from failures in vision-language-action models,
Z. Lin, J. Duan, H. Fang, D. Fox, R. Krishna, C. Tan, and B. Wen, “Failsafe: Reasoning and recovery from failures in vision-language-action models,” 2025. [Online]. Available: https: //arxiv.org/abs/2510.01642
-
[24]
Hi robot: Open-ended instruction following with hierarchical vision-language-action models,
L. X. Shi, brian ichter, M. R. Equi, L. Ke, K. Pertsch, Q. Vuong, J. Tanner, A. Walling, H. Wang, N. Fusai, A. Li-Bell, D. Driess, L. Groom, S. Levine, and C. Finn, “Hi robot: Open-ended instruction following with hierarchical vision-language-action models,” inForty- second International Conference on Machine Learning, 2025. [Online]. Available: https://o...
work page 2025
-
[25]
HAMSTER: Hierarchical action models for open-world robot manipulation,
Y . Li, Y . Deng, J. Zhang, J. Jang, M. Memmel, C. R. Garrett, F. Ramos, D. Fox, A. Li, A. Gupta, and A. Goyal, “HAMSTER: Hierarchical action models for open-world robot manipulation,” inThe Thirteenth International Conference on Learning Representations, 2025. [Online]. Available: https://openreview.net/forum?id=h7aQxzKbq6
work page 2025
-
[26]
Moka: Open-world robotic manipulation through mark-based visual prompting,
K. Fang, F. Liu, P. Abbeel, and S. Levine, “Moka: Open-world robotic manipulation through mark-based visual prompting,”Robotics: Science and Systems (RSS), 2024
work page 2024
-
[27]
Rekep: Spatio-temporal reasoning of relational keypoint constraints for robotic manipulation,
W. Huang, C. Wang, Y . Li, R. Zhang, and L. Fei-Fei, “Rekep: Spatio-temporal reasoning of relational keypoint constraints for robotic manipulation,” inProceedings of The 8th Conference on Robot Learning, ser. Proceedings of Machine Learning Research, P. Agrawal, O. Kroemer, and W. Burgard, Eds., vol. 270. PMLR, 06–09 Nov 2025, pp. 4573–4602. [Online]. Ava...
work page 2025
-
[28]
Run-time observation interventions make vision-language-action models more visually robust,
A. J. Hancock, A. Z. Ren, and A. Majumdar, “Run-time observation interventions make vision-language-action models more visually robust,” in2025 IEEE International Conference on Robotics and Automation (ICRA), 2025, pp. 9499–9506
work page 2025
-
[29]
Grad-cam: Visual explanations from deep networks via gradient-based localization,
R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra, “Grad-cam: Visual explanations from deep networks via gradient-based localization,”International Journal of Computer Vision, vol. 128, no. 2, pp. 336–359, 2020
work page 2020
-
[30]
A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y . Loet al., “Segment anything,” inProceedings of the IEEE/CVF international conference on computer vision, 2023, pp. 4015–4026
work page 2023
-
[31]
SAM 2: Segment anything in images and videos,
N. Ravi, V . Gabeur, Y .-T. Hu, R. Hu, C. Ryali, T. Ma, H. Khedr, R. R ¨adle, C. Rolland, L. Gustafson, E. Mintun, J. Pan, K. V . Alwala, N. Carion, C.-Y . Wu, R. Girshick, P. Dollar, and C. Feichtenhofer, “SAM 2: Segment anything in images and videos,” inThe Thirteenth International Conference on Learning Representations, 2025. [Online]. Available: https...
work page 2025
-
[32]
Segment and recognize anything at any granularity,
F. Li, H. Zhang, P. Sun, X. Zou, S. Liu, C. Li, J. Yang, L. Zhang, and J. Gao, “Segment and recognize anything at any granularity,” in European Conference on Computer Vision. Springer, 2024, pp. 467– 484
work page 2024
-
[33]
Segment everything everywhere all at once,
X. Zou, J. Yang, H. Zhang, F. Li, L. Li, J. Wang, L. Wang, J. Gao, and Y . J. Lee, “Segment everything everywhere all at once,”Advances in neural information processing systems, vol. 36, pp. 19 769–19 782, 2023
work page 2023
-
[34]
Detrs with collaborative hybrid as- signments training,
Z. Zong, G. Song, and Y . Liu, “Detrs with collaborative hybrid as- signments training,” inProceedings of the IEEE/CVF international conference on computer vision, 2023, pp. 6748–6758
work page 2023
-
[35]
Objects365: A large-scale, high-quality dataset for object detection,
S. Shao, Z. Li, T. Zhang, C. Peng, G. Yu, X. Zhang, J. Li, and J. Sun, “Objects365: A large-scale, high-quality dataset for object detection,” inProceedings of the IEEE/CVF international conference on computer vision, 2019, pp. 8430–8439
work page 2019
-
[36]
Lvis: A dataset for large vocabulary instance segmentation,
A. Gupta, P. Dollar, and R. Girshick, “Lvis: A dataset for large vocabulary instance segmentation,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019
work page 2019
-
[37]
YOLOv11: An Overview of the Key Architectural Enhancements
R. Khanam and M. Hussain, “Yolov11: An overview of the key architectural enhancements,”arXiv preprint arXiv:2410.17725, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[38]
Grounding dino: Marrying dino with grounded pre-training for open-set object detection,
S. Liu, Z. Zeng, T. Ren, F. Li, H. Zhang, J. Yang, Q. Jiang, C. Li, J. Yang, H. Suet al., “Grounding dino: Marrying dino with grounded pre-training for open-set object detection,” inEuropean conference on computer vision. Springer, 2024, pp. 38–55
work page 2024
-
[39]
Putting the object back into video object segmentation,
H. K. Cheng, S. W. Oh, B. Price, J.-Y . Lee, and A. Schwing, “Putting the object back into video object segmentation,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 3151–3161
work page 2024
-
[40]
L. Yang, B. Kang, Z. Huang, Z. Zhao, X. Xu, J. Feng, and H. Zhao, “Depth anything v2,”Advances in Neural Information Processing Sys- tems, vol. 37, pp. 21 875–21 911, 2024
work page 2024
-
[41]
LoRA: Low-rank adaptation of large language models,
E. J. Hu, yelong shen, P. Wallis, Z. Allen-Zhu, Y . Li, S. Wang, L. Wang, and W. Chen, “LoRA: Low-rank adaptation of large language models,” inInternational Conference on Learning Representations, 2022
work page 2022
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.