DyGRO-VLA: Cross-Task Scaling of Vision-Language-Action Models via Dynamic Grouped Residual Optimization
Pith reviewed 2026-05-20 12:34 UTC · model grok-4.3
The pith
DyGRO-VLA improves cross-task generalization in vision-language-action models through information-theoretic latents and dynamic residual optimization in reinforcement learning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DyGRO-VLA is a two-stage optimization framework that effectively captures cross-task latent representations based on information-theoretic principles and dynamically refines policy optimization via a mixture-of-RL-residuals, allowing the RL optimizer to exploit task-relevant latent information while strategically mitigating adverse interference on the learned representations throughout the optimization process.
What carries the argument
The mixture-of-RL-residuals in the second stage, which dynamically groups and applies residuals to protect cross-task latent representations identified in the first stage.
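As a rough illustration of what a mixture-of-RL-residuals layer could look like, the sketch below routes a latent to its top-m residual experts and adds their weighted outputs to a base action. It leans only on the passage quoted later on this page ("dynamic routing ... top-m experts ... load-balancing regularizer"). Everything else is an assumption made for illustration and not the paper's implementation: the names `route_top_m`, `morr_action`, and `load_balance_penalty`, the softmax gating, and the specific penalty form.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def route_top_m(gate_logits, m):
    """Return (expert indices, renormalized weights) for the m largest gate logits."""
    probs = softmax(gate_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:m]
    total = sum(probs[i] for i in top)
    return top, [probs[i] / total for i in top]

def morr_action(base_action, latent, experts, gate_logits, m=2):
    """Base action plus the weighted sum of residuals from the top-m experts.

    experts: list of callables mapping a latent to a residual with the same
    length as base_action (hypothetical interface).
    """
    idx, weights = route_top_m(gate_logits, m)
    out = list(base_action)
    for i, w in zip(idx, weights):
        residual = experts[i](latent)
        out = [o + w * r for o, r in zip(out, residual)]
    return out

def load_balance_penalty(batch_gate_logits):
    """Hypothetical regularizer: E * sum_e (mean routing probability of expert e)^2.

    Equals 1.0 when routing is uniform across experts and grows as routing
    concentrates on fewer experts.
    """
    n_experts = len(batch_gate_logits[0])
    mean_p = [0.0] * n_experts
    for logits in batch_gate_logits:
        for e, p in enumerate(softmax(logits)):
            mean_p[e] += p / len(batch_gate_logits)
    return n_experts * sum(p * p for p in mean_p)
```

Renormalizing the selected weights keeps the residual scale comparable across values of m, which is one common design choice in sparse mixture-of-experts layers; whether DyGRO-VLA does the same is not stated here.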
If this is right
- VLA models achieve higher performance in multi-task training scenarios without overfitting to specific tasks.
- The approach maintains representation quality under task distribution shifts.
- Policy optimization becomes more effective for generalist robotic control.
- Real-world robotic applications benefit from improved adaptability across varied tasks.
- The method provides a way to scale VLA models to broader task sets while preserving learned features.
Where Pith is reading between the lines
- Similar dynamic grouping could help in other multi-agent or multi-environment RL settings.
- The information-theoretic principle for grouping might generalize to identifying task clusters without manual specification.
- Future experiments could measure how the number of tasks affects the stability of the extracted latents.
- This technique might reduce the data requirements for training versatile robotic policies.
Load-bearing premise
Cross-task latent representations identified via information-theoretic principles remain stable and beneficial under distribution shift and during residual-based policy refinement, without needing explicit assumptions about task similarities.
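One hedged way to probe this premise is to compare latents for the same observations before and after stage-2 refinement, and to check how much information discretized latents carry about task labels. The sketch below is an illustration only: the function names are hypothetical, the mutual information is a plug-in estimate that assumes the latents have already been discretized (for example into cluster indices), and the paper may use different metrics.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)

def mean_latent_similarity(before, after):
    """Average cosine similarity over paired latents (same observations)."""
    if not before or len(before) != len(after):
        raise ValueError("need two non-empty lists of equal length")
    return sum(cosine(u, v) for u, v in zip(before, after)) / len(before)

def mutual_information(codes, labels):
    """Plug-in estimate of I(code; label) in nats for two discrete sequences."""
    n = len(codes)
    joint = Counter(zip(codes, labels))
    p_code = Counter(codes)
    p_label = Counter(labels)
    mi = 0.0
    for (c, l), k in joint.items():
        # p(c,l) * log( p(c,l) / (p(c) p(l)) ), written with raw counts
        mi += (k / n) * math.log(k * n / (p_code[c] * p_label[l]))
    return mi
```

Running both functions on latents captured before and after residual optimization gives the kind of before-versus-after comparison the premise depends on.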
What would settle it
A direct test would involve training on one group of tasks, introducing a new task with substantially different observations or actions, and verifying if the full method yields higher success rates on both old and new tasks compared to a standard RL baseline.
Original abstract
Recent progress in Reinforcement Learning (RL) provides a principled approach to optimizing Vision-Language-Action (VLA) models, facilitating a shift from trajectory imitation to active learning in the task environment. Despite improvements in control precision, most RL optimizers remain task-specific, which reduces VLA models from generalist controllers to policies that overfit to a narrow set of tasks. In this study, we conduct an in-depth analysis of this phenomenon and highlight the importance of cross-task feature representations for improving the generalizability of VLA models. Motivated by this finding, we introduce DyGRO-VLA, a two-stage optimization framework that 1) effectively captures cross-task latent representations based on information-theoretic principles, and 2) dynamically refines policy optimization via a mixture-of-RL-residuals. DyGRO-VLA enables the RL optimizer to exploit task-relevant latent information while strategically mitigating adverse interference on the learned representations throughout the optimization process. We evaluate our approach on the LIBERO and RoboTwin2 benchmarks and further validate it in the real world, demonstrating consistent improvements over strong baselines under multi-task training and distribution shift.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces DyGRO-VLA, a two-stage optimization framework for Vision-Language-Action (VLA) models. The first stage captures cross-task latent representations based on information-theoretic principles. The second stage applies dynamic grouped residual optimization via a mixture-of-RL-residuals to refine policies while mitigating adverse interference on learned representations. The approach is evaluated on the LIBERO and RoboTwin2 benchmarks plus real-world tasks, reporting consistent improvements over strong baselines under multi-task training and distribution shift.
Significance. If the stability of the information-theoretic cross-task latents under residual refinement is confirmed, the work could meaningfully advance generalist VLA controllers by reducing task-specific overfitting in RL optimization. The dynamic grouping mechanism addresses a recognized challenge in multi-task robotics learning.
Major comments (2)
- [Method and Experiments sections] The central claim that stage-1 information-theoretic latent identification produces representations that stage-2 mixture-of-RL-residuals can exploit without introducing instability or adverse interference is load-bearing yet lacks direct empirical verification. No ablation isolating latent stability (e.g., mutual information with task labels or representation similarity metrics before versus after residual optimization) is reported, especially under distribution shift from the initial cross-task capture phase.
- [Experiments section] The evaluation claims of consistent improvements under multi-task training and distribution shift rest on benchmark results whose statistical robustness is unclear; variance across random seeds, confidence intervals, or significance tests are not detailed, weakening support for the generalizability assertions.
Minor comments (2)
- [Abstract] Clarify the precise information-theoretic quantity (e.g., mutual information, entropy) and the dynamic grouping criterion in the abstract and method overview for immediate readability.
- [Experiments] Ensure all baseline implementations and hyperparameter choices are fully specified to support reproducibility of the reported gains.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive review. We address each major comment below and have revised the manuscript to incorporate additional empirical verification and statistical reporting as suggested.
Point-by-point responses
- Referee: [Method and Experiments sections] The central claim that stage-1 information-theoretic latent identification produces representations that stage-2 mixture-of-RL-residuals can exploit without introducing instability or adverse interference is load-bearing yet lacks direct empirical verification. No ablation isolating latent stability (e.g., mutual information with task labels or representation similarity metrics before versus after residual optimization) is reported, especially under distribution shift from the initial cross-task capture phase.
  Authors: We agree that direct ablations on latent stability would provide stronger support for the central claim. While the performance gains under distribution shift already suggest the representations remain effective after stage-2 refinement, we acknowledge the absence of explicit metrics such as mutual information preservation or representation similarity. In the revised manuscript we have added a new ablation subsection (Section 4.3) reporting mutual information with task labels and cosine similarity of latents before versus after residual optimization on LIBERO under distribution shift. The results show less than 4% drop in mutual information and similarity scores above 0.88, consistent with the dynamic grouping mechanism limiting adverse interference.
  Revision: yes
- Referee: [Experiments section] The evaluation claims of consistent improvements under multi-task training and distribution shift rest on benchmark results whose statistical robustness is unclear; variance across random seeds, confidence intervals, or significance tests are not detailed, weakening support for the generalizability assertions.
  Authors: We thank the referee for this observation. The original submission reported average performance but omitted detailed variance and statistical tests. In the revised Experiments section we now include standard deviations across five random seeds, 95% confidence intervals, and paired statistical significance tests (Wilcoxon signed-rank) for all main results on LIBERO, RoboTwin2, and real-world tasks. The updated tables confirm that reported improvements remain statistically significant (p < 0.05) under both multi-task training and distribution shift, thereby strengthening the generalizability claims.
  Revision: yes
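The kind of reporting described in this response can be sketched as follows. This is an illustration using only the Python standard library, not the authors' evaluation code: the success rates are made-up placeholders, the confidence interval uses a Student-t critical value hard-coded for five seeds, and the signed-rank test is an exact enumeration that assumes no zero or tied differences.

```python
import math
from itertools import product
from statistics import mean, stdev

# Two-sided 95% Student-t critical value for 4 degrees of freedom (5 seeds).
T_CRIT_95_DF4 = 2.776

def mean_std_ci(xs, t_crit=T_CRIT_95_DF4):
    """Return (mean, sample std, (ci_low, ci_high)) for per-seed scores."""
    m, s = mean(xs), stdev(xs)
    half = t_crit * s / math.sqrt(len(xs))
    return m, s, (m - half, m + half)

def wilcoxon_signed_rank_exact(a, b):
    """Exact two-sided p-value for paired samples a and b.

    Assumes no zero differences and no ties among the absolute differences.
    """
    diffs = [x - y for x, y in zip(a, b)]
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0] * len(diffs)
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    total = sum(ranks)
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_obs = min(w_plus, total - w_plus)
    # Under the null every sign pattern is equally likely; count the patterns
    # at least as extreme as the observed one.
    extreme = 0
    for signs in product((0, 1), repeat=len(diffs)):
        w = sum(r for r, keep in zip(ranks, signs) if keep)
        if min(w, total - w) <= w_obs:
            extreme += 1
    return extreme / 2 ** len(diffs)

# Placeholder per-seed success rates (not results from the paper).
method = [0.82, 0.85, 0.80, 0.84, 0.83]
baseline = [0.78, 0.80, 0.79, 0.77, 0.81]
print(mean_std_ci(method))
print(wilcoxon_signed_rank_exact(method, baseline))  # 0.0625
```

With n pairs the smallest two-sided p-value this exact test can produce is 2 / 2^n, which is 0.0625 for five pairs, so reaching p < 0.05 requires pairing over more than five units, for example over tasks as well as seeds. How the pairs are formed in the revised manuscript is not stated here.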
Circularity Check
No circularity: two-stage framework uses independent info-theoretic capture followed by residual refinement
Full rationale
The paper presents a two-stage process: stage 1 captures cross-task latent representations via information-theoretic principles, and stage 2 applies dynamic mixture-of-RL-residuals for policy refinement. No equations, self-citations, or fitted parameters are shown that reduce any claimed prediction or generalizability gain to a quantity defined by the method's own outputs or prior self-referential normalizations. Evaluations on LIBERO, RoboTwin2, and real-world tasks provide external benchmarks, keeping the derivation self-contained against independent data.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
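- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  $\max_{p_\theta(z \mid o)} I(Z;A) - \lambda_{\mathrm{IB}}\, I(Z;O)$ ... variational lower bound $\mathcal{L}_{\mathrm{base}} = \mathbb{E}[-\log \pi_\theta(a \mid z)] + \lambda_{\mathrm{IB}} \left[ \mathbb{E}_{P_{OZ}}[T_\psi(o,z)] - \log \mathbb{E}_{P_O P_Z}[e^{T_\psi(o,z)}] \right]$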
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean, LogicNat recovery (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Mixture-of-RL-Residuals (MoRR) ... dynamic routing ... top-$m$ experts ... load-balancing regularizer $\mathcal{L}_{\mathrm{LB}}$
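To make the information-bottleneck passage quoted above concrete, here is a minimal numeric sketch of a loss with that shape: a negative log-likelihood term plus λ_IB times a Donsker-Varadhan style estimate of I(Z;O) built from critic scores T_ψ. The function names, the plain-list inputs, and the toy numbers are assumptions for illustration; the paper's networks and sampling scheme are not shown on this page.

```python
import math

def log_mean_exp(xs):
    """Stable log of the average of exp(x) over a list of floats."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs) / len(xs))

def dv_mutual_info_estimate(joint_scores, marginal_scores):
    """E_{P_OZ}[T] - log E_{P_O P_Z}[exp(T)] from critic scores.

    joint_scores: T(o, z) on pairs drawn together from the joint.
    marginal_scores: T(o, z) on pairs with z shuffled across observations.
    """
    return sum(joint_scores) / len(joint_scores) - log_mean_exp(marginal_scores)

def base_loss(action_log_probs, joint_scores, marginal_scores, lambda_ib):
    """Negative log-likelihood of actions plus the weighted I(Z;O) estimate."""
    nll = -sum(action_log_probs) / len(action_log_probs)
    return nll + lambda_ib * dv_mutual_info_estimate(joint_scores, marginal_scores)

# Toy numbers only: two samples, each action assigned probability 0.5.
loss = base_loss([math.log(0.5)] * 2, [1.0, 1.0], [0.0, 0.0], lambda_ib=0.1)
print(loss)
```

In this form the policy term rewards latents that predict actions, while the weighted estimate penalizes information the latent retains about the observation, which matches the trade-off in the quoted objective.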
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.