Vision-Language Foundation Models as Effective Robot Imitators
Pith reviewed 2026-05-16 21:40 UTC · model grok-4.3
The pith
Simple fine-tuning adapts pre-trained vision-language models into robot policies that beat prior methods.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
RoboFlamingo shows that open-source VLMs such as OpenFlamingo can be used for single-step vision-language comprehension, paired with an explicit policy head to model sequential history, and then slightly fine-tuned by imitation learning on language-conditioned manipulation datasets, resulting in performance that exceeds prior state-of-the-art methods on the tested benchmarks.
What carries the argument
The explicit policy head that converts the VLM's single-step outputs into sequential robot actions while preserving the pre-trained vision-language weights.
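The decomposition can be sketched in code. The toy sketch below is illustrative only: `vlm_encode`, the dimensions, and the tiny recurrent head are hypothetical stand-ins (the paper pairs real OpenFlamingo features with an LSTM/MLP-style head), but it shows the key design point — sequential history lives in the head, while the VLM only does single-step comprehension.

```python
import numpy as np

rng = np.random.default_rng(0)

def vlm_encode(image_tokens, text_tokens, W):
    # Stand-in for the frozen VLM: maps ONE observation plus the language
    # instruction to a single-step feature vector (toy linear projection).
    x = np.concatenate([image_tokens, text_tokens])
    return np.tanh(W @ x)

class RecurrentPolicyHead:
    # Toy recurrent policy head: the component that accumulates history
    # and is the main trainable part when the VLM backbone stays frozen.
    def __init__(self, feat_dim, hidden_dim, act_dim, rng):
        self.W_h = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim + feat_dim))
        self.W_a = rng.normal(scale=0.1, size=(act_dim, hidden_dim))
        self.h = np.zeros(hidden_dim)

    def step(self, feat):
        # Hidden state carries the sequential history across timesteps.
        self.h = np.tanh(self.W_h @ np.concatenate([self.h, feat]))
        return np.tanh(self.W_a @ self.h)  # e.g. 6-DoF pose delta + gripper

feat_dim, hidden_dim, act_dim = 8, 16, 7
W_vlm = rng.normal(scale=0.1, size=(feat_dim, 12))
head = RecurrentPolicyHead(feat_dim, hidden_dim, act_dim, rng)

actions = []
for t in range(5):                 # open-loop rollout over 5 timesteps
    obs = rng.normal(size=8)       # fake image tokens
    instr = rng.normal(size=4)     # fake instruction tokens
    feat = vlm_encode(obs, instr, W_vlm)  # single-step comprehension
    actions.append(head.step(feat))       # history handled by the head
print(len(actions), actions[0].shape)
```

Because only the head's recurrence carries state, the expensive VLM forward pass can be run once per frame and the lightweight head unrolled separately, which is what makes open-loop deployment on modest hardware plausible.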
If this is right
- The framework supports open-loop control that runs on low-performance platforms.
- Different pre-trained VLMs exhibit measurable differences in manipulation-task behavior after the same fine-tuning.
- Robot policies become easier to customize because only small imitation datasets and standard fine-tuning are required.
- The separation of single-step comprehension from sequential policy modeling reduces the need for end-to-end retraining of the entire model.
Where Pith is reading between the lines
- If the pattern holds for newer VLMs, the same lightweight adaptation could be applied to other embodied tasks such as navigation or multi-robot coordination.
- The approach may lower the data barrier for individuals or small labs to produce task-specific robot policies by leveraging publicly available manipulation datasets.
- Combining the method with limited online fine-tuning or human feedback could address distribution shifts that appear in real-world deployment.
Load-bearing premise
Modest fine-tuning on existing language-conditioned manipulation datasets is sufficient to transfer general vision-language understanding into reliable sequential robot policies without catastrophic forgetting or domain shift.
What would settle it
A new held-out robot manipulation benchmark on which RoboFlamingo underperforms current specialized methods or exhibits clear loss of vision-language capability after fine-tuning.
Original abstract
Recent progress in vision language foundation models has shown their ability to understand multimodal data and resolve complicated vision language tasks, including robotics manipulation. We seek a straightforward way of making use of existing vision-language models (VLMs) with simple fine-tuning on robotics data. To this end, we derive a simple and novel vision-language manipulation framework, dubbed RoboFlamingo, built upon the open-source VLMs, OpenFlamingo. Unlike prior works, RoboFlamingo utilizes pre-trained VLMs for single-step vision-language comprehension, models sequential history information with an explicit policy head, and is slightly fine-tuned by imitation learning only on language-conditioned manipulation datasets. Such a decomposition provides RoboFlamingo the flexibility for open-loop control and deployment on low-performance platforms. By exceeding the state-of-the-art performance with a large margin on the tested benchmark, we show RoboFlamingo can be an effective and competitive alternative to adapt VLMs to robot control. Our extensive experimental results also reveal several interesting conclusions regarding the behavior of different pre-trained VLMs on manipulation tasks. We believe RoboFlamingo has the potential to be a cost-effective and easy-to-use solution for robotics manipulation, empowering everyone with the ability to fine-tune their own robotics policy.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces RoboFlamingo, a framework that adapts pre-trained vision-language models (VLMs) such as OpenFlamingo for robot manipulation. The VLM is used for single-step vision-language comprehension, an explicit policy head models sequential history, and the system is lightly fine-tuned via imitation learning on language-conditioned manipulation datasets. The central claim is that this yields large-margin state-of-the-art performance on the evaluated benchmark, establishing VLMs as effective and competitive robot imitators with a simple, open-loop, cost-effective adaptation strategy.
Significance. If the empirical results prove robust under standard controls, the work offers a practical, low-resource route to leverage existing VLMs for robotics without full retraining. The explicit decomposition into single-step VLM comprehension plus a lightweight policy head supports open-loop deployment on modest hardware and could lower barriers for fine-tuning robot policies. The reported insights on different VLMs' behavior on manipulation tasks would also be of interest to the community.
major comments (3)
- [§4] §4 (experimental results) and associated tables: the large-margin SOTA claim is presented without reported data splits, hyper-parameter search protocols, or statistical significance tests, so it is impossible to verify whether the gains survive standard controls for evaluation protocol or random seeds.
- [§3.2] §3.2 (architecture) and §4.2 (ablations): no ablation isolates the contribution of the frozen pre-trained VLM backbone versus the added policy head and fine-tuning; without this, the claim that general VLM understanding transfers to sequential policies cannot be distinguished from gains due to the head alone.
- [§5] §5 (discussion and generalization): the manuscript contains no out-of-distribution or domain-shift experiments, leaving untested the assumption that modest imitation learning on existing datasets avoids catastrophic forgetting or distribution mismatch when deploying the adapted VLM.
minor comments (2)
- [Figures] The captions of Figures 1 and 2 could more explicitly label the policy-head versus VLM components and the open-loop versus closed-loop distinction.
- [Abstract] The abstract states that 'extensive experimental results also reveal several interesting conclusions' about different VLMs; a short dedicated subsection summarizing these would improve readability.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We have carefully considered each comment and will make revisions to address the concerns regarding experimental rigor, ablations, and generalization. Our responses are detailed below.
Point-by-point responses
-
Referee: [§4] §4 (experimental results) and associated tables: the large-margin SOTA claim is presented without reported data splits, hyper-parameter search protocols, or statistical significance tests, so it is impossible to verify whether the gains survive standard controls for evaluation protocol or random seeds.
Authors: We agree with this observation and acknowledge that the original submission lacked sufficient details on the evaluation protocol. In the revised manuscript, we will add comprehensive information on the data splits used (adhering to the standard splits provided by benchmarks such as CALVIN), the hyperparameter tuning process (including the ranges searched for learning rate, batch size, and number of epochs), and statistical analysis by reporting mean and standard deviation across multiple random seeds (e.g., 5 seeds). These additions will enable verification that the performance gains are robust and statistically significant. revision: yes
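The proposed seed-level reporting is straightforward to implement. The sketch below uses made-up per-seed success rates (the real numbers would come from benchmark runs) and computes mean ± standard deviation plus a hand-rolled Welch t statistic, avoiding any dependency beyond NumPy.

```python
import numpy as np

# Hypothetical per-seed success rates (5 seeds each) for the adapted
# model and a baseline; these values are illustrative, not from the paper.
ours = np.array([0.86, 0.84, 0.88, 0.85, 0.87])
baseline = np.array([0.71, 0.74, 0.69, 0.72, 0.70])

def report(name, x):
    # Sample standard deviation (ddof=1), as is standard for seed studies.
    return f"{name}: {x.mean():.3f} ± {x.std(ddof=1):.3f} (n={len(x)})"

# Welch's t statistic for unequal variances, computed by hand.
se = np.sqrt(ours.var(ddof=1) / len(ours) + baseline.var(ddof=1) / len(baseline))
t_stat = (ours.mean() - baseline.mean()) / se

print(report("ours", ours))
print(report("baseline", baseline))
print(f"Welch t = {t_stat:.2f}")  # large t => gap unlikely due to seed noise
```

With only 5 seeds the t statistic should be interpreted with the appropriate Welch-Satterthwaite degrees of freedom; a permutation test over seeds is a robust alternative at this sample size.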
-
Referee: [§3.2] §3.2 (architecture) and §4.2 (ablations): no ablation isolates the contribution of the frozen pre-trained VLM backbone versus the added policy head and fine-tuning; without this, the claim that general VLM understanding transfers to sequential policies cannot be distinguished from gains due to the head alone.
Authors: Thank you for highlighting this important distinction. While our ablations in §4.2 explore variations in the policy head and fine-tuning, they do not fully isolate the VLM backbone's role. We will revise §4.2 to include a new ablation study with the following variants: (i) full model with fine-tuned VLM and policy head, (ii) frozen VLM with fine-tuned policy head, and (iii) policy head trained without VLM features (using raw image inputs). This will clearly demonstrate the transfer of general VLM understanding to the robot policy beyond the contributions of the policy head alone. revision: yes
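The three variants form a small factorial grid over two axes: whether the VLM is fine-tuned, and whether its features are used at all. A minimal configuration sketch (the flag names are illustrative, not the paper's actual config keys):

```python
# Hypothetical ablation grid matching the three proposed variants.
ABLATIONS = {
    "full":       {"finetune_vlm": True,  "use_vlm_features": True},   # (i)
    "frozen_vlm": {"finetune_vlm": False, "use_vlm_features": True},   # (ii)
    "head_only":  {"finetune_vlm": False, "use_vlm_features": False},  # (iii) raw images
}

def trainable_components(cfg):
    # The policy head is trained in every variant; the backbone only
    # when fine-tuning is enabled.
    parts = ["policy_head"]
    if cfg["finetune_vlm"]:
        parts.append("vlm_backbone")
    return parts

for name, cfg in ABLATIONS.items():
    print(name, "->", trainable_components(cfg))
```

Comparing (ii) against (iii) isolates the value of pre-trained VLM features; comparing (i) against (ii) isolates the value of adapting those features, which together address the referee's attribution concern.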
-
Referee: [§5] §5 (discussion and generalization): the manuscript contains no out-of-distribution or domain-shift experiments, leaving untested the assumption that modest imitation learning on existing datasets avoids catastrophic forgetting or distribution mismatch when deploying the adapted VLM.
Authors: We recognize the value of OOD and domain-shift experiments for validating generalization claims. The primary focus of this work is to show that VLMs can be effectively adapted as robot imitators with minimal fine-tuning on standard datasets. In the revised manuscript, we will add experiments in §5 testing on held-out tasks or slight domain variations (e.g., different object appearances or camera angles) from the training distribution to assess robustness to distribution mismatch. We will also discuss the potential for catastrophic forgetting and how our light fine-tuning strategy mitigates it. However, exhaustive OOD testing across entirely new environments is beyond the scope of this paper and left for future work. revision: partial
Circularity Check
No significant circularity in empirical fine-tuning and benchmark evaluation
Full rationale
The paper presents RoboFlamingo as an empirical adaptation of pre-trained OpenFlamingo VLMs via modest imitation learning on language-conditioned manipulation datasets, with performance measured directly on standard robotics benchmarks. No load-bearing step reduces, through the paper's own equations or self-citations, to a fitted parameter, self-defined quantity, or prior result by the same authors; the architecture (VLM for single-step comprehension plus an explicit policy head for history) and the reported SOTA margins are outcomes of training and testing rather than tautological constructions. The derivation chain is self-contained and validated against external benchmarks.
Forward citations
Cited by 20 Pith papers
-
One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy
Reducing visual input to one token per frame in VLA world models maintains or improves long-horizon performance on MetaWorld, LIBERO, and real-robot tasks.
-
UniLACT: Depth-Aware RGB Latent Action Learning for Vision-Language-Action Models
UniLACT improves VLA models by adding depth-aware unified latent action pretraining that outperforms RGB-only baselines on seen and unseen manipulation tasks.
-
Large Video Planner Enables Generalizable Robot Control
A video foundation model trained on human demonstrations generates zero-shot plans that convert to executable robot actions on novel scenes and tasks.
-
One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy
Reducing visual input to one token per frame in world models for vision-language-action policies maintains long-horizon performance while improving success rates on MetaWorld, LIBERO, and real-robot tasks.
-
One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy
Reducing visual input to one token per frame via adaptive attention pooling and a unified flow-matching objective improves long-horizon performance in VLA policies on MetaWorld, LIBERO, and real-robot tasks.
-
Temporal Difference Calibration in Sequential Tasks: Application to Vision-Language-Action Models
Temporal difference calibration aligns uncertainty estimates in vision-language-action models with their value functions for better sequential performance.
-
PALM: Progress-Aware Policy Learning via Affordance Reasoning for Long-Horizon Robotic Manipulation
PALM improves long-horizon robotic manipulation success by distilling affordance representations for object interaction and predicting within-subtask progress in a VLA model.
-
GR00T N1: An Open Foundation Model for Generalist Humanoid Robots
GR00T N1 is a new open VLA foundation model for humanoid robots that outperforms imitation learning baselines in simulation and shows strong performance on real-world bimanual manipulation tasks.
-
HybridVLA: Collaborative Diffusion and Autoregression in a Unified Vision-Language-Action Model
HybridVLA unifies diffusion and autoregression in a single VLA model via collaborative training and ensemble to raise robot manipulation success rates by 14% in simulation and 19% in real-world tasks.
-
Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success
OpenVLA-OFT fine-tuning boosts LIBERO success rate from 76.5% to 97.1%, speeds action generation 26x, and outperforms baselines on real bimanual dexterous tasks.
-
Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations
Video Prediction Policy conditions robot action learning on future-frame predictions inside fine-tuned video diffusion models, yielding 18.6% relative gains on Calvin ABC-D and 31.6% higher real-world success rates.
-
CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation
CogACT is a new VLA model that uses a conditioned diffusion action transformer to achieve over 35% higher average success rates than OpenVLA in simulation and 55% in real-robot experiments while generalizing to new ro...
-
GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation
GR-2 pre-trains on web-scale videos then fine-tunes on robot data to reach 97.7% average success across over 100 manipulation tasks with strong generalization to new scenes and objects.
-
OpenVLA: An Open-Source Vision-Language-Action Model
OpenVLA achieves 16.5% higher task success than the 55B RT-2-X model across 29 tasks with 7x fewer parameters while enabling effective fine-tuning and quantization without performance loss.
-
Nautilus: From One Prompt to Plug-and-Play Robot Learning
NAUTILUS is a prompt-driven harness that automates plug-and-play adapters, typed contracts, and validation for policies, benchmarks, and robots in learning research.
-
R3D: Revisiting 3D Policy Learning
A transformer 3D encoder plus diffusion decoder architecture, with 3D-specific augmentations, outperforms prior 3D policy methods on manipulation benchmarks by improving training stability.
-
CoEnv: Driving Embodied Multi-Agent Collaboration via Compositional Environment
CoEnv introduces a compositional environment that integrates real and simulated spaces for multi-agent robotic collaboration, using real-to-sim reconstruction, VLM action synthesis, and validated sim-to-real transfer ...
-
WorldVLA: Towards Autoregressive Action World Model
WorldVLA unifies VLA and world models in one autoregressive system, shows they boost each other, and adds an attention mask to stop error buildup when generating action chunks.
-
World Action Models: The Next Frontier in Embodied AI
The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.
-
XEmbodied: A Foundation Model with Enhanced Geometric and Physical Cues for Large-Scale Embodied Environments
XEmbodied is a foundation model that integrates 3D geometric and physical signals into VLMs using a 3D Adapter and Efficient Image-Embodied Adapter, plus progressive curriculum and RL post-training, to improve spatial...
Reference graph
Works this paper leans on
-
[1]
Do As I Can, Not As I Say: Grounding Language in Robotic Affordances
Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron David, Chelsea Finn, Chuyuan Fu, Keerthana Gopalakrishnan, Karol Hausman, et al. Do as i can, not as i say: Grounding language in robotic affordances. arXiv preprint arXiv:2204.01691,
-
[2]
OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models
Anas Awadalla, Irena Gao, Josh Gardner, Jack Hessel, Yusuf Hanafy, Wanrong Zhu, Kalyani Marathe, Yonatan Bitton, Samir Gadre, Shiori Sagawa, et al. Openflamingo: An open-source framework for training large autoregressive vision-language models. arXiv preprint arXiv:2308.01390,
-
[3]
GPT-NeoX-20B: An Open-Source Autoregressive Language Model
URL: https://arxiv.org/abs/2204.06745
RT-1: Robotics Transformer for Real-World Control at Scale
Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, et al. Rt-1: Robotics transformer for real-world control at scale. arXiv preprint arXiv:2212.06817
-
[4]
RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control
Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. arXiv preprint arXiv:2307.15818,
-
[5]
Language models are few-shot learners
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems , 33:1877–1901,
-
[6]
Universal Sentence Encoder
Daniel Cer, Yinfei Yang, Sheng-yi Kong, Nan Hua, Nicole Limtiaco, Rhomni St John, Noah Constant, Mario Guajardo-Cespedes, Steve Yuan, Chris Tar, et al. Universal sentence encoder. arXiv preprint arXiv:1803.11175
-
[7]
Pali-x: On scaling up a multilingual vision and language model
Xi Chen, Josip Djolonga, Piotr Padlewski, Basil Mustafa, Soravit Changpinyo, Jialin Wu, Car- los Riquelme Ruiz, Sebastian Goodman, Xiao Wang, Yi Tay, et al. Pali-x: On scaling up a multilingual vision and language model. arXiv preprint arXiv:2305.18565,
-
[8]
PaLM: Scaling Language Modeling with Pathways
Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311,
-
[9]
PaLM-E: An Embodied Multimodal Language Model
Danny Driess, Fei Xia, Mehdi SM Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, et al. Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378
-
[10]
Language-driven representation learning for robotics
Siddharth Karamcheti, Suraj Nair, Annie S Chen, Thomas Kollar, Chelsea Finn, Dorsa Sadigh, and Percy Liang. Language-driven representation learning for robotics. arXiv preprint arXiv:2302.12766,
-
[11]
M3IT: A Large-Scale Dataset Towards Multi-Modal Multilingual Instruction Tuning
Lei Li, Yuwei Yin, Shicheng Li, Liang Chen, Peiyi Wang, Shuhuai Ren, Mukai Li, Yazheng Yang, Jingjing Xu, Xu Sun, et al. M3IT: A large-scale dataset towards multi-modal multilingual instruction tuning. arXiv preprint arXiv:2306.04387
-
[12]
Robotic indoor scene captioning from streaming video
Xinghang Li, Di Guo, Huaping Liu, and Fuchun Sun. Robotic indoor scene captioning from streaming video. In 2021 IEEE International Conference on Robotics and Automation (ICRA), pp. 6109–6115. IEEE,
-
[13]
Energy-based imitation learning
Minghuan Liu, Tairan He, Minkai Xu, and Weinan Zhang. Energy-based imitation learning. arXiv preprint arXiv:2004.09395,
-
[14]
Goal-conditioned reinforcement learning: Problems and solutions
Minghuan Liu, Menghui Zhu, and Weinan Zhang. Goal-conditioned reinforcement learning: Problems and solutions. arXiv preprint arXiv:2201.08299, 2022a.
Xinzhu Liu, Xinghang Li, Di Guo, Sinan Tan, Huaping Liu, and Fuchun Sun. Embodied multi-agent task planning from ambiguous instruction. Proceedings of robotics: science and systems, New York City, NY, USA, p...
-
[15]
What matters in language conditioned robotic imitation learning over unstructured data
Oier Mees, Lukas Hermann, and Wolfram Burgard. What matters in language conditioned robotic imitation learning over unstructured data. IEEE Robotics and Automation Letters (RA-L), 7(4): 11205–11212, 2022a.
Oier Mees, Lukas Hermann, Erick Rosete-Beas, and Wolfram Burgard. Calvin: A benchmark for language-conditioned policy learning for long-horizon robot ...
-
[16]
R3M: A Universal Visual Representation for Robot Manipulation
Suraj Nair, Aravind Rajeswaran, Vikash Kumar, Chelsea Finn, and Abhinav Gupta. R3m: A universal visual representation for robot manipulation. arXiv preprint arXiv:2203.12601,
-
[17]
Open X-Embodiment: Robotic Learning Datasets and RT-X Models
Abhishek Padalkar, Acorn Pooley, Ajinkya Jain, Alex Bewley, Alex Herzog, Alex Irpan, Alexander Khazatsky, Anant Rai, Anikait Singh, Anthony Brohan, et al. Open x-embodiment: Robotic learning datasets and rt-x models. arXiv preprint arXiv:2310.08864,
-
[18]
Instruction Tuning with GPT-4
Baolin Peng, Chunyuan Li, Pengcheng He, Michel Galley, and Jianfeng Gao. Instruction tuning with gpt-4. arXiv preprint arXiv:2304.03277
-
[19]
Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks
Nils Reimers and Iryna Gurevych. Sentence-bert: Sentence embeddings using siamese bert-networks. arXiv preprint arXiv:1908.10084
-
[20]
LLaMA: Open and Efficient Foundation Language Models
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971,
-
[21]
EfficientVLM: Fast and Accurate Vision-Language Models via Knowledge Distillation and Modal-Adaptive Pruning
Tiannan Wang, Wangchunshu Zhou, Yan Zeng, and Xinsong Zhang. Efficientvlm: Fast and accurate vision-language models via knowledge distillation and modal-adaptive pruning. arXiv preprint arXiv:2210.07795
-
[22]
X2-VLM: All-in-One Pre-trained Model for Vision-Language Tasks
Yan Zeng, Xinsong Zhang, Hang Li, Jiawei Wang, Jipeng Zhang, and Wangchunshu Zhou. X2-vlm: All-in-one pre-trained model for vision-language tasks. arXiv preprint arXiv:2211.12402
-
[23]
Deep imitation learning for complex manipulation tasks from virtual reality teleoperation
Tianhao Zhang, Zoe McCarthy, Owen Jow, Dennis Lee, Xi Chen, Ken Goldberg, and Pieter Abbeel. Deep imitation learning for complex manipulation tasks from virtual reality teleoperation. In 2018 IEEE International Conference on Robotics and Automation (ICRA) , pp. 5628–5635. IEEE,