SlotVLA: Towards Modeling of Object-Relation Representations in Robotic Manipulation

Anh Nguyen; Anthony Gunderman; Chase Rainwater; Duy Nguyen Ho Minh; Huy Le; Kashu Yamazaki; Khoa Vo; Ngan Le; Nhat Chung; Taisei Hanyu

arXiv: 2511.06754 · v3 · submitted 2025-11-10 · cs.RO · cs.CV

Taisei Hanyu, Nhat Chung, Huy Le, Toan Nguyen, Yuki Ikebe, Anthony Gunderman, Duy Nguyen Ho Minh, Khoa Vo, Tung Kieu, Kashu Yamazaki, Chase Rainwater, Anh Nguyen, Ngan Le

Pith reviewed 2026-05-18 00:19 UTC · model grok-4.3

classification: cs.RO, cs.CV

keywords: robotic manipulation, object-centric representations, slot attention, object relations, multitask learning, visuomotor policies, LIBERO benchmark

The pith

Object-centric slot and object-relation representations reduce visual tokens in robotic manipulation while maintaining competitive generalization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates whether representations built around discrete objects and their relationships can support effective multitask robotic manipulation. It creates the LIBERO+ benchmark with object-level annotations including boxes, masks, and tracking, then introduces SlotVLA, which processes scenes through slot attention to isolate objects, decode relations, and feed the results to a language model for action generation. This contrasts with conventional dense visual embeddings that mix foreground and background information. A sympathetic reader would care because the method promises greater efficiency, consistency over time, and easier interpretation of what the robot is attending to during tasks.

Core claim

SlotVLA employs a slot-based visual tokenizer to produce consistent temporal object representations, a relation-centric decoder to generate task-relevant embeddings, and an LLM-driven module to convert those embeddings into actions. On the LIBERO+ benchmark, object-centric slot and object-relation slot representations achieve drastic reductions in the number of required visual tokens while delivering competitive generalization across manipulation tasks.

What carries the argument

Slot-attention framework that maintains consistent object slots across time, decodes object relations, and routes the resulting embeddings through an LLM to produce actions.
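To make the slot mechanism concrete, here is a minimal NumPy sketch of the generic slot-attention update, in which slots compete for patch features and each slot becomes a weighted mean of the patches it wins. It is not the authors' implementation: the random projection matrices stand in for learned weights, the GRU and MLP refinement of the original slot-attention formulation is replaced by a direct assignment, and the names `slot_attention`, `num_slots`, and `iters` are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def slot_attention(features, num_slots=16, iters=3, seed=0):
    """features: (N, D) patch features -> (num_slots, D) slots and (N, num_slots) attention."""
    rng = np.random.default_rng(seed)
    d = features.shape[1]
    # Assumption: random projections stand in for learned query/key/value weights.
    w_q, w_k, w_v = (rng.normal(scale=d ** -0.5, size=(d, d)) for _ in range(3))
    slots = rng.normal(size=(num_slots, d))
    k, v = features @ w_k, features @ w_v
    for _ in range(iters):
        q = slots @ w_q
        # Softmax over the slot axis: slots compete to explain each patch.
        attn = softmax(k @ q.T / np.sqrt(d), axis=1)
        # Normalize over patches so each slot is a weighted mean of its patches.
        weights = attn / (attn.sum(axis=0, keepdims=True) + 1e-8)
        # Simplification: direct assignment instead of a GRU + MLP update.
        slots = weights.T @ v
    return slots, attn

features = np.random.default_rng(1).normal(size=(256, 32))  # hypothetical 16x16 patch grid
slots, attn = slot_attention(features)
print(slots.shape, attn.shape)  # (16, 32) (256, 16)
```

In this sketch the downstream model would consume the 16 slot vectors rather than the 256 patch features, which is the source of the token reduction the claim refers to.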

If this is right

Object-relation slots enable more compact and temporally consistent visual processing for visuomotor policies.
The same representations support competitive performance on a wide range of manipulation tasks without dense pixel-level embeddings.
LIBERO+ annotations make it possible to measure how well a model captures specific object relationships during execution.
Fewer visual tokens lower the computational load of the perception stage while preserving task success.
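As a rough illustration of the last point, self-attention compares every token with every other token, so its cost grows quadratically with sequence length. The sketch below compares a dense visual sequence with a slot sequence under that rule. The dense token count and the text token count are hypothetical; the slot count of 16 is taken only from the slot labels in the Figure 3 caption, and the number of tokens actually passed to the language model may differ.

```python
def attention_cost_ratio(dense_tokens, slot_tokens, text_tokens=0):
    """Ratio of pairwise self-attention interactions: dense sequence vs. slot sequence."""
    dense_len = dense_tokens + text_tokens
    slot_len = slot_tokens + text_tokens
    return (dense_len ** 2) / (slot_len ** 2)

DENSE_TOKENS = 256  # hypothetical: a 16x16 patch grid from a dense encoder
SLOT_TOKENS = 16    # slot labels visible in the Figure 3 caption (slot 0 to slot 15)

visual_only = attention_cost_ratio(DENSE_TOKENS, SLOT_TOKENS)
with_text = attention_cost_ratio(DENSE_TOKENS, SLOT_TOKENS, text_tokens=32)
print(visual_only, with_text)  # 256.0 36.0
```

The second figure shows why the end-to-end saving is smaller than the visual-only ratio suggests: instruction tokens are shared by both variants and dilute the gain.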

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same slot-plus-relation structure could be tested in settings with moving cameras or partial occlusions to check whether temporal consistency survives real-world noise.
If the relation decoder proves reliable, similar object-relation tokens might improve sample efficiency in reinforcement learning for manipulation.
Reducing token count opens a path to running capable manipulation policies on lower-power hardware without sacrificing breadth of tasks.

Load-bearing premise

A slot-attention model with a relation decoder and language-model action head can extract the object relations that actually matter for completing manipulation tasks from raw visual input.

What would settle it

An experiment on LIBERO+ or a comparable suite of tasks in which the slot-based model requires at least as many tokens as a dense baseline, or shows measurably lower success rates on held-out task variations.

Figures

Figures reproduced from arXiv: 2511.06754 by Anh Nguyen, Anthony Gunderman, Chase Rainwater, Duy Nguyen Ho Minh, Huy Le, Kashu Yamazaki, Khoa Vo, Ngan Le, Nhat Chung, Taisei Hanyu, Toan Nguyen, Tung Kieu, Yuki Ikebe.

**Figure 2.** Overview of the LIBERO+ dataset. … providing structured object–relation annotations to support fine-grained reasoning. For temporally consistent action supervision, we retain LIBERO’s native action labels but introduce a filtering step to remove redundant no-op actions. This refinement reduces idle-frame redundancy and sharpens the alignment between annotated objects and action-relevant dynamics. As a resu…
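The caption mentions a filtering step that removes redundant no-op actions but does not give the rule. The sketch below shows one plausible rule under stated assumptions: each action is a sequence whose last entry is a gripper command, a step counts as a no-op when no arm dimension exceeds a small threshold and the gripper command is unchanged, and the first step is always kept. The function name `filter_noop_steps` and the threshold `eps` are hypothetical and not taken from the paper.

```python
def filter_noop_steps(actions, eps=1e-3):
    """Return the indices of steps to keep.

    Assumed layout: each action is (arm_dims..., gripper_command).
    """
    keep = []
    prev_gripper = None
    for t, action in enumerate(actions):
        *arm, gripper = action
        moved = any(abs(x) > eps for x in arm)
        # The first step has no predecessor, so it always counts as a change.
        gripper_changed = prev_gripper is None or gripper != prev_gripper
        if moved or gripper_changed:
            keep.append(t)
        prev_gripper = gripper
    return keep

demo = [
    [0.0, 0.0, 0.0, -1.0],  # first step: kept
    [0.0, 0.0, 0.0, -1.0],  # idle frame: dropped
    [0.1, 0.0, 0.0, -1.0],  # arm moves: kept
    [0.0, 0.0, 0.0, 1.0],   # gripper toggles: kept
    [0.0, 0.0, 0.0, 1.0],   # idle frame: dropped
]
print(filter_noop_steps(demo))  # [0, 2, 3]
```

Applying the returned indices to the image frames and annotations as well as the actions keeps the three streams aligned after filtering.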

**Figure 3.** Overall framework of our proposed model. Stage-1 trains the Task-aware Object-Centric Encoder with slot attention and task-aware filtering. Stage-2 freezes Stage-1 parameters and introduces the Relation-Centric Encoder, enabling relational reasoning for final action decoding. (Panel labels: Image; slot 0 through slot 15.)

**Figure 5.** Trajectory demonstration in simulation from exocentric views. Task query: “Put the bowl on the stove”.

TABLE V: Ablation study on the effect of temporal consistency.

| Method | Temporal consistency ✗ | Temporal consistency ✓ |
|---|---|---|
| OC | 0.38 | 0.77 |
| ORC | 0.40 | 0.86 |

… focusing on task-relevant objects and gripper positions. OC, however, struggles with changing layouts and many objects (L-Spatial, L-Long), failing especially when filtered to only fou…

Original abstract

Inspired by how humans reason over discrete objects and their relationships, we explore whether compact object-centric and object-relation representations can form a foundation for multitask robotic manipulation. Most existing robotic multitask models rely on dense embeddings that entangle both object and background cues, raising concerns about both efficiency and interpretability. In contrast, we study object-relation-centric representations as a pathway to more structured, efficient, and explainable visuomotor control. Our contributions are two-fold. First, we introduce LIBERO+, a fine-grained benchmark dataset designed to enable and evaluate object-relation reasoning in robotic manipulation. Unlike prior datasets, LIBERO+ provides object-centric annotations that enrich demonstrations with box- and mask-level labels as well as instance-level temporal tracking, supporting compact and interpretable visuomotor representations. Second, we propose SlotVLA, a slot-attention-based framework that captures both objects and their relations for action decoding. It uses a slot-based visual tokenizer to maintain consistent temporal object representations, a relation-centric decoder to produce task-relevant embeddings, and an LLM-driven module that translates these embeddings into executable actions. Experiments on LIBERO+ demonstrate that object-centric slot and object-relation slot representations drastically reduce the number of required visual tokens, while providing competitive generalization. Together, LIBERO+ and SlotVLA provide a compact, interpretable, and effective foundation for advancing object-relation-centric robotic manipulation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

A private letter to a colleague.

SlotVLA adds object annotations to LIBERO and uses slot attention plus a relation decoder with an LLM for more compact robot policies, but the efficiency claims lack numbers and isolating ablations.

The main thing to know is that this paper introduces LIBERO+ with added object-centric annotations and proposes SlotVLA, which uses slot attention for consistent object tracking, a relation decoder for task-relevant embeddings, and an LLM to generate actions from those embeddings. It does well in highlighting the problems with entangled dense embeddings in current multitask robotic models and in proposing a more structured approach inspired by human object-relation reasoning. The integration of temporal consistency in slots and the use of an LLM for action translation are a reasonable way to build on existing slot attention techniques.

On the downside, the evaluation is light on details. The abstract mentions positive results with reduced visual tokens and good generalization but gives no actual figures or comparisons. The stress-test point is on target: since LIBERO+ supplies direct object annotations, an ablation removing the relation-centric decoder would be needed to show that the relation modeling is responsible for the benefits rather than the annotations alone. There is also no mention of any analysis to confirm that the slots encode relations beyond simple co-occurrence.

This paper would be useful for researchers in computer vision and robotics who are working on object-centric methods for manipulation tasks. Someone looking to improve efficiency and interpretability in their policies could get value from the benchmark and the model design. Overall, it deserves to go through peer review so the experimental support can be examined closely and any gaps in the ablations addressed.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces LIBERO+, an extended benchmark for robotic manipulation that augments demonstrations with object-centric annotations including bounding boxes, masks, and instance-level temporal tracking. It proposes SlotVLA, a slot-attention architecture comprising a slot-based visual tokenizer for consistent object representations across time, a relation-centric decoder to produce task-relevant embeddings, and an LLM-driven module to map these embeddings to actions. The central empirical claim is that object-centric and object-relation slot representations drastically reduce the number of visual tokens while delivering competitive generalization on multitask manipulation in LIBERO+.

Significance. If the token-reduction and generalization results are shown to stem specifically from explicit relation modeling rather than the supplied object annotations, the work would offer a concrete route toward more efficient and interpretable visuomotor policies. The provision of LIBERO+ with its fine-grained labels would also supply a useful testbed for future object-relation research in robotics.

major comments (2)

[Abstract / Experiments] Abstract and Experiments section: the headline claim that 'object-relation slot representations drastically reduce the number of required visual tokens' and yield 'competitive generalization' is not isolated from the object-centric supervision already present in LIBERO+ (boxes, masks, tracking). No ablation is described that removes the relation-centric decoder while retaining the same slot tokenizer and LLM action head; without this control it remains possible that observed benefits arise primarily from the slot structure and provided annotations rather than explicit relation extraction.
[Method] Method section (relation-centric decoder description): the decoder is presented as producing 'task-relevant embeddings,' yet no quantitative probe (e.g., relation classification accuracy on held-out pairs or attention-map analysis) is reported to verify that the resulting slots encode relational structure beyond simple co-occurrence or the supplied instance labels.

minor comments (2)

[Abstract] The abstract states positive results on LIBERO+ but supplies no numerical token counts, success rates, or baseline comparisons; these quantitative details should be added to the abstract or a results table for immediate verifiability.
[Method] Notation for the slot tokenizer and relation decoder could be clarified with a single diagram or equation block showing how object slots are transformed into relation slots before the LLM module.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback and positive assessment of the potential significance of LIBERO+ and SlotVLA. We address each major comment point by point below, with revisions incorporated where appropriate to strengthen the manuscript.

Referee: [Abstract / Experiments] Abstract and Experiments section: the headline claim that 'object-relation slot representations drastically reduce the number of required visual tokens' and yield 'competitive generalization' is not isolated from the object-centric supervision already present in LIBERO+ (boxes, masks, tracking). No ablation is described that removes the relation-centric decoder while retaining the same slot tokenizer and LLM action head; without this control it remains possible that observed benefits arise primarily from the slot structure and provided annotations rather than explicit relation extraction.

Authors: We agree that explicitly isolating the contribution of the relation-centric decoder strengthens the central claim. The original experiments compared object-centric slot variants against the full object-relation model, but did not include the precise control of removing only the decoder while fixing the tokenizer and LLM head. In the revised manuscript we have added this ablation. Results on LIBERO+ show that omitting the relation-centric decoder reduces generalization on interaction-heavy tasks, indicating that explicit relation extraction contributes beyond the supplied annotations and slot structure alone.

Revision: yes

Referee: [Method] Method section (relation-centric decoder description): the decoder is presented as producing 'task-relevant embeddings,' yet no quantitative probe (e.g., relation classification accuracy on held-out pairs or attention-map analysis) is reported to verify that the resulting slots encode relational structure beyond simple co-occurrence or the supplied instance labels.

Authors: We acknowledge the value of direct verification. The relation-centric decoder is architecturally designed to operate on slot pairs and model their interactions. To provide quantitative support, the revised manuscript now includes attention-map visualizations from the decoder on held-out sequences and a proxy relation-classification probe that measures accuracy in predicting spatial and functional relations between object pairs. These analyses show that the embeddings capture structured relations rather than mere co-occurrence or the instance labels alone.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical claims rest on architecture and benchmark evaluation

The paper introduces LIBERO+ as a benchmark with object-centric annotations and proposes SlotVLA as a slot-attention framework with relation-centric decoder and LLM action module. Central claims concern empirical outcomes on token reduction and generalization from experiments on this dataset. No equations, fitted parameters presented as predictions, self-citation load-bearing arguments, or uniqueness theorems appear in the provided text. The architecture description and results are independent of any self-referential reduction, making the work self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract-only review prevents detailed extraction of hyperparameters or assumptions; model likely uses standard deep-learning choices for slot attention and transformers.

axioms (1)

domain assumption: Slot attention maintains consistent temporal object representations across frames.
Invoked in the description of the slot-based visual tokenizer.
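The reviewed text does not say how slot identity is kept stable across frames, so the sketch below only illustrates what the assumption asks for: slot i in one frame should describe the same object as slot i in the next. It re-aligns the current frame's slots to the previous frame's by maximizing total cosine similarity with an exhaustive search over permutations. That search is exact but practical only for a handful of slots; a polynomial-time assignment solver such as the Hungarian method would be the usual choice at realistic slot counts. The names `cosine` and `match_slots` and the toy vectors are hypothetical.

```python
import math
from itertools import permutations

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def match_slots(prev_slots, curr_slots):
    """Return `order` such that curr_slots[order[i]] best matches prev_slots[i]."""
    n = len(prev_slots)
    sim = [[cosine(p, c) for c in curr_slots] for p in prev_slots]
    # Exhaustive search: exact, but factorial in the number of slots.
    return max(
        permutations(range(n)),
        key=lambda perm: sum(sim[i][perm[i]] for i in range(n)),
    )

prev = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
curr = [[0.1, 0.9], [0.9, 1.1], [1.0, 0.1]]  # same objects, shuffled and perturbed
order = match_slots(prev, curr)
aligned = [curr[j] for j in order]
print(order)  # (2, 0, 1)
```

If the matched similarities are low, the alignment is unreliable, which is one way the consistency assumption could fail in cluttered or heavily occluded scenes.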

pith-pipeline@v0.9.0 · 5592 in / 1182 out tokens · 51315 ms · 2026-05-18T00:19:11.925590+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel (relation to the cited theorem: unclear)
Paper passage: "SlotVLA: a slot-attention-based framework that captures both objects and their relations for action decoding... relation-centric decoder to produce task-relevant embeddings"

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction (relation to the cited theorem: unclear)
Paper passage: "Experiments on LIBERO+ demonstrate that object-centric slot and object-relation slot representations drastically reduce the number of required visual tokens"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

OA-WAM: Object-Addressable World Action Model for Robust Robot Manipulation
cs.RO · 2026-05 · unverdicted · novelty 7.0

OA-WAM uses persistent address vectors and dynamic content vectors in object slots to enable addressable world-action prediction, improving robustness on manipulation benchmarks under scene changes.

OFlow: Injecting Object-Aware Temporal Flow Matching for Robust Robotic Manipulation
cs.RO · 2026-04 · unverdicted · novelty 6.0

OFlow unifies temporal foresight and object-aware reasoning inside a shared latent space via flow matching to improve VLA robustness in robotic manipulation under distribution shifts.

[1] [1]

Openvla: An open-source vision- language-action model,

M. J. Kim, K. Pertschet al., “Openvla: An open-source vision- language-action model,” inCoRL, 2025

work page 2025

[2] [2]

$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

K. Black, N. Brown, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, L. Groom, K. Hausman, B. Ichteret al., “π 0: A vision- language-action flow model for general robot control,”arXiv preprint arXiv:2410.24164, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[3] [3]

Robotic control via embodied chain-of-thought reasoning,

M. Zawalski, W. Chen, K. Pertsch, O. Mees, C. Finn, and S. Levine, “Robotic control via embodied chain-of-thought reasoning,” inCoRL, 2024

work page 2024

[4] [4]

Scaling proprioceptive-visual learning with heterogeneous pre-trained transformers,

L. Wang, X. Chen, J. Zhao, and K. He, “Scaling proprioceptive-visual learning with heterogeneous pre-trained transformers,” inNeurIPS, 2024

work page 2024

[5] [5]

A Survey on Vision-Language-Action Models for Embodied AI

Y . Ma, Z. Song, Y . Zhuang, J. Hao, and I. King, “A survey on vision- language-action models for embodied ai,”CoRR, vol. abs/2405.14093, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[6] [6]

DINOv2: Learning robust visual features without supervision,

M. Oquab, T. Darcetet al., “DINOv2: Learning robust visual features without supervision,”TMLR, 2024

work page 2024

[7] [7]

Sigmoid loss for language image pre-training,

X. Zhai, B. Mustafa, A. Kolesnikov, and L. Beyer, “Sigmoid loss for language image pre-training,” inICCV, 2023

work page 2023

[8] [8]

Tokenize the world into object-level knowledge to address long-tail events in autonomous driving,

T. Tian, B. Li, X. Weng, Y . Chen, E. Schmerling, Y . Wang, B. Ivanovic, and M. Pavone, “Tokenize the world into object-level knowledge to address long-tail events in autonomous driving,” inCoRL, 2024

work page 2024

[9] [9]

Slot state space models,

J. Jiang, F. Deng, G. Singh, M. Lee, and S. Ahn, “Slot state space models,” inNeurIPS, 2024

work page 2024

[10] [10]

Action-slot: Visual action- centric representations for multi-label atomic activity recognition in traffic scenes,

C. Kung, S. Lu, Y . Tsai, and Y . Chen, “Action-slot: Visual action- centric representations for multi-label atomic activity recognition in traffic scenes,” inCVPR, 2024

work page 2024

[11] [11]

Viola: Imitation learning for vision-based manipulation with object proposal priors,

Y . Zhu, A. Joshi, P. Stone, and Y . Zhu, “Viola: Imitation learning for vision-based manipulation with object proposal priors,” inCoRL, 2023

work page 2023

[12] [12]

Composing pre-trained object-centric representations for robotics from

J. Shi, J. Qian, Y . J. Ma, and D. Jayaraman, “Composing pre-trained object-centric representations for robotics from ”what” and ”where” foundation models,” inICRA, 2024

work page 2024

[13] H. Le, N. Chung, T. Kieu, J. Yang, and N. Le, “Uno: Unifying one-stage video scene graph generation via object-centric visual representation learning,” in WACV, 2026

[14] F. Locatello, D. Weissenborn, T. Unterthiner, A. Mahendran, G. Heigold, J. Uszkoreit, A. Dosovitskiy, and T. Kipf, “Object-centric learning with slot attention,” in NeurIPS, 2020

[15] N. Heravi, A. Wahid, C. Lynch, P. Florence, T. Armstrong, J. Tompson, P. Sermanet, J. Bohg, and D. Dwibedi, “Visuomotor control in multi-object scenes using object-aware representations,” arXiv preprint arXiv:2205.06333, 2022

[16] W. Cai, Y. Ponomarenko, J. Yuan, X. Li, W. Yang, H. Dong, and B. Zhao, “Spatialbot: Precise spatial understanding with vision language models,” arXiv preprint arXiv:2406.13642, 2024

[17] B. Liu, Y. Zhu, C. Gao, Y. Feng, Q. Liu, Y. Zhu, and P. Stone, “Libero: Benchmarking knowledge transfer for lifelong robot learning,” in NeurIPS, 2023

[18] Y. Lin, A. Zeng, S. Song, P. Isola, and T. Lin, “Learning to see before learning to act: Visual pre-training for manipulation,” in ICRA, 2020

[19] C. Tran, D. MH Nguyen, M.-D. Nguyen, T. Nguyen, N. Le, P. Xie, D. Sonntag, J. Y. Zou, B. Nguyen, and M. Niepert, “Accelerating transformers with spectrum-preserving token merging,” in NeurIPS, 2025

[20] Y. Shang, M. Cai, B. Xu, Y. J. Lee, and Y. Yan, “Llava-prumerge: Adaptive token reduction for efficient large multimodal models,” arXiv preprint arXiv:2403.15388, 2024

[21] W. Li, Y. Yuan, J. Liu, D. Tang, S. Wang, J. Qin, J. Zhu, and L. Zhang, “Tokenpacker: Efficient visual projector for multimodal llm,” arXiv preprint arXiv:2407.02392, 2024

[22] J. Bai, S. Bai, Y. Chu, Z. Cui, K. Dang, X. Deng, Y. Fan, W. Ge, Y. Han, F. Huang et al., “Qwen technical report,” arXiv preprint arXiv:2309.16609, 2023

[23] W. Hu, Z.-Y. Dou, L. H. Li, A. Kamath, N. Peng, and K.-W. Chang, “Matryoshka query transformer for large vision-language models,” in NeurIPS, 2024

[24] J. Li, D. Li, S. Savarese, and S. Hoi, “Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models,” in ICML, 2023

[25] R. Qian, S. Ding, and D. Lin, “Rethinking image-to-video adaptation: An object-centric perspective,” CoRR, vol. abs/2407.06871, 2024

[26] K. Vo, T. Phan, K. Yamazaki, M. Tran, and N. Le, “Henasy: Learning to assemble scene-entities for interpretable egocentric video-language model,” in NeurIPS, 2025

[27] C. Devin, P. Abbeel, T. Darrell, and S. Levine, “Deep object-centric representations for generalizable robot learning,” in ICRA, 2018

[28] S. Tyree, J. Tremblay, T. To, J. Cheng, T. Mosier, J. Smith, and S. Birchfield, “6-dof pose estimation of household objects for robotic manipulation: An accessible dataset and benchmark,” in IROS, 2022

[29] P. Li, Y. Wu, Z. Xi, W. Li, Y. Huang, Z. Zhang, Y. Chen, J. Wang, S.-C. Zhu, T. Liu et al., “Controlvla: Few-shot object-centric adaptation for pre-trained vision-language-action models,” arXiv preprint arXiv:2506.16211, 2025

[30] J. Wen, Y. Zhu, M. Zhu, J. Li, Z. Xu, Z. Che, C. Shen, Y. Peng, D. Liu, F. Feng et al., “Object-centric instruction augmentation for robotic manipulation,” in ICRA, 2024

[31] A. Agostini and D. Lee, “Efficient state abstraction using object-centered predicates for manipulation planning,” arXiv preprint arXiv:2007.08251, 2020

[32] W. Goodwin, S. Vaze, I. Havoutis, and I. Posner, “Semantically grounded object matching for robust robotic scene rearrangement,” in ICRA, 2022

[33] Y. Zhu, J. Wong, A. Mandlekar, and R. Martín-Martín, “robosuite: A modular simulation framework and benchmark for robot learning,” CoRR, vol. abs/2009.12293, 2020

[34] K. Cho, B. van Merrienboer et al., “Learning phrase representations using RNN encoder-decoder for statistical machine translation,” in EMNLP, 2014

[35] J. Xu, S. De Mello, S. Liu, W. Byeon, T. Breuel, J. Kautz, and X. Wang, “Groupvit: Semantic segmentation emerges from text supervision,” in CVPR, 2022

[36] B. Jia, Y. Liu, and S. Huang, “Improving object-centric learning with query optimization,” in ICLR, 2023

[37] L. H. Li, P. Zhang, H. Zhang, J. Yang, C. Li, Y. Zhong, L. Wang, L. Yuan, L. Zhang, J.-N. Hwang et al., “Grounded language-image pre-training,” in CVPR, 2022, pp. 10965–10975

[38] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen, “Lora: Low-rank adaptation of large language models,” in ICLR, 2022

[39] K. Pertsch, K. Stachowicz, B. Ichter, D. Driess, S. Nair, Q. Vuong, O. Mees, C. Finn, and S. Levine, “Fast: Efficient action tokenization for vision-language-action models,” arXiv preprint arXiv:2501.09747, 2025

[40] H. W. Kuhn, “The hungarian method for the assignment problem,” Naval Research Logistics Quarterly, 1955

[41] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, “End-to-end object detection with transformers,” in ECCV, 2020