TAIHRI is the first task-aware VLM for close-range HRI that localizes metric-scale 3D coordinates of critical keypoints by quantizing space and performing 2D keypoint reasoning via next-token prediction.
hub
Pix2seq: A language modeling framework for object detection
10 Pith papers cite this work. Polarity classification is still indexing.
hub tools
citation-role summary
citation-polarity summary
roles
background 1polarities
background 1representative citing papers
SAM 2++ unifies video tracking across mask, box, and point granularities via task-specific prompts, a unified decoder, task-adaptive memory, and a new multi-granularity dataset, reporting state-of-the-art results.
PaLI jointly scales a 4B-parameter vision transformer with language models on a new 10B multilingual image-text dataset to reach state-of-the-art results on vision-language tasks while keeping a simple modular design.
Moondream Segmentation achieves 80.2% cIoU on RefCOCO by autoregressively decoding paths from referring expressions and using RL to refine masks, plus releases a cleaned RefCOCO-M dataset.
Raster2Seq generates labeled polygon sequences autoregressively from floorplan images via an anchor-guided decoder, claiming state-of-the-art results on Structure3D, CubiCasa5K, Raster2Graph and generalization to WAFFLE.
Grounded SAM integrates Grounding DINO and SAM to support text-prompted open-world detection and segmentation, achieving 48.7 mean AP on SegInW zero-shot with the base detector and huge segmenter.
SeeClick improves visual GUI agents via GUI grounding pre-training on automatically curated data and introduces the ScreenSpot benchmark, with results indicating that stronger grounding boosts downstream task performance.
GPT-3.5 is turned into an autonomous-vehicle motion planner by representing driving scenes and trajectories as language tokens and applying a prompting-reasoning-finetuning pipeline, with results shown on nuScenes.
Kosmos-2 grounds text to image regions by encoding refer expressions as Markdown links to sequences of location tokens and trains on a new GrIT dataset of grounded image-text pairs.
TinyVLA achieves faster inference and higher data efficiency than OpenVLA on robotic manipulation tasks by initializing from high-speed multimodal models and adding a diffusion policy decoder, without any pre-training phase.
citing papers explorer
-
TinyVLA: Towards Fast, Data-Efficient Vision-Language-Action Models for Robotic Manipulation
TinyVLA achieves faster inference and higher data efficiency than OpenVLA on robotic manipulation tasks by initializing from high-speed multimodal models and adding a diffusion policy decoder, without any pre-training phase.