TAIHRI is the first task-aware VLM for close-range HRI that localizes metric-scale 3D coordinates of critical keypoints by quantizing space and performing 2D keypoint reasoning via next-token prediction.
hub
Pix2seq: A language modeling framework for object detection
10 Pith papers cite this work. Polarity classification is still indexing.
hub tools
citation-role summary
citation-polarity summary
roles
background 1polarities
background 1representative citing papers
SAM 2++ unifies video tracking across mask, box, and point granularities via task-specific prompts, a unified decoder, task-adaptive memory, and a new multi-granularity dataset, reporting state-of-the-art results.
PaLI jointly scales a 4B-parameter vision transformer with language models on a new 10B multilingual image-text dataset to reach state-of-the-art results on vision-language tasks while keeping a simple modular design.
Moondream Segmentation achieves 80.2% cIoU on RefCOCO by autoregressively decoding paths from referring expressions and using RL to refine masks, plus releases a cleaned RefCOCO-M dataset.
Raster2Seq generates labeled polygon sequences autoregressively from floorplan images via an anchor-guided decoder, claiming state-of-the-art results on Structure3D, CubiCasa5K, Raster2Graph and generalization to WAFFLE.
Grounded SAM integrates Grounding DINO and SAM to support text-prompted open-world detection and segmentation, achieving 48.7 mean AP on SegInW zero-shot with the base detector and huge segmenter.
SeeClick improves visual GUI agents via GUI grounding pre-training on automatically curated data and introduces the ScreenSpot benchmark, with results indicating that stronger grounding boosts downstream task performance.
GPT-3.5 is turned into an autonomous-vehicle motion planner by representing driving scenes and trajectories as language tokens and applying a prompting-reasoning-finetuning pipeline, with results shown on nuScenes.
Kosmos-2 grounds text to image regions by encoding refer expressions as Markdown links to sequences of location tokens and trains on a new GrIT dataset of grounded image-text pairs.
TinyVLA achieves faster inference and higher data efficiency than OpenVLA on robotic manipulation tasks by initializing from high-speed multimodal models and adding a diffusion policy decoder, without any pre-training phase.
citing papers explorer
-
TAIHRI: Task-Aware 3D Human Keypoints Localization for Close-Range Human-Robot Interaction
TAIHRI is the first task-aware VLM for close-range HRI that localizes metric-scale 3D coordinates of critical keypoints by quantizing space and performing 2D keypoint reasoning via next-token prediction.
-
SAM 2++: Tracking Anything at Any Granularity
SAM 2++ unifies video tracking across mask, box, and point granularities via task-specific prompts, a unified decoder, task-adaptive memory, and a new multi-granularity dataset, reporting state-of-the-art results.
-
PaLI: A Jointly-Scaled Multilingual Language-Image Model
PaLI jointly scales a 4B-parameter vision transformer with language models on a new 10B multilingual image-text dataset to reach state-of-the-art results on vision-language tasks while keeping a simple modular design.
-
Moondream Segmentation: From Words to Masks
Moondream Segmentation achieves 80.2% cIoU on RefCOCO by autoregressively decoding paths from referring expressions and using RL to refine masks, plus releases a cleaned RefCOCO-M dataset.
-
Raster2Seq: Polygon Sequence Generation for Floorplan Reconstruction
Raster2Seq generates labeled polygon sequences autoregressively from floorplan images via an anchor-guided decoder, claiming state-of-the-art results on Structure3D, CubiCasa5K, Raster2Graph and generalization to WAFFLE.
-
Grounded SAM: Assembling Open-World Models for Diverse Visual Tasks
Grounded SAM integrates Grounding DINO and SAM to support text-prompted open-world detection and segmentation, achieving 48.7 mean AP on SegInW zero-shot with the base detector and huge segmenter.
-
SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents
SeeClick improves visual GUI agents via GUI grounding pre-training on automatically curated data and introduces the ScreenSpot benchmark, with results indicating that stronger grounding boosts downstream task performance.
-
GPT-Driver: Learning to Drive with GPT
GPT-3.5 is turned into an autonomous-vehicle motion planner by representing driving scenes and trajectories as language tokens and applying a prompting-reasoning-finetuning pipeline, with results shown on nuScenes.
-
Kosmos-2: Grounding Multimodal Large Language Models to the World
Kosmos-2 grounds text to image regions by encoding refer expressions as Markdown links to sequences of location tokens and trains on a new GrIT dataset of grounded image-text pairs.
-
TinyVLA: Towards Fast, Data-Efficient Vision-Language-Action Models for Robotic Manipulation
TinyVLA achieves faster inference and higher data efficiency than OpenVLA on robotic manipulation tasks by initializing from high-speed multimodal models and adding a diffusion policy decoder, without any pre-training phase.