Pix2seq: A language modeling framework for object detection

Chen, T · 2021 · arXiv 2109.10852

16 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

DeepGaze3.5-VL: Modeling Scanpaths via Autoregressive Token Prediction

cs.CV · 2026-07-02 · unverdicted · novelty 7.0

DeepGaze3.5-VL treats visual scanpaths as discrete token sequences predicted autoregressively by vision-language models, achieving 2.18 bits IG on MIT1003 and outperforming prior specialized models even with matched encoders.

TAIHRI: Task-Aware 3D Human Keypoints Localization for Close-Range Human-Robot Interaction

cs.CV · 2026-04-10 · unverdicted · novelty 7.0

TAIHRI is the first task-aware VLM for close-range HRI that localizes metric-scale 3D coordinates of critical keypoints by quantizing space and performing 2D keypoint reasoning via next-token prediction.

SAM 2++: Tracking Anything at Any Granularity

cs.CV · 2025-10-21 · conditional · novelty 7.0

SAM 2++ unifies video tracking across mask, box, and point granularities via task-specific prompts, a unified decoder, task-adaptive memory, and a new multi-granularity dataset, reporting state-of-the-art results.

PaLI: A Jointly-Scaled Multilingual Language-Image Model

cs.CV · 2022-09-14 · conditional · novelty 7.0

PaLI jointly scales a 4B-parameter vision transformer with language models on a new 10B multilingual image-text dataset to reach state-of-the-art results on vision-language tasks while keeping a simple modular design.

PARCEL: Pool-Anchored Resampling with Conditioned Elastic Queries for Efficient Vision-Language Understanding

cs.CV · 2026-05-28 · unverdicted · novelty 6.0

PARCEL is a new visual tokenization architecture combining pool-anchored resampling with conditioned elastic queries to enhance performance-efficiency tradeoffs in LVLMs over prior matryoshka methods.

Binding Visual Features Point by Point

cs.CV · 2026-05-25 · unverdicted · novelty 6.0

Training VLMs to point via text induces serial processing that eliminates binding errors and enables compositional generalization on multi-object tasks.

SceneParser: Hierarchical Scene Parsing for Visual Semantics Understanding

cs.CV · 2026-05-14 · unverdicted · novelty 6.0

SceneParser introduces hierarchical scene parsing as object-part-affordance chains, a VLM trained with pseudo labels and curriculum learning, and SceneParser-Bench with 1.74M affordance annotations, showing better structure-aware results than existing MLLMs.

Moondream Segmentation: From Words to Masks

cs.CV · 2026-04-03 · unverdicted · novelty 6.0

Moondream Segmentation achieves 80.2% cIoU on RefCOCO by autoregressively decoding paths from referring expressions and using RL to refine masks, plus releases a cleaned RefCOCO-M dataset.

Raster2Seq: Polygon Sequence Generation for Floorplan Reconstruction

cs.CV · 2026-02-09 · unverdicted · novelty 6.0

Raster2Seq generates labeled polygon sequences autoregressively from floorplan images via an anchor-guided decoder, claiming state-of-the-art results on Structure3D, CubiCasa5K, Raster2Graph and generalization to WAFFLE.

Grounded SAM: Assembling Open-World Models for Diverse Visual Tasks

cs.CV · 2024-01-25 · unverdicted · novelty 6.0

Grounded SAM integrates Grounding DINO and SAM to support text-prompted open-world detection and segmentation, achieving 48.7 mean AP on the SegInW zero-shot benchmark with the Grounding DINO-Base detector and SAM-Huge segmenter.

SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents

cs.HC · 2024-01-17 · unverdicted · novelty 6.0

SeeClick improves visual GUI agents via GUI grounding pre-training on automatically curated data and introduces the ScreenSpot benchmark, with results indicating that stronger grounding boosts downstream task performance.

GPT-Driver: Learning to Drive with GPT

cs.CV · 2023-10-02 · conditional · novelty 6.0

GPT-3.5 is turned into an autonomous-vehicle motion planner by representing driving scenes and trajectories as language tokens and applying a prompting-reasoning-finetuning pipeline, with results shown on nuScenes.

Kosmos-2: Grounding Multimodal Large Language Models to the World

cs.CL · 2023-06-26 · unverdicted · novelty 6.0

Kosmos-2 grounds text to image regions by encoding referring expressions as Markdown links to sequences of location tokens, and is trained on GrIT, a new dataset of grounded image-text pairs.

CheXanatomy: Anatomy-Aware Vision-Language Modeling for Chest Radiographs

cs.CV · 2026-06-07 · unverdicted · novelty 5.0

CheXanatomy trains VLMs to generate 2D anatomical masks via next-token prediction on synthetic CXRs from CT, matching U-Net performance with better domain-shift robustness and sample efficiency.

TinyVLA: Towards Fast, Data-Efficient Vision-Language-Action Models for Robotic Manipulation

cs.RO · 2024-09-19 · unverdicted · novelty 4.0

TinyVLA achieves faster inference and higher data efficiency than OpenVLA on robotic manipulation tasks by initializing from high-speed multimodal models and adding a diffusion policy decoder, without any pre-training phase.

PatternGSL: A Structured Specification Language for Template-Free and Simulation-Ready 3D Garments

cs.CV · 2026-06-23 · 2 refs
