Pix2seq: A language modeling framework for object detection

10 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 1

citation-polarity summary

roles

background 1

polarities

background 1

representative citing papers

SAM 2++: Tracking Anything at Any Granularity

cs.CV · 2025-10-21 · conditional · novelty 7.0

SAM 2++ unifies video tracking across mask, box, and point granularities via task-specific prompts, a unified decoder, task-adaptive memory, and a new multi-granularity dataset, reporting state-of-the-art results.

PaLI: A Jointly-Scaled Multilingual Language-Image Model

cs.CV · 2022-09-14 · conditional · novelty 7.0

PaLI jointly scales a 4B-parameter vision transformer with language models on a new 10B multilingual image-text dataset to reach state-of-the-art results on vision-language tasks while keeping a simple modular design.

Moondream Segmentation: From Words to Masks

cs.CV · 2026-04-03 · unverdicted · novelty 6.0

Moondream Segmentation achieves 80.2% cIoU on RefCOCO by autoregressively decoding paths from referring expressions and refining masks with RL; it also releases a cleaned RefCOCO-M dataset.

Raster2Seq: Polygon Sequence Generation for Floorplan Reconstruction

cs.CV · 2026-02-09 · unverdicted · novelty 6.0

Raster2Seq generates labeled polygon sequences autoregressively from floorplan images via an anchor-guided decoder, claiming state-of-the-art results on Structure3D, CubiCasa5K, Raster2Graph and generalization to WAFFLE.

SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents

cs.HC · 2024-01-17 · unverdicted · novelty 6.0

SeeClick improves visual GUI agents via GUI grounding pre-training on automatically curated data and introduces the ScreenSpot benchmark, with results indicating that stronger grounding boosts downstream task performance.

GPT-Driver: Learning to Drive with GPT

cs.CV · 2023-10-02 · conditional · novelty 6.0

GPT-3.5 is turned into an autonomous-vehicle motion planner by representing driving scenes and trajectories as language tokens and applying a prompting-reasoning-finetuning pipeline, with results shown on nuScenes.

citing papers explorer

Showing 10 of 10 citing papers.

  • TAIHRI: Task-Aware 3D Human Keypoints Localization for Close-Range Human-Robot Interaction cs.CV · 2026-04-10 · unverdicted · none · ref 5

    TAIHRI is the first task-aware VLM for close-range HRI that localizes metric-scale 3D coordinates of critical keypoints by quantizing space and performing 2D keypoint reasoning via next-token prediction.

  • SAM 2++: Tracking Anything at Any Granularity cs.CV · 2025-10-21 · conditional · none · ref 8

    SAM 2++ unifies video tracking across mask, box, and point granularities via task-specific prompts, a unified decoder, task-adaptive memory, and a new multi-granularity dataset, reporting state-of-the-art results.

  • PaLI: A Jointly-Scaled Multilingual Language-Image Model cs.CV · 2022-09-14 · conditional · none · ref 126

    PaLI jointly scales a 4B-parameter vision transformer with language models on a new 10B multilingual image-text dataset to reach state-of-the-art results on vision-language tasks while keeping a simple modular design.

  • Moondream Segmentation: From Words to Masks cs.CV · 2026-04-03 · unverdicted · none · ref 3

    Moondream Segmentation achieves 80.2% cIoU on RefCOCO by autoregressively decoding paths from referring expressions and refining masks with RL; it also releases a cleaned RefCOCO-M dataset.

  • Raster2Seq: Polygon Sequence Generation for Floorplan Reconstruction cs.CV · 2026-02-09 · unverdicted · none · ref 8

    Raster2Seq generates labeled polygon sequences autoregressively from floorplan images via an anchor-guided decoder, claiming state-of-the-art results on Structure3D, CubiCasa5K, Raster2Graph and generalization to WAFFLE.

  • Grounded SAM: Assembling Open-World Models for Diverse Visual Tasks cs.CV · 2024-01-25 · unverdicted · none · ref 7

    Grounded SAM integrates Grounding DINO and SAM to support text-prompted open-world detection and segmentation, achieving 48.7 mean AP on the SegInW zero-shot benchmark with the Grounding DINO-Base detector and the SAM-Huge segmenter.

  • SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents cs.HC · 2024-01-17 · unverdicted · none · ref 72

    SeeClick improves visual GUI agents via GUI grounding pre-training on automatically curated data and introduces the ScreenSpot benchmark, with results indicating that stronger grounding boosts downstream task performance.

  • GPT-Driver: Learning to Drive with GPT cs.CV · 2023-10-02 · conditional · none · ref 5

    GPT-3.5 is turned into an autonomous-vehicle motion planner by representing driving scenes and trajectories as language tokens and applying a prompting-reasoning-finetuning pipeline, with results shown on nuScenes.

  • Kosmos-2: Grounding Multimodal Large Language Models to the World cs.CL · 2023-06-26 · unverdicted · none · ref 3

    Kosmos-2 grounds text to image regions by encoding referring expressions as Markdown links to sequences of location tokens, and is trained on GrIT, a new dataset of grounded image-text pairs.

  • TinyVLA: Towards Fast, Data-Efficient Vision-Language-Action Models for Robotic Manipulation cs.RO · 2024-09-19 · unverdicted · none · ref 34

    TinyVLA achieves faster inference and higher data efficiency than OpenVLA on robotic manipulation tasks by initializing from high-speed multimodal models and adding a diffusion policy decoder, without any pre-training phase.