pith. sign in

hub Mixed citations

PaliGemma: A versatile 3B VLM for transfer

Mixed citation behavior. Most common role is background (59%).

99 Pith papers citing it
Background 59% of classified citations
abstract

PaliGemma is an open Vision-Language Model (VLM) that is based on the SigLIP-So400m vision encoder and the Gemma-2B language model. It is trained to be a versatile and broadly knowledgeable base model that is effective to transfer. It achieves strong performance on a wide variety of open-world tasks. We evaluate PaliGemma on almost 40 diverse tasks including standard VLM benchmarks, but also more specialized tasks such as remote-sensing and segmentation.

hub tools

citation-role summary

background 19 method 6 baseline 5 dataset 2

citation-polarity summary

claims ledger

  • abstract PaliGemma is an open Vision-Language Model (VLM) that is based on the SigLIP-So400m vision encoder and the Gemma-2B language model. It is trained to be a versatile and broadly knowledgeable base model that is effective to transfer. It achieves strong performance on a wide variety of open-world tasks. We evaluate PaliGemma on almost 40 diverse tasks including standard VLM benchmarks, but also more specialized tasks such as remote-sensing and segmentation.

co-cited works

clear filters

representative citing papers

Koshur Pixel: a large-scale synthetic ocr dataset for kashmiri

cs.CV · 2026-06-22 · unverdicted · novelty 7.0

Koshur Pixel is the first large-scale synthetic OCR dataset for Kashmiri with 613,078 image-text pairs generated via SynthOCR-Gen from the KS-PRET-5M corpus across multiple fonts and granularities with 25+ augmentations.

DSCA: Dynamic Subspace Concept Alignment for Lifelong VLM Editing

cs.CV · 2026-04-09 · unverdicted · novelty 7.0

DSCA turns concept isolation into an architectural property by dynamically creating orthogonal subspaces for non-interfering lifelong edits in vision-language models, sustaining over 95% success after 1000 sequential edits.

SAM 3: Segment Anything with Concepts

cs.CV · 2025-11-20 · unverdicted · novelty 7.0

SAM 3 introduces promptable concept segmentation that doubles accuracy of prior systems on images and videos while improving standard SAM segmentation performance.

TextTeacher: What Can Language Teach About Images?

cs.CV · 2026-05-21 · unverdicted · novelty 6.0

TextTeacher uses frozen text embeddings from captions as semantic anchors to guide vision model training, improving ImageNet accuracy by up to 2.7 p.p. and transfer performance by 1.0 p.p. on average.

citing papers explorer

Showing 6 of 6 citing papers after filters.

  • Breaking Lock-In: Preserving Steerability under Low-Data VLA Post-Training cs.RO · 2026-04-25 · unverdicted · none · ref 61 · internal anchor

    DeLock mitigates lock-in in low-data VLA post-training via visual grounding preservation and test-time contrastive prompt guidance, outperforming baselines across eight evaluations while matching data-heavy generalist policies.

  • ST-$\pi$: Structured SpatioTemporal VLA for Robotic Manipulation cs.RO · 2026-04-20 · unverdicted · none · ref 3 · internal anchor

    ST-π structures VLA models by having a spatiotemporal VLM produce causally ordered chunk-level prompts that guide a dual-generator action expert to jointly handle spatial and temporal control in robotic manipulation.

  • UAV-Track VLA: Embodied Aerial Tracking via Vision-Language-Action Models cs.CV · 2026-04-02 · conditional · none · ref 44 · internal anchor

    UAV-Track VLA modifies the π0.5 VLA architecture with temporal compression and dual-branch decoding to reach 61.76% success and 269.65 average frames in long-distance pedestrian tracking on a new 890K-frame UAV dataset, while cutting inference latency by 33.4%.

  • FAST: Efficient Action Tokenization for Vision-Language-Action Models cs.RO · 2025-01-16 · unverdicted · none · ref 5 · internal anchor

    FAST applies discrete cosine transform to robot action sequences for efficient tokenization, enabling autoregressive VLAs to succeed on high-frequency dexterous tasks and scale to 10k hours of data while matching diffusion VLA performance with up to 5x faster training.

  • $\pi_0$: A Vision-Language-Action Flow Model for General Robot Control cs.LG · 2024-10-31 · unverdicted · none · ref 5 · internal anchor

    π₀ is a vision-language-action flow model trained on diverse multi-platform robot data that supports zero-shot task performance, language instruction following, and efficient fine-tuning for dexterous tasks.

  • ZAYA1-VL-8B Technical Report cs.CV · 2026-05-08 · unverdicted · none · ref 143 · internal anchor

    ZAYA1-VL-8B is a new MoE vision-language model with vision-specific LoRA adapters and bidirectional image attention that reports competitive performance against several 3B-4B models on image, reasoning, and counting benchmarks.