super hub Mixed citations

PaliGemma: A versatile 3B VLM for transfer

Mixed citation behavior. The most common role is background (59% of classified citations).

164 Pith papers cite it.
abstract

PaliGemma is an open Vision-Language Model (VLM) that is based on the SigLIP-So400m vision encoder and the Gemma-2B language model. It is trained to be a versatile and broadly knowledgeable base model that is effective to transfer. It achieves strong performance on a wide variety of open-world tasks. We evaluate PaliGemma on almost 40 diverse tasks including standard VLM benchmarks, but also more specialized tasks such as remote-sensing and segmentation.

citation-role summary

background 19 · method 6 · baseline 5 · dataset 2

representative citing papers

Koshur Pixel: A Large-Scale Synthetic OCR Dataset for Kashmiri

cs.CV · 2026-06-22 · unverdicted · novelty 7.0 · 2 refs

Koshur Pixel is the first large-scale synthetic OCR dataset for Kashmiri with 613,078 image-text pairs generated via SynthOCR-Gen from the KS-PRET-5M corpus across multiple fonts and granularities with 25+ augmentations.

NAC: Neural Action Codec for Vision-Language-Action Models

cs.RO · 2026-06-19 · unverdicted · novelty 7.0

NAC adapts multi-scale RVQGAN audio codecs with kinematic-specific losses to produce ordered action tokens that yield lower reconstruction error and higher task success than prior tokenizers in VLA models.

Large Language Model Selection with Limited Annotations

cs.CL · 2026-05-24 · unverdicted · novelty 7.0

SELECT-LLM is the first active model selection framework for LLMs that uses expected information gain from pairwise output similarities to minimize required annotations, reporting up to 84.78% cost reduction across 23 datasets and 156 models.

Dynamic Execution Commitment of Vision-Language-Action Models

cs.CV · 2026-05-12 · unverdicted · novelty 7.0 · 3 refs

A3 reframes dynamic action chunk commitment in VLA models as self-speculative prefix verification, accepting the longest continuous sequence of actions that satisfies consensus-ordered conditional invariance and prefix-closed sequential consistency.

citing papers explorer

Showing 4 of 4 citing papers after filters.

  • Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models cs.CV · 2024-09-25 · accept · none · ref 10 · internal anchor

    Molmo VLMs trained on newly collected PixMo open datasets achieve state-of-the-art performance among open-weight models and surpass multiple proprietary VLMs including Claude 3.5 Sonnet and Gemini 1.5 Pro.

  • Drop-Then-Recovery: How Redundant Are Vision-Language-Action Models? cs.RO · 2026-06-26 · accept · none · ref 2 · internal anchor

    VLA language backbones show high redundancy on manipulation benchmarks, with half the LLM blocks removable and even two blocks sufficient to recover baseline performance after fine-tuning, unlike vision and action pathways.

  • Should We Still Pretrain Encoders with Masked Language Modeling? cs.CL · 2025-07-01 · accept · none · ref 4 · internal anchor

    Controlled ablations of 38 models find MLM superior to CLM on representation benchmarks while CLM offers better data efficiency and stability; a biphasic CLM-then-MLM schedule is optimal under fixed compute and improves when initialized from pretrained CLM models.

  • Multilingual Vision-Language Models, A Survey cs.CL · 2025-09-26 · accept · none · ref 15 · internal anchor

    The survey identifies a key tension in multilingual vision-language models between language neutrality via contrastive learning and cultural awareness via diverse data, with most benchmarks relying on translation-based evaluation.