hub Canonical reference

Flamingo: a visual language model for few-shot learning.Advances in neural information processing systems, 35:23716–23736

Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al

Canonical reference. 100% of citing Pith papers cite this work as background.

16 Pith papers citing it

Background 100% of classified citations

browse 16 citing papers

hub tools

JSON dossier citing papers JSON

citation-role summary

background 5

citation-polarity summary

background 5

representative citing papers

Improving Vision-language Models with Perception-centric Process Reward Models

cs.CV · 2026-04-27 · unverdicted · novelty 7.0

Perceval is a perception-centric PRM that detects token-level perceptual errors in VLMs, supporting token-advantage RL training and iterative test-time scaling for improved reasoning.

PolySLGen: Online Multimodal Speaking-Listening Reaction Generation in Polyadic Interaction

cs.CV · 2026-04-09 · unverdicted · novelty 7.0

PolySLGen generates contextually appropriate and temporally coherent multimodal speaking and listening reactions for polyadic interactions by fusing group motion and social cues.

Omni-NegCLIP: Enhancing CLIP with Front-Layer Contrastive Fine-Tuning for Comprehensive Negation Understanding

cs.CV · 2026-03-31 · unverdicted · novelty 7.0

Omni-NegCLIP improves CLIP's negation understanding by up to 52.65% on presence-based and 12.50% on absence-based tasks through front-layer fine-tuning with specialized contrastive losses.

LLMind: Bio-inspired Training-free Adaptive Visual Representations for Vision-Language Models

cs.CV · 2026-03-16 · unverdicted · novelty 7.0

LLMind uses bio-inspired non-uniform sampling via a Mobius module and closed-loop semantic feedback to retain 82-97% of full-resolution VLM performance with only 1-5% of pixels on VQA benchmarks.

Mema: Memory-Augmented Adapter for Enhanced Vision-Language Understanding

cs.CV · 2026-02-28 · unverdicted · novelty 7.0

Mema adds a stateful memory module to vision encoders that accumulates hierarchical visual features across layers and selectively injects portions back via feedback to preserve fine-grained cues, yielding consistent gains on multimodal benchmarks.

Beyond Patches: Global-aware Autoregressive Model for Multimodal Few-Shot Font Generation

cs.CV · 2026-01-04 · unverdicted · novelty 7.0

GAR-Font is a global-aware autoregressive framework for multimodal few-shot font generation that adds global tokenization, a language-style adapter, and post-refinement to improve style coherence over patch-based methods.

Curvature-Aware Captioning:Leveraging Geodesic Attention for 3D Scene Understanding

cs.CV · 2026-05-09 · unverdicted · novelty 6.0

A new framework combines self-attention on the Oblique manifold with bidirectional geodesic cross-attention on the Lorentz hyperboloid to improve both localization accuracy and descriptive coherence in 3D dense captioning.

Chart-FR1: Visual Focus-Driven Fine-Grained Reasoning on Dense Charts

cs.CV · 2026-05-03 · unverdicted · novelty 6.0

Chart-FR1 uses Focus-CoT for linking reasoning to visual cues and Focus-GRPO reinforcement learning with efficiency rewards to outperform prior MLLMs on dense chart reasoning tasks.

G-MIXER: Geodesic Mixup-based Implicit Semantic Expansion and Explicit Semantic Re-ranking for Zero-Shot Composed Image Retrieval

cs.CV · 2026-04-16 · unverdicted · novelty 6.0

G-MIXER achieves state-of-the-art zero-shot composed image retrieval by using geodesic mixup to build diverse implicit candidates and MLLM-derived explicit semantics for re-ranking.

Thinking Diffusion: Penalize and Guide Visual-Grounded Reasoning in Diffusion Multimodal Language Models

cs.AI · 2026-04-07 · unverdicted · novelty 6.0

Position and step penalty plus visual reasoning guidance fix premature answering and weak visual grounding in diffusion MLLMs, delivering up to 7.5% accuracy gains and over 3x speedup.

SpatialStack: Layered Geometry-Language Fusion for 3D VLM Spatial Reasoning

cs.CV · 2026-03-28 · unverdicted · novelty 6.0

SpatialStack improves 3D spatial reasoning in vision-language models by stacking and synchronizing multi-level geometric features with the language backbone.

Same Content, Different Answers: Cross-Modal Inconsistency in MLLMs

cs.AI · 2025-12-09 · unverdicted · novelty 6.0

State-of-the-art MLLMs show substantial inconsistency when reasoning over the same information presented in image, text, or mixed modalities, even after accounting for OCR errors, with inconsistency linked to visual factors and modality gap.

MedGRPO: Multi-Task Reinforcement Learning for Heterogeneous Medical Video Understanding

cs.CV · 2025-12-06 · conditional · novelty 6.0

MedGRPO applies cross-dataset reward normalization and a clinical LLM judge within multi-task RL to improve vision-language models on heterogeneous medical video understanding tasks using the new MedVidBench dataset.

An Efficient Token Compression Framework for Visual Object Tracking

cs.CV · 2026-05-08 · unverdicted · novelty 5.0

ETCTrack compresses template tokens by 60% in visual trackers via an adaptive compressor and hierarchical interaction, cutting MACs 21.4% with 0.4% accuracy drop on seven benchmarks.

EvoComp: Learning Visual Token Compression for Multimodal Large Language Models via Semantic-Guided Evolutionary Labeling

cs.CV · 2026-04-18 · unverdicted · novelty 5.0

EvoComp compresses visual tokens in MLLMs by 3x while retaining 99.3% accuracy via an evolutionary labeling strategy that searches for low-loss, semantically diverse token subsets.

Character-Centered Dialogue Generation from Scene-Level Prompts

cs.CV · 2025-05-22 · unverdicted · novelty 4.0

A training-free framework generates expressive, character-grounded dialogue and speech from scene prompts using vision-language encoders, LLMs, and a recursive narrative memory bank for cross-scene consistency.

citing papers explorer

Showing 16 of 16 citing papers.

Improving Vision-language Models with Perception-centric Process Reward Models cs.CV · 2026-04-27 · unverdicted · none · ref 2
Perceval is a perception-centric PRM that detects token-level perceptual errors in VLMs, supporting token-advantage RL training and iterative test-time scaling for improved reasoning.
PolySLGen: Online Multimodal Speaking-Listening Reaction Generation in Polyadic Interaction cs.CV · 2026-04-09 · unverdicted · none · ref 4
PolySLGen generates contextually appropriate and temporally coherent multimodal speaking and listening reactions for polyadic interactions by fusing group motion and social cues.
Omni-NegCLIP: Enhancing CLIP with Front-Layer Contrastive Fine-Tuning for Comprehensive Negation Understanding cs.CV · 2026-03-31 · unverdicted · none · ref 1
Omni-NegCLIP improves CLIP's negation understanding by up to 52.65% on presence-based and 12.50% on absence-based tasks through front-layer fine-tuning with specialized contrastive losses.
LLMind: Bio-inspired Training-free Adaptive Visual Representations for Vision-Language Models cs.CV · 2026-03-16 · unverdicted · none · ref 3
LLMind uses bio-inspired non-uniform sampling via a Mobius module and closed-loop semantic feedback to retain 82-97% of full-resolution VLM performance with only 1-5% of pixels on VQA benchmarks.
Mema: Memory-Augmented Adapter for Enhanced Vision-Language Understanding cs.CV · 2026-02-28 · unverdicted · none · ref 1
Mema adds a stateful memory module to vision encoders that accumulates hierarchical visual features across layers and selectively injects portions back via feedback to preserve fine-grained cues, yielding consistent gains on multimodal benchmarks.
Beyond Patches: Global-aware Autoregressive Model for Multimodal Few-Shot Font Generation cs.CV · 2026-01-04 · unverdicted · none · ref 1
GAR-Font is a global-aware autoregressive framework for multimodal few-shot font generation that adds global tokenization, a language-style adapter, and post-refinement to improve style coherence over patch-based methods.
Curvature-Aware Captioning:Leveraging Geodesic Attention for 3D Scene Understanding cs.CV · 2026-05-09 · unverdicted · none · ref 3
A new framework combines self-attention on the Oblique manifold with bidirectional geodesic cross-attention on the Lorentz hyperboloid to improve both localization accuracy and descriptive coherence in 3D dense captioning.
Chart-FR1: Visual Focus-Driven Fine-Grained Reasoning on Dense Charts cs.CV · 2026-05-03 · unverdicted · none · ref 2
Chart-FR1 uses Focus-CoT for linking reasoning to visual cues and Focus-GRPO reinforcement learning with efficiency rewards to outperform prior MLLMs on dense chart reasoning tasks.
G-MIXER: Geodesic Mixup-based Implicit Semantic Expansion and Explicit Semantic Re-ranking for Zero-Shot Composed Image Retrieval cs.CV · 2026-04-16 · unverdicted · none · ref 3
G-MIXER achieves state-of-the-art zero-shot composed image retrieval by using geodesic mixup to build diverse implicit candidates and MLLM-derived explicit semantics for re-ranking.
Thinking Diffusion: Penalize and Guide Visual-Grounded Reasoning in Diffusion Multimodal Language Models cs.AI · 2026-04-07 · unverdicted · none · ref 1
Position and step penalty plus visual reasoning guidance fix premature answering and weak visual grounding in diffusion MLLMs, delivering up to 7.5% accuracy gains and over 3x speedup.
SpatialStack: Layered Geometry-Language Fusion for 3D VLM Spatial Reasoning cs.CV · 2026-03-28 · unverdicted · none · ref 1
SpatialStack improves 3D spatial reasoning in vision-language models by stacking and synchronizing multi-level geometric features with the language backbone.
Same Content, Different Answers: Cross-Modal Inconsistency in MLLMs cs.AI · 2025-12-09 · unverdicted · none · ref 4
State-of-the-art MLLMs show substantial inconsistency when reasoning over the same information presented in image, text, or mixed modalities, even after accounting for OCR errors, with inconsistency linked to visual factors and modality gap.
MedGRPO: Multi-Task Reinforcement Learning for Heterogeneous Medical Video Understanding cs.CV · 2025-12-06 · conditional · none · ref 2
MedGRPO applies cross-dataset reward normalization and a clinical LLM judge within multi-task RL to improve vision-language models on heterogeneous medical video understanding tasks using the new MedVidBench dataset.
An Efficient Token Compression Framework for Visual Object Tracking cs.CV · 2026-05-08 · unverdicted · none · ref 1
ETCTrack compresses template tokens by 60% in visual trackers via an adaptive compressor and hierarchical interaction, cutting MACs 21.4% with 0.4% accuracy drop on seven benchmarks.
EvoComp: Learning Visual Token Compression for Multimodal Large Language Models via Semantic-Guided Evolutionary Labeling cs.CV · 2026-04-18 · unverdicted · none · ref 1
EvoComp compresses visual tokens in MLLMs by 3x while retaining 99.3% accuracy via an evolutionary labeling strategy that searches for low-loss, semantically diverse token subsets.
Character-Centered Dialogue Generation from Scene-Level Prompts cs.CV · 2025-05-22 · unverdicted · none · ref 6
A training-free framework generates expressive, character-grounded dialogue and speech from scene prompts using vision-language encoders, LLMs, and a recursive narrative memory bank for cross-scene consistency.

Flamingo: a visual language model for few-shot learning.Advances in neural information processing systems, 35:23716–23736

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer