Align before fuse: Vision and language representation learning with momentum distillation

2021

3 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

HapticLDM: A Diffusion Model for Text-to-Vibrotactile Generation

cs.HC · 2026-05-11 · unverdicted · novelty 7.0

HapticLDM is the first latent diffusion model that generates vibrotactile signals directly from text, using dynamic text curation and global denoising to improve realism and semantic alignment over autoregressive baselines.

Memory-Augmented Query Intent Understanding for Efficient Chat-based Image Retrieval

cs.CV · 2026-05-17 · unverdicted · novelty 6.0

MAQIU adds a memorization module and recall mechanism to update query intent dynamically in chat-based image retrieval, cutting FLOPs by 86.4% versus ChatIR while improving results.

Segmentation, Detection and Explanation: A Unified Framework for CT Appearance Reasoning

cs.CV · 2026-05-15 · unverdicted · novelty 6.0

A unified autoregressive vision-language framework integrates segmentation, detection, and appearance reasoning for CT images via task-routing tokens and progressive refinement, with gains on public benchmarks.

citing papers explorer

Showing 3 of 3 citing papers.

HapticLDM: A Diffusion Model for Text-to-Vibrotactile Generation
cs.HC · 2026-05-11 · unverdicted · none · ref 27
Memory-Augmented Query Intent Understanding for Efficient Chat-based Image Retrieval
cs.CV · 2026-05-17 · unverdicted · none · ref 22
Segmentation, Detection and Explanation: A Unified Framework for CT Appearance Reasoning
cs.CV · 2026-05-15 · unverdicted · none · ref 28
