RaPO reduces catastrophic forgetting in visual continual learning by shaping rewards around policy drift and stabilizing advantages with cross-task exponential moving averages during reinforcement fine-tuning of multimodal models.
Learning transferable visual models from natural language supervision
4 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
fields
cs.CV 4representative citing papers
LAION-5B is an openly released dataset of 5.85 billion CLIP-filtered image-text pairs that enables replication of foundational vision-language models.
Policy entropy remains constant in flow-matching models during RLHF due to fixed noise schedules while perceptual diversity collapses from mode-seeking policy gradients, so perceptual entropy constraints are introduced to preserve diversity and improve quality.
ForgeVLA enables federated VLA model training from unlabeled vision-action pairs by recovering language via embodied classifiers and using contrastive planning plus adaptive aggregation to avoid feature collapse.
citing papers explorer
-
Overcoming Catastrophic Forgetting in Visual Continual Learning with Reinforcement Fine-Tuning
RaPO reduces catastrophic forgetting in visual continual learning by shaping rewards around policy drift and stabilizing advantages with cross-task exponential moving averages during reinforcement fine-tuning of multimodal models.
-
LAION-5B: An open large-scale dataset for training next generation image-text models
LAION-5B is an openly released dataset of 5.85 billion CLIP-filtered image-text pairs that enables replication of foundational vision-language models.
-
When Policy Entropy Constraint Fails: Preserving Diversity in Flow-based RLHF via Perceptual Entropy
Policy entropy remains constant in flow-matching models during RLHF due to fixed noise schedules while perceptual diversity collapses from mode-seeking policy gradients, so perceptual entropy constraints are introduced to preserve diversity and improve quality.
-
ForgeVLA: Federated Vision-Language-Action Learning without Language Annotations
ForgeVLA enables federated VLA model training from unlabeled vision-action pairs by recovering language via embodied classifiers and using contrastive planning plus adaptive aggregation to avoid feature collapse.