Sigmoid loss for language image pre-training

Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, Lucas Beyer · 2023

4 Pith papers cite this work. Polarity classification is still indexing.

4 Pith papers citing it

browse 4 citing papers

citation-role summary

method 1

citation-polarity summary

use method 1

representative citing papers

VLAs are Confined yet Capable of Generalizing to Novel Instructions

cs.RO · 2025-05-06 · unverdicted · novelty 7.0

Averaging and temporally interpolating text latents in VLAs enables 83% success on novel task combinations in the libero-ood benchmark where SOTA models achieve under 15%.

OmniEncoder: See, Hear, and Feel Continuous Motion Like Humans With One Encoder

cs.CV · 2026-05-02 · unverdicted · novelty 6.0

Omni-Encoder unifies visual and audio encoding at symmetrical 25 fps using a Transformer with three new components, yielding gains on fine-grained motion tasks while matching baselines on audio-visual benchmarks.

ALLaVA: Harnessing GPT4V-Synthesized Data for Lite Vision-Language Models

cs.CL · 2024-02-18 · unverdicted · novelty 6.0

ALLaVA creates 1.3M GPT4V-synthesized samples enabling 4B VLMs to achieve competitive results on 17 benchmarks and match 7B/13B models on some tasks.

Show-o2: Improved Native Unified Multimodal Models

cs.CV · 2025-06-18 · unverdicted · novelty 4.0

Show-o2 unifies text, image, and video understanding and generation in a single autoregressive-plus-flow-matching model built on 3D causal VAE representations.

citing papers explorer

Showing 4 of 4 citing papers.

VLAs are Confined yet Capable of Generalizing to Novel Instructions cs.RO · 2025-05-06 · unverdicted · none · ref 43
Averaging and temporally interpolating text latents in VLAs enables 83% success on novel task combinations in the libero-ood benchmark where SOTA models achieve under 15%.
OmniEncoder: See, Hear, and Feel Continuous Motion Like Humans With One Encoder cs.CV · 2026-05-02 · unverdicted · none · ref 2
Omni-Encoder unifies visual and audio encoding at symmetrical 25 fps using a Transformer with three new components, yielding gains on fine-grained motion tasks while matching baselines on audio-visual benchmarks.
ALLaVA: Harnessing GPT4V-Synthesized Data for Lite Vision-Language Models cs.CL · 2024-02-18 · unverdicted · none · ref 136
ALLaVA creates 1.3M GPT4V-synthesized samples enabling 4B VLMs to achieve competitive results on 17 benchmarks and match 7B/13B models on some tasks.
Show-o2: Improved Native Unified Multimodal Models cs.CV · 2025-06-18 · unverdicted · none · ref 138
Show-o2 unifies text, image, and video understanding and generation in a single autoregressive-plus-flow-matching model built on 3D causal VAE representations.

Sigmoid loss for language image pre-training

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer