CLIPScore: A reference-free evaluation metric for image captioning

Hessel, J · 2021

4 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Attention Sinks in Diffusion Transformers: A Causal Analysis

cs.CV · 2026-05-10 · unverdicted · novelty 6.0 · 2 refs

Suppressing attention sinks in diffusion transformers does not degrade CLIP-T alignment at moderate levels but induces sink-specific perceptual shifts six times larger than equal-budget random masking.

Probing CLIP's Comprehension of 360-Degree Textual and Visual Semantics

cs.CV · 2026-04-27 · conditional · novelty 6.0

CLIP models understand 360-degree textual semantics via explicit identifiers but show limited comprehension of visual semantics under horizontal circular shifts, which a LoRA fine-tuning approach improves with a noted trade-off in original task performance.

Rethinking the Design Space of Reinforcement Learning for Diffusion Models: On the Importance of Likelihood Estimation Beyond Loss Design

cs.LG · 2026-02-04 · conditional · novelty 6.0

An ELBO-based likelihood estimator from the final generated sample dominates other RL design factors for diffusion models, raising GenEval from 0.24 to 0.95 in 90 GPU hours with better efficiency than prior methods.

SPOT: Selective Prompt Projection via Total Variation for Inference-Only Safe Text-to-Image Generation

cs.AI · 2026-01-31 · unverdicted · novelty 6.0

SPOT projects prompts to a tau-safe set via total variation to cut inappropriate content 14-44% relative to baselines while preserving benign prompt behavior in frozen T2I models.

citing papers explorer

Attention Sinks in Diffusion Transformers: A Causal Analysis
cs.CV · 2026-05-10 · unverdicted · none · ref 6 · 2 links
Probing CLIP's Comprehension of 360-Degree Textual and Visual Semantics
cs.CV · 2026-04-27 · conditional · none · ref 6
Rethinking the Design Space of Reinforcement Learning for Diffusion Models: On the Importance of Likelihood Estimation Beyond Loss Design
cs.LG · 2026-02-04 · conditional · none · ref 8
SPOT: Selective Prompt Projection via Total Variation for Inference-Only Safe Text-to-Image Generation
cs.AI · 2026-01-31 · unverdicted · none · ref 5
