
Learning transferable visual models from natural language supervision

In International Conference on Machine Learning

18 Pith papers cite this work. Polarity classification is still indexing.


citation-role summary: background 1

citation-polarity summary: background 1

representative citing papers

Towards Realistic 3D Emission Materials: Dataset, Baseline, and Evaluation for Emission Texture Generation

cs.CV · 2026-04-13 · unverdicted · novelty 8.0

The work creates the first dataset and baseline for generating emission textures on 3D objects to reproduce glowing materials from input images.

CFSR: Geometry-Conditioned Shadow Removal via Physical Disentanglement

cs.CV · 2026-04-20 · unverdicted · novelty 7.0

CFSR reframes shadow removal as a physics-constrained process using geometric and semantic priors from depth, DINO, CLIP, and frequency decoupling to achieve claimed state-of-the-art results.

UniEditBench: A Unified and Cost-Effective Benchmark for Image and Video Editing via Distilled MLLMs

cs.CV · 2026-04-17 · unverdicted · novelty 7.0

UniEditBench unifies image and video editing evaluation with a nine-plus-eight operation taxonomy and cost-effective 4B/8B distilled MLLM evaluators that align with human judgments.

VidAudio-Bench: Benchmarking V2A and VT2A Generation across Four Audio Categories

cs.SD · 2026-04-12 · unverdicted · novelty 7.0

VidAudio-Bench benchmarks V2A and VT2A models across four audio categories, revealing poor speech/singing performance and a tension between visual alignment and text instruction following.

DIRECT: Video Mashup Creation via Hierarchical Multi-Agent Planning and Intent-Guided Editing

cs.CV · 2026-04-06 · unverdicted · novelty 7.0

DIRECT uses a three-level multi-agent framework to solve video mashup creation as a multimodal coherency problem, outperforming baselines on a new benchmark.

When Surfaces Lie: Exploiting Wrinkle-Induced Attention Shift to Attack Vision-Language Models

cs.CV · 2026-03-29 · unverdicted · novelty 7.0

A wrinkle-field perturbation method creates photorealistic non-rigid image changes that degrade state-of-the-art VLMs on image captioning and VQA more effectively than prior baselines.

OCCAM: Open-set Causal Concept explAnation and Ontology induction for black-box vision Models

cs.AI · 2026-05-18 · unverdicted · novelty 6.0

OCCAM discovers open-set visual concepts, estimates causal contributions via object-level interventions on black-box vision models, and induces a global concept ontology from aggregated dataset evidence.

Catching the Infection Before It Spreads: Foresight-Guided Defense in Multi-Agent Systems

cs.AI · 2026-05-03 · unverdicted · novelty 6.0

FLP uses multi-persona foresight simulation to detect infections via response diversity and applies local purification to reduce maximum cumulative infection rates in multi-agent systems from over 95% to below 5.47%.

FreqCache: Accelerating Embodied VLN Models with Adaptive Frequency-Guided Token Caching

cs.RO · 2026-04-27 · unverdicted · novelty 6.0

FreqCache uses frequency domain properties to adaptively select, refresh, and budget token caches in VLN models, delivering 1.59x speedup with negligible overhead.

Where to Focus: Query-Modulated Multimodal Keyframe Selection for Long Video Understanding

cs.CV · 2026-04-19 · unverdicted · novelty 6.0

Q-Gate dynamically routes keyframe selection in long videos via query-modulated gating across visual grounding, global matching, and contextual alignment experts to improve MLLM performance.

VersaVogue: Visual Expert Orchestration and Preference Alignment for Unified Fashion Synthesis

cs.CV · 2026-04-08 · unverdicted · novelty 6.0

VersaVogue unifies garment generation and virtual dressing via trait-routing attention with mixture-of-experts and an automated multi-perspective preference optimization pipeline that uses DPO without human labels.

STEAR: Layer-Aware Spatiotemporal Evidence Intervention for Hallucination Mitigation in Video Large Language Models

cs.CV · 2026-04-03 · unverdicted · novelty 6.0

STEAR reduces spatial and temporal hallucinations in Video-LLMs via layer-aware evidence intervention from middle decoder layers in a single-encode pass.

ScaleDoc: Scaling LLM-based Predicates over Large Document Collections

cs.DB · 2025-09-16 · unverdicted · novelty 6.0

ScaleDoc achieves over 2x end-to-end speedup and up to 85% fewer LLM invocations for semantic predicates on large document collections via offline LLM representations, contrastive-trained proxy filtering, and adaptive cascades.

Observe Less, Understand More: Cost-aware Cross-scale Observation for Remote Sensing Understanding

cs.CV · 2026-04-13 · unverdicted · novelty 5.0

A unified cost-aware formulation couples fine-grained high-resolution sampling decisions with cross-patch representation prediction to achieve superior performance-cost trade-offs on remote sensing recognition and retrieval tasks using a new 10M-image benchmark.

Bridging the RGB-IR Gap: Consensus and Discrepancy Modeling for Text-Guided Multispectral Detection

cs.CV · 2026-04-13 · unverdicted · novelty 5.0

A text-guided fusion method for RGB-IR object detection aligns modalities via semantic bridging and incorporates both consensus and discrepancy cues through dynamic recalibration.

Zoom In, Reason Out: Efficient Far-field Anomaly Detection in Expressway Surveillance Videos via Focused VLM Reasoning Guided by Bayesian Inference

cs.CV · 2026-04-26 · unverdicted · novelty 4.0

VIBES uses Bayesian inference to trigger focused VLM reasoning on localized far-field regions in expressway videos, improving anomaly detection accuracy and efficiency.

Visualizing the Invisible: Generative Visual Grounding Empowers Universal EEG Understanding in MLLMs

cs.AI · 2026-05-18

Beyond Chain-of-Thought: Rewrite as a Universal Interface for Generative Multimodal Embeddings

cs.CV · 2026-04-24

citing papers explorer

Showing 18 of 18 citing papers.

Towards Realistic 3D Emission Materials: Dataset, Baseline, and Evaluation for Emission Texture Generation
cs.CV · 2026-04-13 · unverdicted · none · ref 22
The work creates the first dataset and baseline for generating emission textures on 3D objects to reproduce glowing materials from input images.

CFSR: Geometry-Conditioned Shadow Removal via Physical Disentanglement
cs.CV · 2026-04-20 · unverdicted · none · ref 40
CFSR reframes shadow removal as a physics-constrained process using geometric and semantic priors from depth, DINO, CLIP, and frequency decoupling to achieve claimed state-of-the-art results.

UniEditBench: A Unified and Cost-Effective Benchmark for Image and Video Editing via Distilled MLLMs
cs.CV · 2026-04-17 · unverdicted · none · ref 41
UniEditBench unifies image and video editing evaluation with a nine-plus-eight operation taxonomy and cost-effective 4B/8B distilled MLLM evaluators that align with human judgments.

VidAudio-Bench: Benchmarking V2A and VT2A Generation across Four Audio Categories
cs.SD · 2026-04-12 · unverdicted · none · ref 39
VidAudio-Bench benchmarks V2A and VT2A models across four audio categories, revealing poor speech/singing performance and a tension between visual alignment and text instruction following.

DIRECT: Video Mashup Creation via Hierarchical Multi-Agent Planning and Intent-Guided Editing
cs.CV · 2026-04-06 · unverdicted · none · ref 26
DIRECT uses a three-level multi-agent framework to solve video mashup creation as a multimodal coherency problem, outperforming baselines on a new benchmark.

When Surfaces Lie: Exploiting Wrinkle-Induced Attention Shift to Attack Vision-Language Models
cs.CV · 2026-03-29 · unverdicted · none · ref 28
A wrinkle-field perturbation method creates photorealistic non-rigid image changes that degrade state-of-the-art VLMs on image captioning and VQA more effectively than prior baselines.

OCCAM: Open-set Causal Concept explAnation and Ontology induction for black-box vision Models
cs.AI · 2026-05-18 · unverdicted · none · ref 35
OCCAM discovers open-set visual concepts, estimates causal contributions via object-level interventions on black-box vision models, and induces a global concept ontology from aggregated dataset evidence.

Catching the Infection Before It Spreads: Foresight-Guided Defense in Multi-Agent Systems
cs.AI · 2026-05-03 · unverdicted · none · ref 34
FLP uses multi-persona foresight simulation to detect infections via response diversity and applies local purification to reduce maximum cumulative infection rates in multi-agent systems from over 95% to below 5.47%.

FreqCache: Accelerating Embodied VLN Models with Adaptive Frequency-Guided Token Caching
cs.RO · 2026-04-27 · unverdicted · none · ref 21
FreqCache uses frequency domain properties to adaptively select, refresh, and budget token caches in VLN models, delivering 1.59x speedup with negligible overhead.

Where to Focus: Query-Modulated Multimodal Keyframe Selection for Long Video Understanding
cs.CV · 2026-04-19 · unverdicted · none · ref 24
Q-Gate dynamically routes keyframe selection in long videos via query-modulated gating across visual grounding, global matching, and contextual alignment experts to improve MLLM performance.

VersaVogue: Visual Expert Orchestration and Preference Alignment for Unified Fashion Synthesis
cs.CV · 2026-04-08 · unverdicted · none · ref 27
VersaVogue unifies garment generation and virtual dressing via trait-routing attention with mixture-of-experts and an automated multi-perspective preference optimization pipeline that uses DPO without human labels.

STEAR: Layer-Aware Spatiotemporal Evidence Intervention for Hallucination Mitigation in Video Large Language Models
cs.CV · 2026-04-03 · unverdicted · none · ref 38
STEAR reduces spatial and temporal hallucinations in Video-LLMs via layer-aware evidence intervention from middle decoder layers in a single-encode pass.

ScaleDoc: Scaling LLM-based Predicates over Large Document Collections
cs.DB · 2025-09-16 · unverdicted · none · ref 35
ScaleDoc achieves over 2x end-to-end speedup and up to 85% fewer LLM invocations for semantic predicates on large document collections via offline LLM representations, contrastive-trained proxy filtering, and adaptive cascades.

Observe Less, Understand More: Cost-aware Cross-scale Observation for Remote Sensing Understanding
cs.CV · 2026-04-13 · unverdicted · none · ref 37
A unified cost-aware formulation couples fine-grained high-resolution sampling decisions with cross-patch representation prediction to achieve superior performance-cost trade-offs on remote sensing recognition and retrieval tasks using a new 10M-image benchmark.

Bridging the RGB-IR Gap: Consensus and Discrepancy Modeling for Text-Guided Multispectral Detection
cs.CV · 2026-04-13 · unverdicted · none · ref 37
A text-guided fusion method for RGB-IR object detection aligns modalities via semantic bridging and incorporates both consensus and discrepancy cues through dynamic recalibration.

Zoom In, Reason Out: Efficient Far-field Anomaly Detection in Expressway Surveillance Videos via Focused VLM Reasoning Guided by Bayesian Inference
cs.CV · 2026-04-26 · unverdicted · none · ref 29
VIBES uses Bayesian inference to trigger focused VLM reasoning on localized far-field regions in expressway videos, improving anomaly detection accuracy and efficiency.

Visualizing the Invisible: Generative Visual Grounding Empowers Universal EEG Understanding in MLLMs
cs.AI · 2026-05-18 · unreviewed · ref 39

Beyond Chain-of-Thought: Rewrite as a Universal Interface for Generative Multimodal Embeddings
cs.CV · 2026-04-24 · unreviewed · ref 40
