Looking Beyond Text: Reducing Language Bias in Large Vision-Language Models via Multimodal Dual-Attention and Soft-Image Guidance
1 Pith paper cites this work. Polarity classification is still indexing.
Fields: cs.CV (1)
Years: 2026 (1)
Verdicts: UNVERDICTED (1)
Representative citing papers
From Prompts to Tokens: Internalizing Causal Supervision in Vision-Language Model for Multi-Image Causal Reasoning
BridgeVLM internalizes causal supervision in VLMs via causal graph induction, Causal Tokens, and RAMP layers with M3S training, raising intervention accuracy on CausalVLBench from 33.2% to 54.4% and structure learning F1 from 33.4% to 75.1%.