hub

arXiv preprint arXiv:1908.07490 , year=

Hao Tan, Mohit Bansal · 1908 · arXiv 1908.07490

17 Pith papers cite this work. Polarity classification is still indexing.

17 Pith papers citing it

read on arXiv browse 17 citing papers

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 2

citation-polarity summary

background 2

representative citing papers

Harnessing Hyperbolic Geometry for Harmful Prompt Detection and Sanitization

cs.CR · 2026-04-07 · unverdicted · novelty 7.0

HyPE detects harmful prompts as outliers in hyperbolic space and HyPS sanitizes them using explainable attribution, outperforming prior defenses in accuracy and robustness across datasets and adversarial scenarios.

ViBES: A Conversational Agent with Behaviorally-Intelligent 3D Virtual Body

cs.CV · 2025-12-16 · unverdicted · novelty 7.0

ViBES introduces a speech-language-behavior model using modality-specific transformer experts that jointly generates dialogue and 3D body actions, showing gains over separate co-speech and text-to-motion baselines on multi-turn metrics.

Adversarial Video Promotion Against Text-to-Video Retrieval

cs.CV · 2025-08-09 · unverdicted · novelty 7.0

Pioneers ViPro, the first attack to adversarially promote videos in text-to-video retrieval, using Modal Refinement to improve black-box transferability across multiple targets.

The ART of Composition: Attention-Regularized Training for Compositional Visual Grounding

cs.CV · 2024-12-11 · unverdicted · novelty 7.0

CompART adds a composition loss on decomposed captions to regularize attention sums and improves multi-object grounding plus VQA across four VLM types and six benchmarks.

LRM: Large Reconstruction Model for Single Image to 3D

cs.CV · 2023-11-08 · conditional · novelty 7.0

LRM is a large transformer that predicts a NeRF directly from a single image after training on a million-object multi-view dataset.

ViperGPT: Visual Inference via Python Execution for Reasoning

cs.CV · 2023-03-14 · unverdicted · novelty 7.0

ViperGPT generates executable Python code to compose pre-trained vision-and-language modules into programs that answer visual queries, reaching state-of-the-art results with no additional training.

PaLI: A Jointly-Scaled Multilingual Language-Image Model

cs.CV · 2022-09-14 · conditional · novelty 7.0

PaLI jointly scales a 4B-parameter vision transformer with language models on a new 10B multilingual image-text dataset to reach state-of-the-art results on vision-language tasks while keeping a simple modular design.

Multimodal LLMs under Pairwise Modalities

cs.CV · 2026-05-20 · unverdicted · novelty 6.0

A two-stage framework enables multimodal LLMs to learn shared latent representations from pairwise modality data and achieve cross-modal generation when incorporating new modalities.

SpecPL: Disentangling Spectral Granularity for Prompt Learning

cs.CV · 2026-05-06 · unverdicted · novelty 6.0

SpecPL introduces spectral decomposition via frozen VAE and counterfactual high-frequency permutation to bridge modality asymmetry in VLM prompt learning, reaching 81.51% harmonic-mean accuracy on 11 benchmarks.

Think Before You Drive: World Model-Inspired Multimodal Grounding for Autonomous Vehicles

cs.CV · 2025-12-03 · unverdicted · novelty 6.0

ThinkDeeper introduces a world-model-based reasoning step that predicts future spatial states to improve multimodal visual grounding for autonomous vehicles, achieving top results on Talk2Car and other benchmarks.

Boosting Team Modeling through Tempo-Relational Representation Learning

cs.LG · 2025-07-17 · unverdicted · novelty 6.0

A tempo-relational neural architecture jointly models temporal and relational aspects of team interactions to outperform prior approaches on team performance prediction and enable efficient multi-task prediction of team constructs.

Senna: Bridging Large Vision-Language Models and End-to-End Autonomous Driving

cs.CV · 2024-10-29 · conditional · novelty 6.0

Senna decouples language-based high-level planning from an LVLM with low-level trajectory prediction from an E2E model, reporting 27% lower planning error and 33% lower collisions after pre-training on DriveX and fine-tuning on nuScenes.

CoCa: Contrastive Captioners are Image-Text Foundation Models

cs.CV · 2022-05-04 · accept · novelty 6.0

CoCa unifies contrastive and generative pretraining in one image-text model to reach 86.3% zero-shot ImageNet accuracy and new state-of-the-art results on multiple downstream benchmarks.

Structural Ranking of the Cognitive Plausibility of Computational Models of Analogy and Metaphors with the Minimal Cognitive Grid

cs.AI · 2026-05-02 · unverdicted · novelty 5.0

A formalized Minimal Cognitive Grid ranks computational models of analogy and metaphor by alignment with cognitive theories using Functional/Structural Ratio, Generality, and Performance Match dimensions.

Learning Invariant Modality Representation for Robust Multimodal Learning from a Causal Inference Perspective

cs.LG · 2026-04-20 · unverdicted · novelty 5.0

CmIR uses causal inference to separate invariant causal representations from spurious ones in multimodal data, improving generalization under distribution shifts and noise via invariance, mutual information, and reconstruction constraints.

MobileVLM V2: Faster and Stronger Baseline for Vision Language Model

cs.CV · 2024-02-06 · unverdicted · novelty 4.0

MobileVLM V2 shows that 1.7B and 3B parameter vision-language models can reach or exceed the performance of 3B and 7B+ models on common VLM benchmarks via targeted design and data improvements.

Decoding the Multimodal Maze: A Systematic Review on the Adoption of Explainability in Multimodal Attention-based Models

cs.LG · 2025-08-06 · unverdicted · novelty 3.0

A systematic literature review of explainability in multimodal attention models finds most studies focus on vision-language tasks with attention-based explanations, but evaluation methods lack consistency and modality-specific considerations.

citing papers explorer

Showing 17 of 17 citing papers.

Harnessing Hyperbolic Geometry for Harmful Prompt Detection and Sanitization cs.CR · 2026-04-07 · unverdicted · none · ref 16
HyPE detects harmful prompts as outliers in hyperbolic space and HyPS sanitizes them using explainable attribution, outperforming prior defenses in accuracy and robustness across datasets and adversarial scenarios.
ViBES: A Conversational Agent with Behaviorally-Intelligent 3D Virtual Body cs.CV · 2025-12-16 · unverdicted · none · ref 102
ViBES introduces a speech-language-behavior model using modality-specific transformer experts that jointly generates dialogue and 3D body actions, showing gains over separate co-speech and text-to-motion baselines on multi-turn metrics.
Adversarial Video Promotion Against Text-to-Video Retrieval cs.CV · 2025-08-09 · unverdicted · none · ref 39
Pioneers ViPro, the first attack to adversarially promote videos in text-to-video retrieval, using Modal Refinement to improve black-box transferability across multiple targets.
The ART of Composition: Attention-Regularized Training for Compositional Visual Grounding cs.CV · 2024-12-11 · unverdicted · none · ref 48
CompART adds a composition loss on decomposed captions to regularize attention sums and improves multi-object grounding plus VQA across four VLM types and six benchmarks.
LRM: Large Reconstruction Model for Single Image to 3D cs.CV · 2023-11-08 · conditional · none · ref 29
LRM is a large transformer that predicts a NeRF directly from a single image after training on a million-object multi-view dataset.
ViperGPT: Visual Inference via Python Execution for Reasoning cs.CV · 2023-03-14 · unverdicted · none · ref 52
ViperGPT generates executable Python code to compose pre-trained vision-and-language modules into programs that answer visual queries, reaching state-of-the-art results with no additional training.
PaLI: A Jointly-Scaled Multilingual Language-Image Model cs.CV · 2022-09-14 · conditional · none · ref 71
PaLI jointly scales a 4B-parameter vision transformer with language models on a new 10B multilingual image-text dataset to reach state-of-the-art results on vision-language tasks while keeping a simple modular design.
Multimodal LLMs under Pairwise Modalities cs.CV · 2026-05-20 · unverdicted · none · ref 48
A two-stage framework enables multimodal LLMs to learn shared latent representations from pairwise modality data and achieve cross-modal generation when incorporating new modalities.
SpecPL: Disentangling Spectral Granularity for Prompt Learning cs.CV · 2026-05-06 · unverdicted · none · ref 10
SpecPL introduces spectral decomposition via frozen VAE and counterfactual high-frequency permutation to bridge modality asymmetry in VLM prompt learning, reaching 81.51% harmonic-mean accuracy on 11 benchmarks.
Think Before You Drive: World Model-Inspired Multimodal Grounding for Autonomous Vehicles cs.CV · 2025-12-03 · unverdicted · none · ref 52
ThinkDeeper introduces a world-model-based reasoning step that predicts future spatial states to improve multimodal visual grounding for autonomous vehicles, achieving top results on Talk2Car and other benchmarks.
Boosting Team Modeling through Tempo-Relational Representation Learning cs.LG · 2025-07-17 · unverdicted · none · ref 135
A tempo-relational neural architecture jointly models temporal and relational aspects of team interactions to outperform prior approaches on team performance prediction and enable efficient multi-task prediction of team constructs.
Senna: Bridging Large Vision-Language Models and End-to-End Autonomous Driving cs.CV · 2024-10-29 · conditional · none · ref 57
Senna decouples language-based high-level planning from an LVLM with low-level trajectory prediction from an E2E model, reporting 27% lower planning error and 33% lower collisions after pre-training on DriveX and fine-tuning on nuScenes.
CoCa: Contrastive Captioners are Image-Text Foundation Models cs.CV · 2022-05-04 · accept · none · ref 25
CoCa unifies contrastive and generative pretraining in one image-text model to reach 86.3% zero-shot ImageNet accuracy and new state-of-the-art results on multiple downstream benchmarks.
Structural Ranking of the Cognitive Plausibility of Computational Models of Analogy and Metaphors with the Minimal Cognitive Grid cs.AI · 2026-05-02 · unverdicted · none · ref 105
A formalized Minimal Cognitive Grid ranks computational models of analogy and metaphor by alignment with cognitive theories using Functional/Structural Ratio, Generality, and Performance Match dimensions.
Learning Invariant Modality Representation for Robust Multimodal Learning from a Causal Inference Perspective cs.LG · 2026-04-20 · unverdicted · none · ref 134
CmIR uses causal inference to separate invariant causal representations from spurious ones in multimodal data, improving generalization under distribution shifts and noise via invariance, mutual information, and reconstruction constraints.
MobileVLM V2: Faster and Stronger Baseline for Vision Language Model cs.CV · 2024-02-06 · unverdicted · none · ref 62
MobileVLM V2 shows that 1.7B and 3B parameter vision-language models can reach or exceed the performance of 3B and 7B+ models on common VLM benchmarks via targeted design and data improvements.
Decoding the Multimodal Maze: A Systematic Review on the Adoption of Explainability in Multimodal Attention-based Models cs.LG · 2025-08-06 · unverdicted · none · ref 95
A systematic literature review of explainability in multimodal attention models finds most studies focus on vision-language tasks with attention-based explanations, but evaluation methods lack consistency and modality-specific considerations.

arXiv preprint arXiv:1908.07490 , year=

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer