Introduces ViTextCaps dataset and PhonoSTFG phonological graph fusion framework for Vietnamese scene-text image captioning, showing cross-modal graph edges harm performance.
V2VNet: Vehicle-to-vehicle communication for joint perception and prediction,
6 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
roles
background 2polarities
background 2representative citing papers
Hyper-V2X uses a Bayesian hypernetwork with partial weight generation and V2X context embedding to produce calibrated epistemic and aleatoric uncertainty estimates for multi-agent BEV segmentation on the OPV2V benchmark.
Complex multimodal architectures do not reliably outperform unimodal baselines or a simple multimodal baseline under standardized evaluation.
Semantic-aware random convolution and intensity-based source matching enable effective single-source domain generalization for medical image segmentation, outperforming prior methods and sometimes matching in-domain performance.
MVDream is a multi-view diffusion model that functions as a generalizable 3D prior, enabling more consistent text-to-3D generation and few-shot 3D concept learning from 2D examples.
PaliGemma is an open 3B VLM based on SigLIP and Gemma that achieves strong performance on nearly 40 diverse open-world tasks including benchmarks, remote-sensing, and segmentation.
citing papers explorer
-
Linguistically Informed Multimodal Fusion for Vietnamese Scene-Text Image Captioning: Dataset, Graph Framework, and Phonological Attention
Introduces ViTextCaps dataset and PhonoSTFG phonological graph fusion framework for Vietnamese scene-text image captioning, showing cross-modal graph edges harm performance.
-
Hyper-V2X: Hypernetworks for Estimating Epistemic and Aleatoric Uncertainty in Cooperative Bird's-Eye-View Semantic Segmentation
Hyper-V2X uses a Bayesian hypernetwork with partial weight generation and V2X context embedding to produce calibrated epistemic and aleatoric uncertainty estimates for multi-agent BEV segmentation on the OPV2V benchmark.
-
Fusion or Confusion? Multimodal Complexity Is Not All You Need
Complex multimodal architectures do not reliably outperform unimodal baselines or a simple multimodal baseline under standardized evaluation.
-
Semantic-aware Random Convolution and Source Matching for Domain Generalization in Medical Image Segmentation
Semantic-aware random convolution and intensity-based source matching enable effective single-source domain generalization for medical image segmentation, outperforming prior methods and sometimes matching in-domain performance.
-
MVDream: Multi-view Diffusion for 3D Generation
MVDream is a multi-view diffusion model that functions as a generalizable 3D prior, enabling more consistent text-to-3D generation and few-shot 3D concept learning from 2D examples.
-
PaliGemma: A versatile 3B VLM for transfer
PaliGemma is an open 3B VLM based on SigLIP and Gemma that achieves strong performance on nearly 40 diverse open-world tasks including benchmarks, remote-sensing, and segmentation.