In: Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization
7 Pith papers cite this work. Polarity classification is still indexing.
verdicts: UNVERDICTED 7
roles: background 1
polarities: background 1
citing papers explorer
- Lost in Volume: The CT-SpatialVQA Benchmark for Evaluating Semantic-Spatial Understanding of 3D Medical Vision-Language Models
  CT-SpatialVQA benchmark shows 3D medical VLMs achieve only 34% average accuracy on semantic-spatial reasoning tasks in CT volumes, often below random chance.
- GTASA: Ground Truth Annotations for Spatiotemporal Analysis, Evaluation and Training of Video Models
  GTASA supplies annotated multi-actor videos with exact 3D spatial and temporal ground truth that outperforms neural video generators in physical and semantic validity while enabling new probes of video encoders.
- On the Factual Consistency of Text-based Explainable Recommendation Models
  A prompting pipeline and statement-level metrics show that six state-of-the-art text-based explainable recommendation models achieve high semantic similarity but very low factual consistency on Amazon review data.
- Efficient Table QA via TableGrid Navigation and Progressive Inference Prompting
  Introduces TableGrid Navigation (TGN) and Progressive Inference Prompting (PIP) as training-free structured prompting frameworks that improve LLM performance on table question answering over baselines on TableBench and achieve SOTA on FeTaQa.
- Hi-GaTA: Hierarchical Gated Temporal Aggregation Adapter for Surgical Video Report Generation
  Hi-GaTA is a hierarchical gated temporal aggregation adapter that uses short-to-long temporal pyramids and gated fusion to enable surgical video report generation, backed by a new 214-video benchmark and a surgical ViViT pretrained on 40,000 minutes of video.
- Fake or Real, Can Robots Tell? Evaluating VLM Robustness to Domain Shift in Single-View Robotic Scene Understanding
  VLMs caption real objects effectively but degrade on 3D-printed fakes in robotic scenes, while some standard metrics fail to detect the factual errors from this domain shift.
- STAND: Semantic Anchoring Constraint with Dual-Granularity Disambiguation for Remote Sensing Image Change Captioning
  STAND adds semantic anchoring and dual-granularity disambiguation modules to address viewpoint, scale, and knowledge ambiguities in remote sensing change captioning.