Introduces the UCSF-PDGM-VQA dataset of 2387 QA pairs from 473 glioma MRI studies and demonstrates that state-of-the-art VLMs exhibit modality collapse on multi-sequence 3D medical images.
Title resolution pending
17 Pith papers cite this work. Polarity classification is still indexing.
representative citing papers
Surprisal minimization over goal-directed alternatives generated by language models provides the strongest account of production choices in open-ended dialogue compared to uniform information density or length-based costs.
SG-RAG frames retrieval as subgraph matching to ensure LLMs meet every condition in factual queries and reports large gains over baselines on a new 120k-pair ERQA dataset.
VLMs as judges exhibit informativeness bias by favoring detailed but image-inconsistent answers; BIRCH mitigates it by first correcting answers against the image, reducing bias up to 17% and improving performance up to 9.8%.
SWE-RL uses RL on software evolution data to train LLMs achieving 41% on SWE-bench Verified with generalization to other reasoning tasks.
Interventions in LLM-simulated user experiments induce distribution shifts in latent attributes that create confounding bias, diagnosable with negative control outcomes and partially mitigated by adding setting-relevant persona details.
The paper proposes Strategic Prior-data Fitted Network (SPN), an inference-time framework that adapts pretrained tabular foundation models (PFNs) to strategic manipulation by aligning predictions with approximated post-manipulation distributions via strategic in-context examples.
AMARIS augments rubric updates in RL for LLMs with a persistent memory of rollout analyses and prior edits, yielding gains such as +2.8 points on GPQA-Diamond over local-adaptive baselines.
DMN achieves over 90% attack success rate on GPT-4o, Gemini-2.5-pro and Claude Sonnet 4 by distributing instructions, supplying multimodal evidence, and adding number chain tasks across multiple images.
PA-BDM adapts block diffusion by switching to causal intra-block denoising and dynamically committing reliable prefixes to KV cache, yielding higher accuracy and 71.6% higher throughput than a comparable baseline on document benchmarks.
Current audio-language models fail to use clinical multimodal context for dysarthric speech recognition, but context-aware LoRA fine-tuning delivers large accuracy gains on the SAP dataset.
DR-MMSearchAgent derives batch-wide trajectory advantages and uses differentiated Gaussian rewards to prevent premature collapse in multimodal agents, outperforming MMSearch-R1 by 8.4% on FVQA-test.
PDDL-Mind improves LLM accuracy on theory-of-mind benchmarks by over 5% by translating stories into verifiable PDDL states that decouple environment tracking from belief inference.
Filtering job posting data before LLM-assisted clustering and hierarchical labeling yields taxonomies with better AI skill coverage than unfiltered approaches.
TF-IDF identifies labeled experts in the top 25 recommendations 79.5% of the time versus 51.5% for GPT-4o mini on an astronomy observatory dataset.
Proposes Modality-Aware Credit Assignment (MoCA) with blindfolded-reasoning proxy to reward perception fidelity separately from reasoning in VLMs.