17 Pith papers cite this work.
citing papers explorer
- UCSF-PDGM-VQA: Visual Question Answering dataset for brain tumor MRI interpretation
Introduces the UCSF-PDGM-VQA dataset of 2387 QA pairs from 473 glioma MRI studies and demonstrates that state-of-the-art VLMs exhibit modality collapse on multi-sequence 3D medical images.
- Surprisal Minimisation over Goal-directed Alternatives Predicts Production Choice in Dialogue
Surprisal minimization over goal-directed alternatives generated by language models provides the strongest account of production choices in open-ended dialogue compared to uniform information density or length-based costs.
- Structure Guided Retrieval-Augmented Generation for Factual Queries
SG-RAG frames retrieval as subgraph matching to ensure LLMs meet every condition in factual queries and reports large gains over baselines on a new 120k-pair ERQA dataset.
- When Vision-Language Models Judge Without Seeing: Exposing Informativeness Bias
VLMs as judges exhibit informativeness bias by favoring detailed but image-inconsistent answers; BIRCH mitigates it by first correcting answers against the image, reducing bias up to 17% and improving performance up to 9.8%.
- SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution
SWE-RL uses RL on software evolution data to train LLMs achieving 41% on SWE-bench Verified with generalization to other reasoning tasks.
- The Illusion of Intervention: Your LLM-Simulated Experiment is an Observational Study
Interventions in LLM-simulated user experiments induce distribution shifts in latent attributes that create confounding bias, diagnosable with negative control outcomes and partially mitigated by adding setting-relevant persona details.
- DMN: A Compositional Framework for Jailbreaking Multimodal LLMs with Multi-Image Inputs
DMN achieves over 90% attack success rate on GPT-4o, Gemini-2.5-pro and Claude Sonnet 4 by distributing instructions, supplying multimodal evidence, and adding number chain tasks across multiple images.
- Prefix-Adaptive Block Diffusion for Efficient Document Recognition
PA-BDM adapts block diffusion by switching to causal intra-block denoising and dynamically committing reliable prefixes to KV cache, yielding higher accuracy and 71.6% higher throughput than a comparable baseline on document benchmarks.
- When Audio-Language Models Fail to Leverage Multimodal Context for Dysarthric Speech Recognition
Current audio-language models fail to use clinical multimodal context for dysarthric speech recognition, but context-aware LoRA fine-tuning delivers large accuracy gains on the SAP dataset.
- DR-MMSearchAgent: Deepening Reasoning in Multimodal Search Agents
DR-MMSearchAgent derives batch-wide trajectory advantages and uses differentiated Gaussian rewards to prevent premature collapse in multimodal agents, outperforming MMSearch-R1 by 8.4% on FVQA-test.
- PDDL-Mind: Large Language Models are Capable on Belief Reasoning with Reliable State Tracking
PDDL-Mind improves LLM accuracy on theory-of-mind benchmarks by over 5% by translating stories into verifiable PDDL states that decouple environment tracking from belief inference.
- Building a Custom Taxonomy of AI Skills and Tasks from the Ground Up with Job Postings
Filtering job posting data before LLM-assisted clustering and hierarchical labeling yields taxonomies with better AI skill coverage than unfiltered approaches.
- When Tabular Foundation Models Meet Strategic Tabular Data: A Prior Alignment Approach
The paper proposes Strategic Prior-data Fitted Network (SPN), an inference-time method that adapts pretrained tabular foundation models to strategic feature manipulation by constructing aligned in-context examples.
- Traditional statistical representations outperform generative AI in identifying expert peer reviewers
TF-IDF identifies labeled experts in the top 25 recommendations 79.5% of the time versus 51.5% for GPT-4o mini on an astronomy observatory dataset.
- AMARIS: A Memory-Augmented Rubric Improvement System for Rubric-Based Reinforcement Learning
- Bad Seeing or Bad Thinking? Rewarding Perception for Multimodal Reasoning
- R2IF: Aligning Reasoning with Decisions via Composite Rewards for Interpretable LLM Function Calling