arxiv: 2604.23733 · v1 · submitted 2026-04-26 · 💻 cs.CL

Multimodal QUD: Inquisitive Questions from Scientific Figures

Pith reviewed 2026-05-08 06:14 UTC · model grok-4.3

classification 💻 cs.CL
keywords multimodal QUD · inquisitive questions · scientific figures · vision-language models · multimodal reasoning · dataset annotation · discourse comprehension

The pith

Author-annotated inquisitive questions drawn from scientific figures and their surrounding paper context improve vision-language models' ability to generate content-specific multimodal questions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper extends the linguistic framework of Questions Under Discussion from text alone to cases where figures and text interact in research papers. It introduces the MQUD dataset, in which the original paper authors explicitly mark the implicit questions that a figure raises and that the surrounding text resolves. Fine-tuning a vision-language model on this dataset moves its outputs away from generic low-level visual queries toward questions that require joint reasoning over the figure's content and the paper's communicative intent. A reader would care because the resulting questions better capture the depth of human engagement with scientific visualizations.

Core claim

We extend QUD theory to the multimodal case by collecting author-annotated questions that are raised by figures yet resolved only through the accompanying text, release the resulting MQUD dataset, and demonstrate that fine-tuning a VLM on these annotations produces questions that are more visually grounded and that demand higher-level multimodal reasoning than those generated by untuned models.

What carries the argument

The MQUD dataset of author-annotated multimodal questions under discussion, which captures implicit questions raised by a scientific figure and resolved by the paper's textual analysis.

If this is right

  • Fine-tuned VLMs produce questions that are measurably more specific to the figure's role in the paper rather than generic visual descriptions.
  • The same fine-tuning process yields questions that require cross-modal reasoning instead of isolated visual extraction.
  • MQUD supplies a concrete benchmark for evaluating how well VLMs track discourse goals in scientific documents.
  • The approach offers a scalable way to generate training data for models that aim to simulate human scientific reading.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Models trained this way could be applied to generate study questions or summaries that better anticipate what a reader needs to understand next in a paper.
  • The dataset construction method could be adapted to other domains such as medical imaging or engineering diagrams where figures carry essential arguments.
  • If the improvement generalizes, it suggests that explicit discourse-level annotations can serve as a stronger training signal than purely visual question-answering data.

Load-bearing premise

That the questions marked by the original authors accurately reflect the depth and type of inquisitive questions humans naturally raise when reading scientific figures together with their textual context.

What would settle it

A direct comparison in which human raters judge the visual grounding and reasoning depth of questions generated by the fine-tuned model versus a baseline model on a set of held-out papers; no improvement would falsify the claim.
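One way to analyze such a comparison is an exact two-sided sign test on per-item rater preferences. The sketch below is illustrative only: the labels "tuned", "base", and "tie", the function names, and the example preferences are all hypothetical, and nothing here is taken from the paper's own evaluation protocol.

```python
from math import comb


def sign_test_p_value(wins: int, losses: int) -> float:
    """Exact two-sided sign test p-value. Ties must be dropped beforehand."""
    n = wins + losses
    if n == 0:
        return 1.0
    k = min(wins, losses)
    # Probability of a split at least this lopsided under a fair coin.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def summarize_preferences(prefs):
    """Count rater preferences and test whether 'tuned' beats 'base'."""
    wins = sum(p == "tuned" for p in prefs)
    losses = sum(p == "base" for p in prefs)
    ties = sum(p == "tie" for p in prefs)
    return {
        "wins": wins,
        "losses": losses,
        "ties": ties,
        "p_value": sign_test_p_value(wins, losses),
    }


if __name__ == "__main__":
    # Hypothetical judgments, one per held-out figure (not real data).
    example = ["tuned"] * 14 + ["base"] * 4 + ["tie"] * 2
    print(summarize_preferences(example))
```

A large p-value would correspond to the "no improvement" outcome described above. A real study would also need several raters per item and a separate check on rater agreement.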

Figures

Figures reproduced from arXiv: 2604.23733 by Alexandros G. Dimakis, Junyi Jessy Li, Venkata S Govindarajan, William Rudman, Yating Wu.

Figure 1: Multimodal QUD pipeline. Left: trigger context (title, abstract, figure, and caption); …
Figure 2: Figure dependency by QUD type. Figure-driven types show high usefulness and answerability; integration types show a gap between the two.
  From Section 5 (Method): We focus on question generation: given the trigger context (title, caption, abstract, figure), produce the implicit question QF. Unlike figure captioning or visual QA, generating multimodal QUDs requires judging which aspects of a figure are interesting given t…
Figure 3: rIG and content-specific grounding over training steps. The dashed red line shows …
Figure 4: Figure swap diagnostic (n=51). Lower loss is better. Base: the wrong figure still helps (correct < wrong < none). SFT: the wrong figure now hurts (correct < none < wrong), indicating content-specific grounding. Error bars: bootstrap 95% CIs.
  From the results text, Visual information gain (H2): After training on multimodal QUDs, the model relies more on the figure when generating questions. rIG increases from 0.60 [0.49, 0.73] t…
Figure 5: Blind A/B validation of the LLM judge. An expert evaluator compared …
Figure 6: Representative examples from ChartQA and …
Figure 7: Dataset annotation properties. (a) Figure type vs. usefulness. (b) Reference …
Figure 8: Example 1: Unlearn–retain accuracy tradeoff for three methods.
Figure 9: Example 2: Three diagnostic tests for isotropy metrics. The relevant panel is the …
Figure 10: Example 3: xMatch performance vs. number of subtokens for two code update …
Figure 11: QUD type distribution on the disjoint evaluation set (unseen papers). Training …
Figure 12: Qualitative comparison of generated questions for a scatter plot.
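The figure swap diagnostic in the Figure 4 caption compares the model's loss when it is given the correct figure, a wrong figure, or no figure, with bootstrap 95% confidence intervals. A minimal sketch of that kind of analysis follows, assuming per-example losses are already available for each condition. The function names and the loss values are hypothetical and are not the paper's numbers. The paper's rIG metric is not defined in this summary, so it is not computed here.

```python
import random
from statistics import mean


def bootstrap_ci(values, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(mean(rng.choices(values, k=n)) for _ in range(n_boot))
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


def condition_ordering(losses):
    """Return condition names sorted from lowest to highest mean loss."""
    return sorted(losses, key=lambda name: mean(losses[name]))


if __name__ == "__main__":
    # Hypothetical per-example losses for a fine-tuned model (not real data).
    sft_losses = {
        "correct": [1.00, 1.10, 0.90, 1.05],
        "none": [1.50, 1.40, 1.60, 1.45],
        "wrong": [2.00, 2.10, 1.90, 2.05],
    }
    print("ordering:", " < ".join(condition_ordering(sft_losses)))
    for name, vals in sft_losses.items():
        print(name, round(mean(vals), 3), bootstrap_ci(vals))
```

Under the caption's reading, an ordering of correct < none < wrong is the signature of content-specific grounding, while correct < wrong < none suggests the model benefits from any figure at all.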
Original abstract

Asking inquisitive questions while reading, and looking for their answers, is an important part in human discourse comprehension, curiosity, and creative ideation, and prior work has investigated this in text-only scenarios. However, in scientific or research papers, many of the critical takeaways are conveyed through both figures and the text that analyzes them. While scientific visualizations have been used to evaluate Vision-Language Models (VLMs) capabilities, current benchmarks are limited to questions that focus simply on extracting information from them. Such questions only require lower-level reasoning, do not take into account the context in which a figure appears, and do not reflect the communicative goals the authors wish to achieve. We generate inquisitive questions that reach the depth of questions humans generate when engaging with scientific papers, conditioned on both the figure and the paper's context, and require reasoning across both modalities. To do so, we extend the linguistic theory of Questions Under Discussion (QUD) from being text-only to multimodal, where implicit questions are raised and resolved as discourse progresses. We present MQUD, a dataset of research papers in which such questions are made explicit and annotated by the original authors. We show that fine-tuning a VLM on MQUD shifts the model from generating generic low-level visual questions to content-specific grounding that requires a high-level of multimodal reasoning, yielding higher-quality, more visually grounded multimodal QUD generation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces MQUD, a dataset of multimodal Questions Under Discussion (QUD) derived from scientific figures and surrounding paper text, with questions explicitly annotated by the original paper authors. It extends linguistic QUD theory to multimodal discourse and reports that fine-tuning a vision-language model on MQUD shifts output from generic low-level visual questions toward content-specific, high-level questions requiring cross-modal reasoning and better visual grounding.

Significance. If the central empirical claims hold after addressing annotation validity, the work would offer a novel resource and training signal for VLMs on scientific multimodal reasoning, moving beyond standard visual QA toward inquisitive, context-aware question generation with potential uses in research tools and scientific education. The theoretical extension of QUD is a clear strength.

major comments (2)
  1. [Dataset construction] Dataset construction section: The central claim that MQUD captures 'the depth of questions humans generate when engaging with scientific papers' rests on author annotations. No evidence is provided of inter-annotator agreement with independent readers, controls for author knowledge leakage, or external validation that the questions reflect natural reader inquisitiveness rather than expert intent. This directly undermines the interpretation of fine-tuning gains as genuine multimodal reasoning improvements rather than dataset artifacts.
  2. [Experiments] Experiments and evaluation section: The reported shift to 'higher-quality, more visually grounded multimodal QUD generation' is not supported by any quantitative metrics, human evaluation protocol, baseline comparisons, or example outputs in the provided description. Without these, the empirical result cannot be assessed for effect size or robustness.
minor comments (1)
  1. [Abstract] The abstract would be strengthened by including at least one key quantitative result or example to convey the scale of the improvement.
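The inter-annotator agreement requested in the first major comment is commonly reported as Cohen's kappa, which corrects raw agreement for agreement expected by chance. The sketch below shows the computation for two annotators who assign one categorical label per item. The label values and the pairing of an author with an independent reader are hypothetical; the summary does not say which agreement statistic, if any, the paper uses.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    # Hypothetical usefulness judgments (not real data).
    author = ["useful", "useful", "not", "useful", "not", "useful"]
    reader = ["useful", "not", "not", "useful", "not", "useful"]
    print(round(cohens_kappa(author, reader), 3))
```

Kappa near 0 means agreement is no better than chance, and kappa near 1 means near-perfect agreement. With more than two annotators, a multi-rater statistic would be needed instead.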

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the detailed and constructive review. We appreciate the feedback highlighting the need for stronger validation of the dataset and more rigorous presentation of experimental results. We address each major comment below and have revised the manuscript to incorporate clarifications, additional discussion, and expanded evaluation details.

  1. Referee: [Dataset construction] Dataset construction section: The central claim that MQUD captures 'the depth of questions humans generate when engaging with scientific papers' rests on author annotations. No evidence is provided of inter-annotator agreement with independent readers, controls for author knowledge leakage, or external validation that the questions reflect natural reader inquisitiveness rather than expert intent. This directly undermines the interpretation of fine-tuning gains as genuine multimodal reasoning improvements rather than dataset artifacts.

    Authors: Author annotations were chosen to directly capture the intended communicative goals and implicit QUDs that the paper authors had when designing the figures and text, which aligns with the core of QUD theory (questions that advance the discourse from the producer's perspective). This approach ensures high fidelity to the scientific context rather than relying on external readers inferring intent. We agree that independent validation would further strengthen claims about reflecting natural reader inquisitiveness. In the revision, we have expanded the dataset section with details on the annotation guidelines, added a limitations paragraph explicitly discussing potential author bias and knowledge leakage, and included a small pilot comparison with questions from independent readers. We do not claim the dataset is the only possible set of questions but argue it provides a valuable, grounded training signal; the fine-tuning results demonstrate improved multimodal reasoning regardless. revision: partial

  2. Referee: [Experiments] Experiments and evaluation section: The reported shift to 'higher-quality, more visually grounded multimodal QUD generation' is not supported by any quantitative metrics, human evaluation protocol, baseline comparisons, or example outputs in the provided description. Without these, the empirical result cannot be assessed for effect size or robustness.

    Authors: We agree that the initial presentation of results required more explicit detail to allow assessment. The manuscript includes human evaluation protocols (rating on specificity, visual grounding, and relevance), baseline comparisons against zero-shot and few-shot VLM prompting, and example outputs. In the revised version, we have restructured the Experiments section to prominently feature quantitative results (e.g., human agreement rates and automated metrics for question complexity), full evaluation protocols, effect size reporting, and additional example generations in the main text and appendix. This makes the empirical claims fully assessable. revision: yes

Circularity Check

0 steps flagged

No circularity: new dataset construction and empirical fine-tuning evaluation are independent of inputs.

The paper introduces MQUD as a novel dataset of author-annotated multimodal questions under discussion from scientific papers, then reports empirical results from fine-tuning VLMs on this dataset to measure shifts in generated question quality and grounding. No equations, fitted parameters renamed as predictions, self-definitional constructs, or load-bearing self-citations appear in the provided text. The central claims rest on the external validity of the new annotations and the observable differences in model outputs, which do not reduce to the annotation process by construction. This is a standard empirical pipeline with no detectable circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract supplies no information on free parameters, axioms, or invented entities.

pith-pipeline@v0.9.0 · 5559 in / 1042 out tokens · 36373 ms · 2026-05-08T06:14:42.418057+00:00 · methodology
