Multimodal QUD: Inquisitive Questions from Scientific Figures

Alexandros G. Dimakis; Junyi Jessy Li; Venkata S Govindarajan; William Rudman; Yating Wu

arXiv: 2604.23733 · v1 · submitted 2026-04-26 · cs.CL


Pith reviewed 2026-05-08 06:14 UTC · model grok-4.3


keywords: multimodal QUD · inquisitive questions · scientific figures · vision-language models · multimodal reasoning · dataset annotation · discourse comprehension


The pith

Author-annotated inquisitive questions drawn from scientific figures and their surrounding paper context improve vision-language models' ability to generate content-specific multimodal questions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper extends the linguistic framework of Questions Under Discussion from text alone to cases where figures and text interact in research papers. It introduces the MQUD dataset, in which the original paper authors explicitly mark the implicit questions that a figure raises and that the surrounding text resolves. Fine-tuning a vision-language model on this dataset moves its outputs away from generic low-level visual queries toward questions that require joint reasoning over the figure's content and the paper's communicative intent. A reader would care because the resulting questions better capture the depth of human engagement with scientific visualizations.

Core claim

We extend QUD theory to the multimodal case by collecting author-annotated questions that are raised by figures yet resolved only through the accompanying text, release the resulting MQUD dataset, and demonstrate that fine-tuning a VLM on these annotations produces questions that are more visually grounded and that demand higher-level multimodal reasoning than those generated by untuned models.

What carries the argument

The MQUD dataset of author-annotated multimodal questions under discussion, which captures implicit questions raised by a scientific figure and resolved by the paper's textual analysis.

If this is right

Fine-tuned VLMs produce questions that are measurably more specific to the figure's role in the paper rather than generic visual descriptions.
The same fine-tuning process yields questions that require cross-modal reasoning instead of isolated visual extraction.
MQUD supplies a concrete benchmark for evaluating how well VLMs track discourse goals in scientific documents.
The approach offers a scalable way to generate training data for models that aim to simulate human scientific reading.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Models trained this way could be applied to generate study questions or summaries that better anticipate what a reader needs to understand next in a paper.
The dataset construction method could be adapted to other domains such as medical imaging or engineering diagrams where figures carry essential arguments.
If the improvement generalizes, it suggests that explicit discourse-level annotations can serve as a stronger training signal than purely visual question-answering data.

Load-bearing premise

That the questions marked by the original authors accurately reflect the depth and type of inquisitive questions humans naturally raise when reading scientific figures together with their textual context.

What would settle it

A direct comparison in which human raters judge the visual grounding and reasoning depth of questions generated by the fine-tuned model versus a baseline model on a set of held-out papers; no improvement would falsify the claim.
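A minimal sketch of how such a rater comparison could be scored, assuming each rater sees one question from the fine-tuned model and one from the baseline for the same figure and records a preference. The labels `"sft"`, `"base"`, and `"tie"` and both function names are hypothetical, and the exact sign test is a generic choice rather than a protocol taken from the paper.

```python
from math import comb


def tally_preferences(judgments):
    """Count rater preferences; each judgment is 'sft', 'base', or 'tie'."""
    wins = sum(1 for j in judgments if j == "sft")
    losses = sum(1 for j in judgments if j == "base")
    ties = sum(1 for j in judgments if j == "tie")
    return wins, losses, ties


def sign_test_p_value(wins, losses):
    """Two-sided exact sign test on non-tied pairs (null: no preference)."""
    n = wins + losses
    if n == 0:
        return 1.0  # nothing but ties: no evidence either way
    k = max(wins, losses)
    # Probability of a split at least this lopsided under a fair coin.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)


if __name__ == "__main__":
    # Hypothetical judgments, for illustration only.
    judgments = ["sft"] * 8 + ["base"] * 2 + ["tie"] * 3
    wins, losses, ties = tally_preferences(judgments)
    print(wins, losses, ties, sign_test_p_value(wins, losses))
```

Ties are dropped before testing, which is the usual convention for a sign test; a large p-value on held-out papers would correspond to the "no improvement" outcome described above.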

Figures

Figures reproduced from arXiv: 2604.23733 by Alexandros G. Dimakis, Junyi Jessy Li, Venkata S Govindarajan, William Rudman, Yating Wu.

**Figure 1.** Multimodal QUD pipeline. Left: trigger context (title, abstract, figure, and caption); …

**Figure 2.** Figure dependency by QUD type. Figure-driven types show high usefulness and answerability; integration types show a gap between the two. Adjacent text (Section 5, Method): We focus on question generation: given the trigger context (title, caption, abstract, figure), produce the implicit question QF. Unlike figure captioning or visual QA, generating multimodal QUDs requires judging which aspects of a figure are interesting given t…

**Figure 3.** rIG and content-specific grounding over training steps. The dashed red line shows …

**Figure 4.** Figure swap diagnostic (n=51). Lower loss is better. Base: the wrong figure still helps (correct < wrong < none). SFT: the wrong figure now hurts (correct < none < wrong), indicating content-specific grounding. Error bars: bootstrap 95% CIs. Adjacent text: Visual information gain (H2). After training on multimodal QUDs, the model relies more on the figure when generating questions. rIG increases from 0.60 [0.49, 0.73] t…

**Figure 5.** Blind A/B validation of the LLM judge. An expert evaluator compared …

**Figure 6.** Representative examples from ChartQA and …

**Figure 7.** Dataset annotation properties. (a) Figure type vs. usefulness. (b) Reference …

**Figure 8.** Example 1: Unlearn–retain accuracy tradeoff for three methods.

**Figure 9.** Example 2: Three diagnostic tests for isotropy metrics. The relevant panel is the …

**Figure 10.** Example 3: xMatch performance vs. number of subtokens for two code update …

**Figure 11.** QUD type distribution on the disjoint evaluation set (unseen papers). Training …

**Figure 12.** Qualitative comparison of generated questions for a scatter plot.
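The figure swap diagnostic in Figure 4 compares loss under three conditions (correct figure, wrong figure, no figure) and reports bootstrap 95% CIs. The sketch below illustrates that kind of analysis under stated assumptions: the per-example losses are invented placeholders, `bootstrap_ci` and `swap_ordering` are hypothetical helper names, and a plain percentile bootstrap stands in for whatever resampling scheme the paper uses. It does not compute rIG, whose definition is not given here.

```python
import random
from statistics import mean


def bootstrap_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(mean(rng.choices(values, k=n)) for _ in range(n_resamples))
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


def swap_ordering(loss_correct, loss_wrong, loss_none):
    """Return the condition names sorted from lowest to highest mean loss."""
    averages = {
        "correct": mean(loss_correct),
        "wrong": mean(loss_wrong),
        "none": mean(loss_none),
    }
    return sorted(averages, key=averages.get)


if __name__ == "__main__":
    # Placeholder losses, not values from the paper.
    correct = [1.0, 1.1, 0.9, 1.2]
    wrong = [1.8, 1.9, 1.7, 2.0]
    none = [1.4, 1.5, 1.3, 1.6]
    print(swap_ordering(correct, wrong, none))
    print(bootstrap_ci(correct))
```

With these placeholder numbers the ordering is correct < none < wrong, the pattern the caption associates with content-specific grounding; an ordering of correct < wrong < none would match the pattern reported for the base model.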

Original abstract

Asking inquisitive questions while reading, and looking for their answers, is an important part in human discourse comprehension, curiosity, and creative ideation, and prior work has investigated this in text-only scenarios. However, in scientific or research papers, many of the critical takeaways are conveyed through both figures and the text that analyzes them. While scientific visualizations have been used to evaluate Vision-Language Models (VLMs) capabilities, current benchmarks are limited to questions that focus simply on extracting information from them. Such questions only require lower-level reasoning, do not take into account the context in which a figure appears, and do not reflect the communicative goals the authors wish to achieve. We generate inquisitive questions that reach the depth of questions humans generate when engaging with scientific papers, conditioned on both the figure and the paper's context, and require reasoning across both modalities. To do so, we extend the linguistic theory of Questions Under Discussion (QUD) from being text-only to multimodal, where implicit questions are raised and resolved as discourse progresses. We present MQUD, a dataset of research papers in which such questions are made explicit and annotated by the original authors. We show that fine-tuning a VLM on MQUD shifts the model from generating generic low-level visual questions to content-specific grounding that requires a high-level of multimodal reasoning, yielding higher-quality, more visually grounded multimodal QUD generation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper creates an author-annotated MQUD dataset to extend QUD theory to scientific figures and shows that fine-tuning shifts VLM output toward more context-aware questions, but it lacks external checks on whether those questions match the ones independent readers would raise.


The paper's main move is to build MQUD, a collection of inquisitive questions tied to figures in research papers, annotated by the papers' own authors under a multimodal version of Questions Under Discussion. They then fine-tune a VLM on this data and report that the model stops producing generic visual queries and starts generating questions that draw on both the figure and the surrounding text for higher-level grounding.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces MQUD, a dataset of multimodal Questions Under Discussion (QUD) derived from scientific figures and surrounding paper text, with questions explicitly annotated by the original paper authors. It extends linguistic QUD theory to multimodal discourse and reports that fine-tuning a vision-language model on MQUD shifts output from generic low-level visual questions toward content-specific, high-level questions requiring cross-modal reasoning and better visual grounding.

Significance. If the central empirical claims hold after addressing annotation validity, the work would offer a novel resource and training signal for VLMs on scientific multimodal reasoning, moving beyond standard visual QA toward inquisitive, context-aware question generation with potential uses in research tools and scientific education. The theoretical extension of QUD is a clear strength.

major comments (2)

[Dataset construction] Dataset construction section: The central claim that MQUD captures 'the depth of questions humans generate when engaging with scientific papers' rests on author annotations. No evidence is provided of inter-annotator agreement with independent readers, controls for author knowledge leakage, or external validation that the questions reflect natural reader inquisitiveness rather than expert intent. This directly undermines the interpretation of fine-tuning gains as genuine multimodal reasoning improvements rather than dataset artifacts.
[Experiments] Experiments and evaluation section: The reported shift to 'higher-quality, more visually grounded multimodal QUD generation' is not supported by any quantitative metrics, human evaluation protocol, baseline comparisons, or example outputs in the provided description. Without these, the empirical result cannot be assessed for effect size or robustness.

minor comments (1)

[Abstract] The abstract would be strengthened by including at least one key quantitative result or example to convey the scale of the improvement.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the detailed and constructive review. We appreciate the feedback highlighting the need for stronger validation of the dataset and more rigorous presentation of experimental results. We address each major comment below and have revised the manuscript to incorporate clarifications, additional discussion, and expanded evaluation details.


Referee: [Dataset construction] Dataset construction section: The central claim that MQUD captures 'the depth of questions humans generate when engaging with scientific papers' rests on author annotations. No evidence is provided of inter-annotator agreement with independent readers, controls for author knowledge leakage, or external validation that the questions reflect natural reader inquisitiveness rather than expert intent. This directly undermines the interpretation of fine-tuning gains as genuine multimodal reasoning improvements rather than dataset artifacts.

Authors: Author annotations were chosen to directly capture the intended communicative goals and implicit QUDs that the paper authors had when designing the figures and text, which aligns with the core of QUD theory (questions that advance the discourse from the producer's perspective). This approach ensures high fidelity to the scientific context rather than relying on external readers inferring intent. We agree that independent validation would further strengthen claims about reflecting natural reader inquisitiveness. In the revision, we have expanded the dataset section with details on the annotation guidelines, added a limitations paragraph explicitly discussing potential author bias and knowledge leakage, and included a small pilot comparison with questions from independent readers. We do not claim the dataset is the only possible set of questions but argue it provides a valuable, grounded training signal; the fine-tuning results demonstrate improved multimodal reasoning regardless.

revision: partial

Referee: [Experiments] Experiments and evaluation section: The reported shift to 'higher-quality, more visually grounded multimodal QUD generation' is not supported by any quantitative metrics, human evaluation protocol, baseline comparisons, or example outputs in the provided description. Without these, the empirical result cannot be assessed for effect size or robustness.

Authors: We agree that the initial presentation of results required more explicit detail to allow assessment. The manuscript includes human evaluation protocols (rating on specificity, visual grounding, and relevance), baseline comparisons against zero-shot and few-shot VLM prompting, and example outputs. In the revised version, we have restructured the Experiments section to prominently feature quantitative results (e.g., human agreement rates and automated metrics for question complexity), full evaluation protocols, effect size reporting, and additional example generations in the main text and appendix. This makes the empirical claims fully assessable.

revision: yes
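Both exchanges turn on agreement between raters: inter-annotator agreement with independent readers in the first, human agreement rates in the second. As an illustration only, the sketch below computes Cohen's kappa for two raters who label the same items. The function name and the example labels are hypothetical, and nothing here indicates which agreement statistic the authors actually report.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    # Hypothetical "is this question grounded?" labels from two raters.
    author = ["yes", "yes", "no", "no"]
    reader = ["yes", "no", "no", "no"]
    print(cohens_kappa(author, reader))
```

Kappa corrects raw agreement for the agreement expected by chance, so it is more informative than a bare agreement rate when one label dominates.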

Circularity Check

0 steps flagged

No circularity: new dataset construction and empirical fine-tuning evaluation are independent of inputs.


The paper introduces MQUD as a novel dataset of author-annotated multimodal questions under discussion from scientific papers, then reports empirical results from fine-tuning VLMs on this dataset to measure shifts in generated question quality and grounding. No equations, fitted parameters renamed as predictions, self-definitional constructs, or load-bearing self-citations appear in the provided text. The central claims rest on the external validity of the new annotations and the observable differences in model outputs, which do not reduce to the annotation process by construction. This is a standard empirical pipeline with no detectable circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract supplies no information on free parameters, axioms, or invented entities.


Reference graph

Works this paper leans on

62 extracted references · 62 canonical work pages

[1]

Express genuine curiosity (why, how, what)

Questions arise from viewing the figure:Reference what isvisible(trends, differences, patterns, values, annotations). Express genuine curiosity (why, how, what). Cannot be answered from the caption alone

work page
[2]

Should provide insight, not just restate the question

Answers from paper text:2–4 substantive sentences providing interpretation, cause, or context. Should provide insight, not just restate the question

work page
[3]

Match the paper’s sophisti- cation level

Natural research language:Write from a researcher’s perspective. Match the paper’s sophisti- cation level. Vary question structures naturally

work page
[4]

Why does the accuracy drop sharply after 100 tokens in the left panel?

Specific to this figure:Reference concrete visual elements (lines, bars, regions, numerical values). Use terms from the provided paragraphs, not generic placeholders. Question type examples(good✓and bad ×): •Cause:✓“Why does the accuracy drop sharply after 100 tokens in the left panel?” • Comparison: ✓ “How does the baseline’s behavior differ from the pro...

work page
[5]

Every claim in the answer must be traceable to the caption and/or source text

work page
[6]

can be identified by looking at the figure

The answer must be concrete and informative — non-answers like “can be identified by looking at the figure” arenotgrounded. Figure caption:{caption}. Source text:{source text}. Question:{question}. Answer:{answer}. Output:JSON withgrounded(boolean) andreason(brief explanation). G.4 Zero-shot LLM judge We evaluate all 1,250 QUDs using a zero-shot LLM judge...

work page 2001
[7]

Farias and Jonathan C

Juan P . Farias and Jonathan C. Tan. On the Formation of Runaway Stars BN and x in the Orion Nebula Cluster.Astronomy and Astrophysics, 2018

work page 2018
[8]

Decomposing Generalization: Models of Generic, Habitual, and Episodic Statements.Trans- actions of the Association for Computational Linguistics, 2019

Venkata Subrahmanyan Govindarajan, Benjamin Van Durme, and Aaron Steven White. Decomposing Generalization: Models of Generic, Habitual, and Episodic Statements.Trans- actions of the Association for Computational Linguistics, 2019

work page 2019
[9]

Farias, Jonathan C

Juan P . Farias, Jonathan C. Tan, and Laurent Eyer. Hunting for Runaways from the Orion Nebula Cluster.The Astrophysical Journal, 2020

work page 2020
[10]

Help! Need Advice on Identifying Advice

Venkata Subrahmanyan Govindarajan, Benjamin T Chen, Rebecca Warholic, Katrin Erk, and Junyi Jessy Li. Help! Need Advice on Identifying Advice. InProceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020

work page 2020
[11]

IsoScore: Measuring the Uniformity of Embedding Space Utilization

William Rudman, Nate Gillman, Taylor Rayne, and Carsten Eickhoff. IsoScore: Measuring the Uniformity of Embedding Space Utilization. InFindings of the Association for Computational Linguistics: ACL 2022, 2022

work page 2022
[12]

Aleksey Generozov and Hagai B. Perets. Constraints on the origins of hypervelocity stars: velocity distribution, mergers and star-formation history.Monthly Notices of the Royal Astro- nomical Society, 2022

work page 2022
[13]

Inline Tests

Yu Liu, Pengyu Nie, Owolabi Legunsen, and Milos Gligoric. Inline Tests. InProceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering, 2022

work page 2022
[14]

Beaver, and Junyi Jessy Li

Venkata S Govindarajan, Katherine Atwell, Barea Sinno, Malihe Alikhani, David I. Beaver, and Junyi Jessy Li. How people talk about each other: Modeling Generalized Intergroup Bias and Emotion. InProceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, 2023

work page 2023
[15]

COMPS: Conceptual Minimal Pair Sentences for testing Robust Property Knowledge and its Inheritance in Pre-trained Language Models

Kanishka Misra, Julia Taylor Rayz, and Allyson Ettinger. COMPS: Conceptual Minimal Pair Sentences for testing Robust Property Knowledge and its Inheritance in Pre-trained Language Models. InProceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, 2023

work page 2023
[16]

Aleksey Generozov and Hagai B. Perets. Capture of stars into gaseous discs around massive black holes: Alignment, circularization and growth.Monthly Notices of the Royal Astronomical Society, 2023

work page 2023
[17]

The Best of Both Worlds: Accurate Global and Personalized Models through Federated Learning with Data-Free Hyper-Knowledge Distillation.arXiv preprint, 2023

Huancheng Chen, Johnny Wang, and Haris Vikalo. The Best of Both Worlds: Accurate Global and Personalized Models through Federated Learning with Data-Free Hyper-Knowledge Distillation.arXiv preprint, 2023

work page 2023
[18]

Exploring the Observability of Surviving Companions of Stripped-Envelope Supernovae: A Case Study of Type Ic SN 2020oi.The Astrophysical Journal, 2023

Hsin-Pei Chen, Shiau-Jie Rau, and Kuo-Chuan Pan. Exploring the Observability of Surviving Companions of Stripped-Envelope Supernovae: A Case Study of Type Ic SN 2020oi.The Astrophysical Journal, 2023

work page 2023
[19]

Elaborative Simplification as Implicit Questions Under Discussion

Yating Wu, William Sheffield, Kyle Mahowald, and Junyi Jessy Li. Elaborative Simplification as Implicit Questions Under Discussion. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023

work page 2023
[20]

pytest-inline: An Inline Testing Tool for Python

Yu Liu, Zachary Thurston, Alan Han, Pengyu Nie, Milos Gligoric, and Owolabi Legunsen. pytest-inline: An Inline Testing Tool for Python. In2023 IEEE/ACM 45th International Conference on Software Engineering: Companion Proceedings (ICSE-Companion), 2023

work page 2023
[21]

Stable Anisotropic Regularization.arXiv preprint, 2023

William Rudman and Carsten Eickhoff. Stable Anisotropic Regularization.arXiv preprint, 2023

work page 2023
[22]

ReFACT: Updating Text-to-Image Models by Editing the Text Encoder

Dana Arad, Hadas Orgad, and Yonatan Belinkov. ReFACT: Updating Text-to-Image Models by Editing the Text Encoder. InProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), 2024

work page 2024
[23]

One-Versus-Others Attention: Scalable Multimodal Integration for Biomedical Data

Michal Golovanevsky, Eva Schiller, Akira Nair, Eric Han, Ritambhara Singh, and Carsten Eickhoff. One-Versus-Others Attention: Scalable Multimodal Integration for Biomedical Data. InBiocomputing 2025, 2024. 25

work page 2025
[24]

Multilingual Code Co- Evolution Using Large Language Models

Jiyang Zhang, Pengyu Nie, Junyi Jessy Li, and Milos Gligoric. Multilingual Code Co- Evolution Using Large Language Models. InProceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2023

work page 2023
[25]

Farias, Stella S

Juan P . Farias, Stella S. R. Offner, Michael Y. Grudi´c, D´avid Guszejnov, and Anna L. Rosen. Stellar Populations in STARFORGE: The Origin and Evolution of Star Clusters and Associa- tions.Monthly Notices of the Royal Astronomical Society, 2023

work page 2023
[26]

Heterogeneity-Guided Client Sampling: Towards Fast and Efficient Non-IID Federated Learning

Huancheng Chen and Haris Vikalo. Heterogeneity-Guided Client Sampling: Towards Fast and Efficient Non-IID Federated Learning. InAdvances in Neural Information Processing Systems 37, 2024

work page 2024
[27]

QUDEVAL: The Evaluation of Questions Under Discussion Discourse Parsing

Yating Wu, Ritika Mangla, Greg Durrett, and Junyi Jessy Li. QUDEVAL: The Evaluation of Questions Under Discussion Discourse Parsing. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023

work page 2023
[28]

Mixed-Precision Quantization for Federated Learning on Resource-Constrained Heterogeneous Devices

Huancheng Chen and Haris Vikalo. Mixed-Precision Quantization for Federated Learning on Resource-Constrained Heterogeneous Devices. In2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

work page 2024
[29]

Generozov and H

A. Generozov and H. B. Perets. A Triple Scenario for the Formation of Wide Black Hole Binaries Such As Gaia BH1.The Astrophysical Journal, 2024

work page 2024
[30]

Dimakis, Greg Durrett, and Junyi Jessy Li

Yating Wu, Ritika Mangla, Alexandros G. Dimakis, Greg Durrett, and Junyi Jessy Li. Which questions should I answer? Salience Prediction of Inquisitive Questions. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024

work page 2024
[31]

Recovering Labels from Local Updates in Federated Learning.arXiv preprint, 2024

Huancheng Chen and Haris Vikalo. Recovering Labels from Local Updates in Federated Learning.arXiv preprint, 2024

work page 2024
[32]

La Cava, and Danielle S

Shan Chen, Jack Gallifant, Mingye Gao, Pedro Moreira, Nikolaj Munch, Ajay Muthukkumar, Arvind Rajan, Jaya Kolluri, Amelia Fiske, Janna Hastings, Hugo Aerts, Brian Anthony, Leo Anthony Celi, William G. La Cava, and Danielle S. Bitterman. Cross-Care: Assessing the Healthcare Implications of Pre-training Data on Language Model Bias. InAdvances in Neural Info...

work page 2024
[33]

exLong: Generating Exceptional Behavior Tests with Large Language Models.arXiv preprint, 2024

Jiyang Zhang, Yu Liu, Pengyu Nie, Junyi Jessy Li, and Milos Gligoric. exLong: Generating Exceptional Behavior Tests with Large Language Models.arXiv preprint, 2024

work page 2024
[34]

What Do VLMs NOTICE? A Mechanistic Interpretability Pipeline for Gaussian- Noise-free Text-Image Corruption and Evaluation.arXiv preprint, 2024

Michal Golovanevsky, William Rudman, Vedant Palit, Ritambhara Singh, and Carsten Eickhoff. What Do VLMs NOTICE? A Mechanistic Interpretability Pipeline for Gaussian- Noise-free Text-Image Corruption and Evaluation.arXiv preprint, 2024

work page 2024
[35]

Do they mean ’us’? Interpreting Referring Expressions in Intergroup Bias

Venkata S Govindarajan, Matianyu Zang, Kyle Mahowald, David Beaver, and Junyi Jessy Li. Do they mean ’us’? Interpreting Referring Expressions in Intergroup Bias. InFindings of the Association for Computational Linguistics: EMNLP 2024, 2024

work page 2024
[36]

Kennedy, and John D

Kuleen Sasse, Shinjitha Vadlakonda, Richard E. Kennedy, and John D. Osborne. Disease Entity Recognition and Normalization is Improved with Large Language Model Derived Synthetic Normalized Mentions.arXiv preprint, 2024

work page 2024
[37]

Characterizing the Role of Similarity in the Property Inferences of Language Models.arXiv preprint, 2024

Juan Diego Rodriguez, Aaron Mueller, and Kanishka Misra. Characterizing the Role of Similarity in the Property Inferences of Language Models.arXiv preprint, 2024

work page 2024
[38]

Signatures of black hole seeding in the local Universe: Predic- tions from the BRAHMA cosmological simulations.arXiv preprint, 2024

Aklant K Bhowmick, Laura Blecha, Paul Torrey, Rachel S Somerville, Luke Zoltan Kelley, Rainer Weinberger, Mark Vogelsberger, Lars Hernquist, Priyamvada Natarajan, Jonathan Kho, and Tiziana Di Matteo. Signatures of black hole seeding in the local Universe: Predic- tions from the BRAHMA cosmological simulations.arXiv preprint, 2024

work page 2024
[39]

Making FETCH! Happen: Finding Emergent Dog Whistles Through Common Habitats

Kuleen Sasse, Carlos Aguirre, Isabel Cachola, Sharon Levy, and Mark Dredze. Making FETCH! Happen: Finding Emergent Dog Whistles Through Common Habitats. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025

work page 2025
[40]

Enhancing Retrieval- Augmented Generation: A Study of Best Practices.arXiv preprint, 2025

Siran Li, Linus Stenzel, Carsten Eickhoff, and Seyed Ali Bahrainian. Enhancing Retrieval- Augmented Generation: A Study of Best Practices.arXiv preprint, 2025

work page 2025
[41]

SpaceGNN: Multi-Space Graph Neural Network for Node Anomaly Detection with Extremely Limited Labels.arXiv preprint, 2025

Xiangyu Dong, Xingyi Zhang, Lei Chen, Mingxuan Yuan, and Sibo Wang. SpaceGNN: Multi-Space Graph Neural Network for Node Anomaly Detection with Extremely Limited Labels.arXiv preprint, 2025

work page 2025
[42]

Can Language Models Learn Typologically Implausible Languages?arXiv preprint, 2025

Tianyang Xu, Tatsuki Kuribayashi, Yohei Oseki, Ryan Cotterell, and Alex Warstadt. Can Language Models Learn Typologically Implausible Languages?arXiv preprint, 2025. 26

work page 2025
[43]

Forgotten Polygons: Multimodal Large Language Models are Shape-Blind.arXiv preprint, 2025

William Rudman, Michal Golovanevsky, Amir Bar, Vedant Palit, Yann LeCun, Carsten Eickhoff, and Ritambhara Singh. Forgotten Polygons: Multimodal Large Language Models are Shape-Blind.arXiv preprint, 2025

work page 2025
[44]

Towards More Accurate Full-Atom Antibody Co-Design.arXiv preprint, 2025

Jiayang Wu, Xingyi Zhang, Xiangyu Dong, Kun Xie, Ziqi Liu, Wensheng Gan, Sibo Wang, and Le Song. Towards More Accurate Full-Atom Antibody Co-Design.arXiv preprint, 2025

work page 2025
[45]

Fast and Accurate Antibody Sequence Design via Structure Retrieval

Xingyi Zhang, Kun Xie, Ningqiao Huang, Wei Liu, Peilin Zhao, Sibo Wang, Kangfei Zhao, and Biaobin Jiang. Fast and Accurate Antibody Sequence Design via Structure Retrieval. arXiv preprint, 2025

work page 2025
[46]

QUDsim: Quantifying Discourse Similarities in LLM-Generated Text.arXiv preprint, 2025

Ramya Namuduri, Yating Wu, Anshun Asher Zheng, Manya Wadhwa, Greg Durrett, and Junyi Jessy Li. QUDsim: Quantifying Discourse Similarities in LLM-Generated Text.arXiv preprint, 2025

work page 2025
[47]

Bitterman

Shan Chen, Pedro Moreira, Yuxin Xiao, Sam Schmidgall, Jeremy Warner, Hugo Aerts, Thomas Hartvigsen, Jack Gallifant, and Danielle S. Bitterman. MedBrowseComp: Benchmarking Medical Deep Research and Computer Use.arXiv preprint, 2025

work page 2025
[48]

Pixels Versus Priors: Controlling Knowledge Priors in Vision-Language Models through Visual Counterfacts.arXiv preprint, 2025

Michal Golovanevsky, William Rudman, Michael Lepori, Amir Bar, Ritambhara Singh, and Carsten Eickhoff. Pixels Versus Priors: Controlling Knowledge Priors in Vision-Language Models through Visual Counterfacts.arXiv preprint, 2025

work page 2025
[49]

SAEs Are Good for Steering – If You Select the Right Features

Dana Arad, Aaron Mueller, and Yonatan Belinkov. SAEs Are Good for Steering – If You Select the Right Features. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025

work page 2025
[50]

The Harmonic Structure of Information Contours.arXiv preprint, 2025

Eleftheria Tsipidi, Samuel Kiegeland, Franz Nowak, Tianyang Xu, Ethan Wilcox, Alex Warstadt, Ryan Cotterell, and Mario Giulianelli. The Harmonic Structure of Information Contours.arXiv preprint, 2025

work page 2025
[51]

Bhowmick, Paul Torrey, Alex M

Jonathan Kho, Aklant K. Bhowmick, Paul Torrey, Alex M. Garcia, Niusha Ahvazi, Laura Blecha, and Mark Vogelsberger. Signatures of BH seeding on the M•–σ relation: Predictions from the BRAHMA simulations.arXiv preprint, 2025

work page 2025
[52]

PiCME: Pipeline for Contrastive Modality Evaluation and Encoding in the MIMIC Dataset

Michal Golovanevsky, Pranav Mahableshwarkar, Carsten Eickhoff, and Ritambhara Singh. PiCME: Pipeline for Contrastive Modality Evaluation and Encoding in the MIMIC Dataset. arXiv preprint, 2025

work page 2025
[53]

Type Ia Supernova Progenitors and Surviving Companions within the Symbiotic Channel.The Astrophysical Journal, 2025

Yu-Hui Wang, Hsin-Pei Chen, and Kuo-Chuan Pan. Type Ia Supernova Progenitors and Surviving Companions within the Symbiotic Channel.The Astrophysical Journal, 2025

work page 2025
[54]

Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It.arXiv preprint, 2025

Yulu Qin, Dheeraj Varghese, Adam Dahlgren Lindstr¨om, Lucia Donatelli, Kanishka Misra, and Najoung Kim. Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It.arXiv preprint, 2025

work page 2025
[55]

CRISP: Persistent Concept Unlearning via Sparse Autoencoders.arXiv preprint, 2025

Tomer Ashuach, Dana Arad, Aaron Mueller, Martin Tutek, and Yonatan Belinkov. CRISP: Persistent Concept Unlearning via Sparse Autoencoders.arXiv preprint, 2025

work page 2025
[56]

Task-Agnostic Federated Continual Learning via Replay-Free Gradient Projection.arXiv preprint, 2025

Seohyeon Cha, Huancheng Chen, and Haris Vikalo. Task-Agnostic Federated Continual Learning via Replay-Free Gradient Projection.arXiv preprint, 2025

[57]

Aklant K. Bhowmick, Laura Blecha, Paul Torrey, Luke Zoltan Kelley, Priyamvada Natarajan, Rachel S. Somerville, Rainer Weinberger, Alex M. Garcia, Lars Hernquist, Tiziana Di Matteo, Jonathan Kho, and Mark Vogelsberger. Heavy seeds and the first black holes: Insights from the BRAHMA simulations. arXiv preprint, 2025

[58]

Venkata S Govindarajan and Laura Biester. Dark & Stormy: Modeling Humor in the Worst Sentences Ever Written. arXiv preprint, 2025

[59]

Yanhong Li, Tianyang Xu, Kenan Tang, Karen Livescu, David McAllester, and Jiawei Zhou. OKBench: Democratizing LLM Evaluation with Fully Automated, On-Demand, Open Knowledge Benchmarking. arXiv preprint, 2025

[60]

Seohyeon Cha, Huancheng Chen, Dongjun Kim, Haoran Zhang, Kevin Chan, Gustavo de Veciana, and Haris Vikalo. Regularized Calibration with Successive Rounding for Post-Training Quantization. arXiv preprint, 2026

[61]

Siran Li, Li Mi, Javiera Castillo-Navarro, and Devis Tuia. Questions beyond Pixels: Integrating Commonsense Knowledge in Visual Question Generation for Remote Sensing. arXiv preprint, 2026

[62]

Siran Li, Li Mi, Javiera Castillo-Navarro, and Devis Tuia. Knowledge-aware Visual Question Generation for Remote Sensing Images. In IGARSS 2024 - 2024 IEEE International Geoscience and Remote Sensing Symposium, 2024


[1]

Questions arise from viewing the figure: Reference what is visible (trends, differences, patterns, values, annotations). Express genuine curiosity (why, how, what). Cannot be answered from the caption alone.

[2]

Answers from paper text: 2–4 substantive sentences providing interpretation, cause, or context. Should provide insight, not just restate the question.

[3]

Natural research language: Write from a researcher’s perspective. Match the paper’s sophistication level. Vary question structures naturally.

[4]

Specific to this figure: Reference concrete visual elements (lines, bars, regions, numerical values). Use terms from the provided paragraphs, not generic placeholders. Question type examples (good ✓ and bad ×):
• Cause: ✓ “Why does the accuracy drop sharply after 100 tokens in the left panel?”
• Comparison: ✓ “How does the baseline’s behavior differ from the pro...

[5]

Every claim in the answer must be traceable to the caption and/or source text.

[6]

The answer must be concrete and informative; non-answers like “can be identified by looking at the figure” are not grounded. Figure caption: {caption}. Source text: {source text}. Question: {question}. Answer: {answer}. Output: JSON with grounded (boolean) and reason (brief explanation).

G.4 Zero-shot LLM judge

We evaluate all 1,250 QUDs using a zero-shot LLM judge...

