T-FIX: Text-Based Explanations with Features Interpretable to eXperts

Shreya Havaldar, Weiqiu You, Chaehyeon Kim, Anton Xue, Helen Jin, Marco Gatti, Bhuvnesh Jain, Helen Qu, Amin Madani, Daniel A. Hashimoto, Gary E. Weissman, Rajat Deo, Sameed Khatana, Lyle Ungar, Eric Wong

arxiv: 2511.04070 · v3 · pith:AZWZRNZMnew · submitted 2025-11-06 · 💻 cs.CL

Pith reviewed 2026-05-21 19:40 UTC · model grok-4.3

keywords: expert alignment, LLM explanations, evaluation framework, domain reasoning, automatic scoring, scientific tasks, generalization

The pith

T-FIX turns expert-defined criteria into automatic scores for whether LLM explanations match domain reasoning, and those scores generalize to new explanations without further expert input.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents T-FIX as a framework that lets researchers measure how closely LLM-generated explanations follow the reasoning patterns of domain experts in scientific fields. Instead of asking experts to label every new explanation, T-FIX first captures their standards once as concrete criteria across seven tasks in three domains. These criteria then drive automatic evaluation that applies to explanations the system has never seen before. The approach aims to replace costly, example-by-example expert annotation with a reusable, personalizable method that still reflects professional judgment. If the criteria capture the right aspects of reasoning, developers could test alignment once per domain and reuse the same system as models and tasks evolve.

Core claim

T-FIX operationalizes expert alignment as a measurable property of LLM explanations by encoding domain-grounded criteria supplied by experts, then applies those criteria to produce automatic scores that remain valid for explanations outside the original set of examples.

What carries the argument

T-FIX, a unified evaluation framework that converts expert-defined criteria for domain-grounded reasoning into automatic, generalizable scores for LLM explanations.

If this is right

Evaluation of new explanations no longer requires fresh expert annotations for each case.
The same criteria can be reused across multiple LLM outputs and tasks within a domain.
Different expert groups can supply their own criteria to create personalized alignment measures.
The framework covers seven tasks spanning three scientific domains, supporting cross-task comparisons.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The method could support ongoing monitoring of deployed LLMs in high-stakes settings by flagging when explanations drift from expert standards.
If criteria prove stable, they might serve as shared benchmarks for comparing explanation quality across different model families.
Extending the approach to additional domains would require only new expert criteria rather than redesigning the entire evaluation process.

Load-bearing premise

Expert-defined criteria can be made concrete enough to support reliable automatic scoring while still capturing the core reasoning that experts use across different explanations.

What would settle it

Collect a fresh set of LLM explanations that domain experts rate as well-aligned, then check whether T-FIX assigns them consistently high scores; consistent mismatch between expert ratings and T-FIX scores would falsify the generalization claim.
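One way to run this check is a rank correlation between expert ratings and T-FIX scores on the fresh set. The sketch below is illustrative only: the function names, the numeric rating scale, and the choice of Spearman correlation are assumptions, not something taken from the paper or its released code, and the sample numbers are made up.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Plain Pearson correlation; raises if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman(expert_ratings: Sequence[float], tfix_scores: Sequence[float]) -> float:
    """Spearman correlation, computed as Pearson correlation of the ranks."""
    return pearson(average_ranks(expert_ratings), average_ranks(tfix_scores))


# Hypothetical data: expert ratings on a 1-5 scale and automatic scores in [0, 1].
expert = [1, 2, 3, 4, 5]
automatic = [0.2, 0.1, 0.6, 0.7, 0.9]
rho = spearman(expert, automatic)
print(f"Spearman rho = {rho:.2f}")
```

A value near 1 on fresh explanations would be consistent with the generalization claim; a value near 0, or a negative one, would count against it. Where to draw that line is a judgment call that this sketch does not settle.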

Figures

Figures reproduced from arXiv: 2511.04070 by Amin Madani, Anton Xue, Bhuvnesh Jain, Chaehyeon Kim, Daniel A. Hashimoto, Eric Wong, Gary E. Weissman, Helen Jin, Helen Qu, Lyle Ungar, Marco Gatti, Rajat Deo, Sameed Khatana, Shreya Havaldar, Weiqiu You.

**Figure 2.** An overview of the T-FIX construction process. For each dataset, we first establish expert alignment criteria – features deemed important by domain experts for a specific task – through collaboration with these experts and LLM-based deep research tools. These criteria form the basis of the T-FIX evaluation pipeline, which processes an LLM-generated explanation to output an expert alignment score. A high sc…

**Figure 3.** Our T-FIX pipeline. To evaluate an LLM-generated explanation, we first decompose it into atomic claims. Next, we filter out irrelevant claims, such as unsupported or speculative statements. Each remaining claim is then scored against the domain-specific expert alignment criteria: a score of “complete” indicates perfect overlap with at least one criterion, while “none” indicates no overlap. Filtered-out cla…

**Figure 4.** Overview of datasets and domains in T-FIX. We evaluate LLM expert alignment across seven diverse domains, spanning cosmology, psychology, and medicine. For each dataset, we highlight the motivating task, input–output format, representative example, and the expert responsible for validating alignment criteria. The final row summarizes the expert alignment criteria used for scoring explanations in each domai…

**Figure 5.** Shannon Entropy of expert alignment criteria for GPT-4o. For each prompting baseline, we show coverage…

**Figure 6.** Expert Alignment vs. Accuracy Correlation
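The three stages named in the Figure 3 caption (decompose, filter, score) can be sketched as a small function. This is a minimal illustration under stated assumptions, not the authors' implementation: in the paper each stage is prompt-driven, whereas here the stages are passed in as plain callables with toy keyword-based stand-ins. All names are hypothetical. Averaging over the retained claims is also an assumption, because the caption is cut off before it says how filtered-out claims enter the final score.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class ClaimScore:
    claim: str
    rating: float  # 0.0 = no overlap with any criterion, 1.0 = complete overlap


def score_explanation(
    explanation: str,
    criteria: Sequence[str],
    decompose: Callable[[str], List[str]],
    is_relevant: Callable[[str], bool],
    rate: Callable[[str, str], float],
) -> Tuple[float, List[ClaimScore]]:
    """Decompose, filter, then rate each kept claim against its best-matching criterion."""
    claims = decompose(explanation)                # stage 1: atomic claims
    kept = [c for c in claims if is_relevant(c)]   # stage 2: drop irrelevant claims
    scored = [                                     # stage 3: best overlap per claim
        ClaimScore(c, max((rate(c, crit) for crit in criteria), default=0.0))
        for c in kept
    ]
    if not scored:
        return 0.0, scored
    # ASSUMPTION: the overall score is the mean rating of the kept claims.
    return sum(s.rating for s in scored) / len(scored), scored


# Toy stand-ins for the stages (hypothetical, keyword-based).
def toy_decompose(text: str) -> List[str]:
    return [s.strip() for s in text.split(".") if s.strip()]


def toy_is_relevant(claim: str) -> bool:
    return "might" not in claim.lower()  # treat hedged guesses as speculative


def toy_rate(claim: str, criterion: str) -> float:
    return 1.0 if criterion.lower() in claim.lower() else 0.0


EXAMPLE = "The map shows large voids. Filaments are thin. It might be noisy. The colors are pretty."
overall, per_claim = score_explanation(
    EXAMPLE, ["void", "filament"], toy_decompose, toy_is_relevant, toy_rate
)
print(overall, [(s.claim, s.rating) for s in per_claim])
```

Passing the stages as callables keeps the control flow separate from whichever model or prompt implements each stage, so a model-backed stage can replace a toy one without changing the aggregation.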

Original abstract

As LLMs are deployed in knowledge-intensive settings (e.g., surgery, astronomy, therapy), users are often domain experts who expect not just answers, but explanations that mirror professional reasoning. Yet evaluating whether an LLM "thinks like an expert" remains difficult: existing approaches rely on per-example expert annotation, making them costly, hard to scale, and tied to a single notion of correct reasoning within each domain. To address this gap, we introduce T-FIX, a unified evaluation framework that operationalizes expert alignment as a desired attribute of LLM-generated explanations. T-FIX spans seven scientific tasks across three domains, with each task evaluated against expert-defined criteria that capture domain-grounded reasoning rather than generic explanation quality. Our framework enables automatic, personalizable evaluation of expert alignment that generalizes to unseen explanations without ongoing expert involvement. Code is available at https://github.com/BrachioLab/FIX-2/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes T-FIX, a framework that operationalizes expert alignment for evaluating LLM explanations using expert-defined criteria. It covers seven scientific tasks in three domains and claims automatic, personalizable evaluation that generalizes to unseen explanations without further expert involvement.

Significance. Should the framework's ability to generalize hold up under scrutiny, it would represent a meaningful advance in scalable evaluation of LLM explanations in expert domains. This could reduce the cost and scalability issues associated with expert annotations. The open-source code is noted as a strength for allowing community verification and extension.

major comments (2)

[§3] The description of how expert criteria are turned into automatic scoring mechanisms lacks detail on preventing overfitting to the initial annotated explanations, which is critical for the generalization claim to unseen cases.
[§5.2] Results on generalization are presented for held-out explanations within the same tasks, but no experiments test transfer to explanations generated under different conditions or from other models, leaving the robustness to distribution shift unaddressed.

minor comments (1)

[Introduction] Some citations to related work on explanation evaluation could be expanded to include more recent papers on LLM alignment.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback, which helps strengthen the presentation of T-FIX's generalization properties. We address each major comment below and commit to revisions that clarify the framework's design and experimental scope.

Referee: [§3] The description of how expert criteria are turned into automatic scoring mechanisms lacks detail on preventing overfitting to the initial annotated explanations, which is critical for the generalization claim to unseen cases.

Authors: We agree that Section 3 would benefit from greater specificity. In the revised manuscript we will expand the description of the criterion-to-scorer pipeline to explicitly detail the safeguards against overfitting: (i) criteria are elicited at a high level of abstraction before any explanations are seen, (ii) a separate validation split of annotated explanations is used to tune the automatic scorer, and (iii) the final scorer is frozen before evaluating the held-out test set. These steps will be illustrated with a concrete example from one domain.
revision: yes

Referee: [§5.2] Results on generalization are presented for held-out explanations within the same tasks, but no experiments test transfer to explanations generated under different conditions or from other models, leaving the robustness to distribution shift unaddressed.

Authors: The current experiments indeed evaluate generalization only to held-out explanations generated under the same prompting and model conditions. We will add a dedicated paragraph in §5.2 (and a short appendix table) that acknowledges this scope and reports preliminary transfer results on two tasks using explanations from a second LLM. If space constraints prevent full cross-model tables, we will at minimum include a clear limitations statement and outline the additional expert-validation steps required for broader distribution-shift testing.
revision: partial

Circularity Check

0 steps flagged

No significant circularity; framework relies on independent expert criteria

The paper introduces T-FIX as an operationalization of expert alignment using externally defined criteria across seven tasks in three domains. The central claim of automatic, generalizable evaluation to unseen explanations is grounded in these independent expert inputs rather than any self-referential definition, fitted parameter renamed as prediction, or self-citation chain. No equations or derivations reduce the output to the input by construction; the approach treats expert criteria as an external benchmark that is then automated, with generalization tested on held-out explanations. This is self-contained against external validation and receives the default non-finding.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the domain assumption that expert criteria can be turned into automatic, generalizable evaluations; no free parameters or invented entities are mentioned.

axioms (1)

domain assumption: Expert-defined criteria capture domain-grounded reasoning rather than generic explanation quality.
Invoked when describing how each task is evaluated against expert-defined criteria.

pith-pipeline@v0.9.0 · 5744 in / 1167 out tokens · 93951 ms · 2026-05-21T19:40:57.356235+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · relation: unclear
Paper passage: "We formalize expert alignment as a criterion for evaluating explanations with T-FIX... decompose it into atomic claims... score against the domain-specific expert alignment criteria"

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · relation: unclear
Paper passage: "expert alignment criteria... validated... through collaboration with domain experts"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Interpretability Can Be Actionable
cs.LG 2026-05 conditional novelty 6.0

Interpretability research should be judged by actionability—the degree to which its insights support concrete decisions and interventions—rather than explanatory power alone.


[1] [1]

Qwen2.5-VL Technical Report

Qwen2.5-vl technical report.arXiv preprint arXiv:2502.13923. Gagan Bansal, Tongshuang Wu, Joyce Zhou, Ray- mond Fok, Besmira Nushi, Ece Kamar, Marco Tulio Ribeiro, and Daniel S. Weld. 2021. Does the whole exceed its parts? the effect of AI explanations on complementary team performance. InProceedings of the 2021 CHI Conference on Human Factors in Computin...

work page internal anchor Pith review Pith/arXiv arXiv 2021

[2] [2]

Norman K Denzin

Goemotions: A dataset of fine-grained emo- tions.arXiv preprint arXiv:2005.00547. Norman K Denzin. 1984.On understanding emotion. Transaction Publishers. Janis Fluri, Tomasz Kacprzak, Aurelien Lucchi, Aurel Schneider, Alexandre Refregier, and Thomas Hof- mann. 2022. Full wCDM analysis of KiDS-1000 weak lensing maps using deep learning.Physical Review D, 1...

work page arXiv 2005

[3] [3]

What is the role of large language models in the evolution of astronomy research?Preprint, arXiv:2409.20252. M. Gatti, E. Sheldon, A. Amon, M. Becker, M. Troxel, A. Choi, C. Doux, N. MacCrann, A. Navarro-Alsina, I. Harrison, D. Gruen, G. Bernstein, M. Jarvis, L. F. Secco, A. Ferté, T. Shin, J. McCullough, R. P. Rollins, R. Chen, and 85 others. 2021. Dark ...

work page arXiv 2021

[4] [4]

Building knowledge-guided lexica to model cultural variation. InProceedings of the 2024 Con- ference of the North American Chapter of the Asso- ciation for Computational Linguistics: Human Lan- guage Technologies (Volume 1: Long Papers), pages 211–226. Shreya Havaldar, Matthew Pressimone, Eric Wong, and Lyle Ungar. 2023a. Comparing styles across lan- guag...

work page doi:10.13026/jz99-4j81 2024

Generate criteria: Use the deep research prompt template shown in Figure A4 to generate a list of expert alignment criteria for your domain. Optionally, have a domain expert vet the generated criteria.

Modify prompts: Modify the prompt templates outlined in Figure A1, Figure A2, and Figure A3 with your task description, few-shot examples, and generated expert criteria.

Run T-FIX: Plug in your prompts for each stage of the pipeline and run T-FIX on your dataset! We encourage you to contact the authors of this work if you need additional assistance setting up your custom domain.

B Prompts for T-FIX Pipeline

We show the prompts for Stage 1, 2, and 3 in Figure A1, Figure A2, and Figure A3, respectively. These prompts show ...

Lensing Peak (Cluster) Abundance: High peak count → higher σ8; clumpy halos more common.

Void Size and Frequency: Large, frequent voids → lower Ωm; less overall matter.

Filament Thickness and Sharpness: Thick, sharp filaments track higher σ8; thin indicates lower.

Fine-Scale Clumpiness: Fine graininess signifies high σ8; smooth map implies lower.

Connectivity of the Cosmic Web: Interconnected web suggests higher Ωm; isolated clumps imply lower.

Density Contrast Extremes: Strong density contrast denotes high σ8; muted contrast lower.

D.2 Supernova

Task. The objective is to classify astrophysical objects using time-series data comprising observation times (Modified Julian Dates), wavelengths (filters), flux values, and corresponding flux uncertainties. We use data from the PLAsTiCC challenge (...

Contiguous non-zero flux: Contiguous non-zero flux segments confirm genuine astrophysical activity and ...

We report the mean accuracy for each stage of the pipeline and annotator agreement (Cohen’s κ).

| Domain | N generated claims | N aligned claims | Claim Decomposition Accuracy | Relevance Filtering Accuracy | Expert Alignment Accuracy | Cohen’s κ |
|---|---|---|---|---|---|---|
| Cosmology (Mass Maps) | 66 | 48 | 0.900 | 0.826 | 0.979 | 0.4059 |
| Cosmology (Supernova) | 74 | 62 | 0.950 | 0.892 | 0.903 | 0.4946 |
| Psychology (Politeness) | 72 | 58 | 0... |
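For readers unfamiliar with the agreement statistic reported above, the following is a minimal sketch of how Cohen’s κ is conventionally computed for two annotators, using κ = (p_o − p_e) / (1 − p_e), where p_o is observed agreement and p_e is agreement expected by chance. The function name `cohens_kappa` and the example labels are illustrative assumptions; they are not the authors’ code or data.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: fraction of items given identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Hypothetical binary "aligned / not aligned" judgments (not from the paper).
annotator_1 = [1, 1, 0, 1, 0, 1, 1, 0]
annotator_2 = [1, 0, 0, 1, 0, 1, 1, 1]
kappa = cohens_kappa(annotator_1, annotator_2)
```

Because κ subtracts chance agreement, it can be much lower than raw percent agreement when one label dominates.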

Rise–decline rates: Characteristic rise-and-decline rates, such as the fast-rise/slow-fade morphology of many supernovae, encode energy-release physics and serve as strong class discriminators.

Photometric amplitude: Peak-to-trough photometric amplitude separates high-energy explosive events (multi-magnitude outbursts) from low-amplitude periodic or stochastic variables.

Event duration: Total event duration, measured from first detection to return to baseline, distinguishes short-lived kilonovae and superluminous SNe from longer plateau or AGN variability phases.

Periodic light curves: Periodic light curves with stable periods and distinctive Fourier amplitude- and phase-ratios flag pulsators and eclipsing binaries rather than one-off transients.

Secondary maxima: Filter-specific secondary maxima or shoulders in red/near-IR bands, prominent in SNe Ia, are morphological features absent in most core-collapse SNe.

Monotonic flux trends: Locally smooth, monotonic flux trends across one or multiple bands (plateaus, linear decays) capture physical evolution stages and help distinguish SNII-P, SNII-L, and related classes.

D.3 Politeness

Task. Understanding how linguistic styles, like politeness, vary across cultures is necessary for building better communication, trans...

Honorifics and Formal Address: The presence of respectful or formal address forms (e.g., “sir,” “usted,”) signals politeness by expressing deference to the hearer’s status or social distance.

Courteous Politeness Markers: Words such as “please,” “kindly,” or their multilingual variants soften requests and reflect courteous intent.

Gratitude Expressions: Use of expressions like “thank you,” “thanks,” or “I appreciate it” signals recognition of the other’s contribution and positive face.

Apologies and Acknowledgment of Fault: Phrases such as “sorry” or “I apologize” express humility and repair social breaches, marking a clear politeness strategy.

Indirect and Modal Requests: Requests using modal verbs (“could you,” “would you”) or softening cues like “by the way” reduce imposition and signal respect for the hearer’s autonomy.

Hedging and Tentative Language: Words like “I think,” “maybe,” or “usually” lower assertion strength and make statements more negotiable, reflecting interpersonal sensitivity.

Inclusive Pronouns and Group-Oriented Phrasing: Use of “we,” “our,” or “together” expresses solidarity and reduces hierarchical distance in requests or critiques.

Greeting and Interaction Initiation: Opening with a salutation (“hi,” “hello”) creates a cooperative tone and frames the conversation positively.

Compliments and Praise: Positive evaluations (“great,” “awesome,” “neat”) attend to the hearer’s positive face and foster a friendly environment.

Softened Disagreement or Face-Saving Critique: When disagreeing, the use of softeners, partial agreements, or concern for clarity preserves the hearer’s dignity.

Urgency or Immediacy of Language: Utterances emphasizing emergency or speed (“asap,” “immediately”) can heighten perceived imposition and reduce politeness if not softened.

Avoidance of Profanity or Negative Emotion: The presence of strong negative words or swearing is a key indicator of rudeness and face threat.

Bluntness and Direct Commands: Requests lacking modal verbs or mitigation (“Do this”) are perceived as less polite due to their imperative structure.

Empathy or Emotional Support: Recognizing the hearer’s emotional context or challenges is a politeness strategy of concern and goodwill.

First-Person Subjectivity Markers: Statements that begin with “I think,” “I feel,” or “In my view” convey humility and subjectivity, reducing imposition.

Second Person Responsibility or Engagement: Sentences starting with “you” or directly addressing the hearer can either signal engagement or come across as accusatory, depending on context and tone.

Questions as Indirect Strategies: Questions (“what do you think?” or “could you clarify?”) reduce imposition by inviting rather than demanding input.

Discourse Management with Markers: Use of discourse markers like “so,” “then,” “but” organizes conver...

Prompt

You will be given <task description + expert categories description>

Your task is as follows:

Determine which expert category is most aligned with the claim.

Rate how strongly the category aligns with the claim on a scale of 0-1 (0 being lowest, 1 being highest. Use increments of 0.1).

Return your answer as:
Category: <category>
Category Alignment Rating: <rating>
Reasoning: <A brief explanation of why you selected the chosen category and why you judged the alignment rating as you did.>

-----

Expert catego...
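The answer format requested by the prompt above (a `Category:` field, a `Category Alignment Rating:` field on a 0 to 1 scale, and a `Reasoning:` field) lends itself to simple parsing. The sketch below shows one way a caller might extract those fields from a model response. The function name `parse_alignment_response`, the regular expression, and the sample response are assumptions made for illustration; the paper does not specify how responses are parsed.

```python
import re

# Field names follow the answer format given in the prompt; the regex is an assumption.
RESPONSE_PATTERN = re.compile(
    r"Category:\s*(?P<category>.+?)\s*"
    r"Category Alignment Rating:\s*(?P<rating>\d+(?:\.\d+)?)\s*"
    r"Reasoning:\s*(?P<reasoning>.*)",
    re.DOTALL,
)


def parse_alignment_response(text):
    """Extract category, rating, and reasoning from one model response."""
    match = RESPONSE_PATTERN.search(text)
    if match is None:
        raise ValueError("response does not follow the expected format")
    rating = float(match.group("rating"))
    # The prompt asks for a rating between 0 and 1.
    if not 0.0 <= rating <= 1.0:
        raise ValueError(f"rating {rating} is outside the 0-1 scale")
    return {
        "category": match.group("category").strip(),
        "rating": rating,
        "reasoning": match.group("reasoning").strip(),
    }


# Hypothetical model output, written for this example only.
sample = (
    "Category: Hedging and Tentative Language\n"
    "Category Alignment Rating: 0.8\n"
    "Reasoning: The claim cites 'maybe' as a softener."
)
parsed = parse_alignment_response(sample)
```

Rejecting malformed or out-of-range responses, rather than silently coercing them, keeps scoring errors visible when a model ignores the requested format.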

Ingroup Language and Informality: Use of group-identifying slang or casual expressions (“mate,” “dude,” “bro”) may foster solidarity or seem disrespectful, depending on relational norms.

D.4 Emotion

Task. Understanding and classifying emotion is important for tasks like therapy, mental health diagnoses, etc. (Denzin, 1984). Emotion is often expressed i...

1. Valence: Decide if the overall tone is pleasant or unpleasant; positive tones suggest joy or admiration, negative tones suggest sadness or anger.

2. Arousal: Gauge how energized the wording is; calm phrasing implies low arousal emotions, intense phrasing implies high arousal emotions.

3. Emotion Words & Emojis: Look for direct emotion terms or emoticons that explicitly name the feeling.

4. Expressive Punctuation: Multiple exclamation marks, ALL-CAPS, or stretched spellings signal higher emotional intensity.

5. Humor/Laughter Markers: Tokens like “haha,” “lol,” or laughing emojis reliably indicate amusement.

6. Confusion Phrases: Statements such as “I don’t get it” clearly mark confusion.

7. Curiosity Questions: Genuine information-seeking phrases (“I wonder...”, “why is...?”) point to curiosity.

8. Surprise Exclamations: Reactions of astonishment (“No way!”, “I can’t believe it!”) denote surprise.

9. Threat/Worry Language: References to danger or fear (“I’m scared,” “terrifying”) signal fear or nervousness.

10. Loss or Let-Down Words: Mentions of loss or disappointment cue sadness, disappointment, or grief.

11. Other-Blame Statements: Assigning fault to someone else for a bad outcome suggests anger or disapproval.

12. Self-Blame & Apologies: Admitting fault and saying “I’m sorry” marks remorse.

13. Aversion Terms: Words like “gross,” “nasty,” or “disgusting” point to disgust.

14. Praise & Compliments: Positive evaluations of someone’s actions show admiration or approval.

Prompt

You are an expert in <domain name>. You have a deep understanding of this subject. Your task is to behave like an <domain expert> and identify which criteria are important t...

15. Gratitude Expressions: Phrases such as “thanks” or “much appreciated” indicate gratitude.

16. Affection & Care Words: Loving or nurturing language (“love this,” “sending hugs”) signals love or caring.

17. Self-Credit Statements: Boasting about one’s own success (“I nailed it”) signals pride.

18. Relief Indicators: Release phrases like “phew,” “finally over,” or “what a relief” mark relief after stress ends.

D.5 Laparoscopic Cholecystectomy Surgery

Task. The task is to identify the safe and unsafe regions for incision. We used the open-source subset of data from Madani et al. (2022), which consists of surgeon-annotated images taken from video...

Calot’s triangle cleared - Hepatocystic triangle must be fully cleared of fat/fibrosis so that its boundaries are unmistakable.

Cystic plate exposed - The lower third of the gallbladder must be dissected off the liver to reveal the shiny cystic plate and ensure the correct dissection plane.

Only two structures visible - Only the cystic duct and cystic artery should be seen entering the gallbladder before any clipping or cutting.

Above the R4U line - Dissection must remain cephalad to an imaginary line from Rouviere’s sulcus to liver segment IV to avoid the common bile duct.

Safe distance from common bile duct - There should be sufficient distance between the common bile duct and the gallbladder wall to ensure safe dissection.

Infundibulum start point - Dissection should begin at the gallbladder infundibulum-cystic duct junction to stay in safe tissue planes.

Subserosal plane stay - When separating the gallbladder from the liver, stay in the avascular subserosal cleavage plane under the serosal fat layer.

Cystic lymph node guide - Identify the cystic lymph node and clip the artery on the gallbladder side of the node to avoid injuring the hepatic artery.

No division without ID - Never divide any duct or vessel until it is unequivocally identified as the cystic structure entering the gallbladder.

Inflammation bailout - If dense scarring or distorted anatomy obscures Calot’s triangle, convert to a subtotal "fundus-first" approach rather than blind cutting.

Aberrant artery caution - Preserve any large or tortuous artery (e.g., a Moynihan’s hump) that might be mistaken for the cystic artery.

D.6 Cardiac Arrest

Task. The objective is to predict whether an ICU patient will experience cardiac arrest within the next 5 minutes, using the patient’s demographic and clinical background (age, gender, race, reason for...

A detailed explanation of where it is safe and unsafe to cut in the image

A list of grid positions (as integers) corresponding to safe regions

A list of grid positions (as integers) corresponding to unsafe regions

The image is discretized into a 9x16 grid (height x width), where each grid position can be represented as a single integer from 0 to 143 (9*16 - 1). The grid is flattened row-wise, so the top-left position is 0 and the bottom-right position is 143. Your response will help train su...
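The row-wise flattening described in the prompt above can be made concrete with a short sketch. It converts between (row, column) cells and the single-integer positions 0 to 143 for the 9x16 grid. The helper names `cell_to_index` and `index_to_cell` are hypothetical and chosen for illustration; only the grid size and the row-wise ordering come from the prompt.

```python
GRID_HEIGHT, GRID_WIDTH = 9, 16  # grid size stated in the prompt (height x width)


def cell_to_index(row, col):
    """Map a (row, col) cell to its row-wise flattened position."""
    if not (0 <= row < GRID_HEIGHT and 0 <= col < GRID_WIDTH):
        raise ValueError(f"cell ({row}, {col}) is outside the grid")
    return row * GRID_WIDTH + col


def index_to_cell(index):
    """Map a flattened position back to its (row, col) cell."""
    if not 0 <= index < GRID_HEIGHT * GRID_WIDTH:
        raise ValueError(f"index {index} is outside 0..{GRID_HEIGHT * GRID_WIDTH - 1}")
    return divmod(index, GRID_WIDTH)


top_left = cell_to_index(0, 0)          # first cell of the first row
bottom_right = cell_to_index(8, 15)     # last cell of the last row
second_row_start = index_to_cell(16)    # first cell of the second row
```

Validating indices against the grid bounds is a cheap way to catch positions that a model returns outside the 0 to 143 range.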

Ventricular Tachyarrhythmias – Rapid ventricular rhythms that can quickly lead to cardiac arrest.

Ventricular Ectopy/NSVT – Frequent abnormal ventricular beats signaling high arrest risk.

Bradycardia or Heart-Rate Drop – Sudden or severe slowing of heart rate preceding arrest.

Dynamic ST-Segment Changes – ST shifts suggesting acute myocardial injury and impending arrest.

Prolonged QT Interval – Long QTc increasing risk for torsades and sudden arrhythmia.

Severe Hyperkalemia Signs – ECG changes from high potassium predicting arrest, especially among patients on dialysis / end stage renal disease.

Advanced Age – Older age strongly correlates with higher arrest likelihood.

Male Sex – Males have a higher overall risk of cardiac arrest.

Prompt

You are a medical expert specializing in cardiac arrest prediction. You will be given some basic background information about an ICU patient, including their age, gender, race, and primary reason for ICU admittance. You will also be provided with time-series Electrocardiogram (ECG) d...