T-FIX: Text-Based Explanations with Features Interpretable to eXperts
Pith reviewed 2026-05-21 19:40 UTC · model grok-4.3
The pith
T-FIX turns expert-defined criteria into automatic scores for whether LLM explanations match domain reasoning, and those scores generalize to new explanations without further expert input.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
T-FIX operationalizes expert alignment as a measurable property of LLM explanations by encoding domain-grounded criteria supplied by experts, then applies those criteria to produce automatic scores that remain valid for explanations outside the original set of examples.
What carries the argument
T-FIX, a unified evaluation framework that converts expert-defined criteria for domain-grounded reasoning into automatic, generalizable scores for LLM explanations.
If this is right
- Evaluation of new explanations no longer requires fresh expert annotations for each case.
- The same criteria can be reused across multiple LLM outputs and tasks within a domain.
- Different expert groups can supply their own criteria to create personalized alignment measures.
- The framework covers seven tasks spanning three scientific domains, supporting cross-task comparisons.
Where Pith is reading between the lines
- The method could support ongoing monitoring of deployed LLMs in high-stakes settings by flagging when explanations drift from expert standards.
- If criteria prove stable, they might serve as shared benchmarks for comparing explanation quality across different model families.
- Extending the approach to additional domains would require only new expert criteria rather than redesigning the entire evaluation process.
Load-bearing premise
Expert-defined criteria can be made concrete enough to support reliable automatic scoring while still capturing the core reasoning that experts use across different explanations.
What would settle it
Collect a fresh set of LLM explanations that domain experts rate as well-aligned, then check whether T-FIX assigns them consistently high scores; consistent mismatch between expert ratings and T-FIX scores would falsify the generalization claim.
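A minimal sketch of that check, under stated assumptions: the pairing of expert judgments with T-FIX scores, the 0-1 score scale, and the 0.7 cutoff are hypothetical choices made for illustration, not values taken from the paper.

```python
from statistics import mean

def alignment_check(records, threshold=0.7):
    """records: list of (expert_says_aligned: bool, tfix_score: float in [0, 1]).

    Returns the mean T-FIX score over expert-approved explanations and the
    fraction of them scoring below `threshold` (an assumed cutoff).
    """
    approved = [score for aligned, score in records if aligned]
    if not approved:
        raise ValueError("no expert-approved explanations to check")
    misses = sum(1 for score in approved if score < threshold)
    return mean(approved), misses / len(approved)

# Hypothetical data: four expert-approved explanations and one rejected one.
records = [(True, 0.9), (True, 0.8), (True, 0.4), (True, 0.7), (False, 0.2)]
avg, miss_rate = alignment_check(records)
```

A persistently high miss rate on fresh, expert-approved explanations would be the kind of mismatch that counts against the generalization claim.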
Original abstract
As LLMs are deployed in knowledge-intensive settings (e.g., surgery, astronomy, therapy), users are often domain experts who expect not just answers, but explanations that mirror professional reasoning. Yet evaluating whether an LLM "thinks like an expert" remains difficult: existing approaches rely on per-example expert annotation, making them costly, hard to scale, and tied to a single notion of correct reasoning within each domain. To address this gap, we introduce T-FIX, a unified evaluation framework that operationalizes expert alignment as a desired attribute of LLM-generated explanations. T-FIX spans seven scientific tasks across three domains, with each task evaluated against expert-defined criteria that capture domain-grounded reasoning rather than generic explanation quality. Our framework enables automatic, personalizable evaluation of expert alignment that generalizes to unseen explanations without ongoing expert involvement. Code is available at https://github.com/BrachioLab/FIX-2/.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes T-FIX, a framework that operationalizes expert alignment for evaluating LLM explanations using expert-defined criteria. It covers seven scientific tasks in three domains and claims automatic, personalizable evaluation that generalizes to unseen explanations without further expert involvement.
Significance. Should the framework's ability to generalize hold up under scrutiny, it would represent a meaningful advance in scalable evaluation of LLM explanations in expert domains. This could reduce the cost and scalability issues associated with expert annotations. The open-source code is noted as a strength for allowing community verification and extension.
major comments (2)
- [§3] The description of how expert criteria are turned into automatic scoring mechanisms lacks detail on preventing overfitting to the initial annotated explanations, which is critical for the generalization claim to unseen cases.
- [§5.2] Results on generalization are presented for held-out explanations within the same tasks, but no experiments test transfer to explanations generated under different conditions or from other models, leaving the robustness to distribution shift unaddressed.
minor comments (1)
- [Introduction] Some citations to related work on explanation evaluation could be expanded to include more recent papers on LLM alignment.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback, which helps strengthen the presentation of T-FIX's generalization properties. We address each major comment below and commit to revisions that clarify the framework's design and experimental scope.
- Referee: [§3] The description of how expert criteria are turned into automatic scoring mechanisms lacks detail on preventing overfitting to the initial annotated explanations, which is critical for the generalization claim to unseen cases.
  Authors: We agree that Section 3 would benefit from greater specificity. In the revised manuscript we will expand the description of the criterion-to-scorer pipeline to explicitly detail the safeguards against overfitting: (i) criteria are elicited at a high level of abstraction before any explanations are seen, (ii) a separate validation split of annotated explanations is used to tune the automatic scorer, and (iii) the final scorer is frozen before evaluating the held-out test set. These steps will be illustrated with a concrete example from one domain.
  Revision: yes
- Referee: [§5.2] Results on generalization are presented for held-out explanations within the same tasks, but no experiments test transfer to explanations generated under different conditions or from other models, leaving the robustness to distribution shift unaddressed.
  Authors: The current experiments indeed evaluate generalization only to held-out explanations generated under the same prompting and model conditions. We will add a dedicated paragraph in §5.2 (and a short appendix table) that acknowledges this scope and reports preliminary transfer results on two tasks using explanations from a second LLM. If space constraints prevent full cross-model tables, we will at minimum include a clear limitations statement and outline the additional expert-validation steps required for broader distribution-shift testing.
  Revision: partial
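The safeguards named in the first response (a separate validation split for tuning, and a scorer frozen before the test set is touched) might look like the following in practice. This is a sketch only: the split fractions, the seed, and the `FrozenScorer` fields are hypothetical and are not taken from the paper or its code.

```python
import random
from dataclasses import dataclass

@dataclass(frozen=True)
class FrozenScorer:
    """Scorer settings that cannot be reassigned once constructed."""
    criteria: tuple
    threshold: float

def split_explanations(ids, val_fraction=0.2, test_fraction=0.2, seed=0):
    """Shuffle ids deterministically and cut disjoint dev/validation/test splits."""
    ids = list(ids)
    random.Random(seed).shuffle(ids)
    n_test = int(len(ids) * test_fraction)
    n_val = int(len(ids) * val_fraction)
    test = ids[:n_test]
    val = ids[n_test:n_test + n_val]
    dev = ids[n_test + n_val:]
    return dev, val, test

# Tune on `val` only, then freeze the scorer before looking at `test`.
dev, val, test = split_explanations(range(100))
scorer = FrozenScorer(criteria=("criterion A", "criterion B"), threshold=0.5)
```

Freezing the dataclass makes accidental retuning after the test split is opened raise an error instead of passing silently.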
Circularity Check
No significant circularity; framework relies on independent expert criteria
The paper introduces T-FIX as an operationalization of expert alignment using externally defined criteria across seven tasks in three domains. The central claim of automatic, generalizable evaluation to unseen explanations is grounded in these independent expert inputs rather than any self-referential definition, fitted parameter renamed as prediction, or self-citation chain. No equations or derivations reduce the output to the input by construction; the approach treats expert criteria as an external benchmark that is then automated, with generalization tested on held-out explanations. This is self-contained against external validation and receives the default non-finding.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Expert-defined criteria capture domain-grounded reasoning rather than generic explanation quality.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "We formalize expert alignment as a criterion for evaluating explanations with T-FIX... decompose it into atomic claims... score against the domain-specific expert alignment criteria"
- IndisputableMonolith/Foundation/AlexanderDuality.lean, alexander_duality_circle_linking (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "expert alignment criteria... validated... through collaboration with domain experts"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
- Interpretability Can Be Actionable
  Interpretability research should be judged by actionability (the degree to which its insights support concrete decisions and interventions) rather than by explanatory power alone.
Reference graph
Works this paper leans on
-
[1]
Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923.
Gagan Bansal, Tongshuang Wu, Joyce Zhou, Raymond Fok, Besmira Nushi, Ece Kamar, Marco Tulio Ribeiro, and Daniel S. Weld. 2021. Does the whole exceed its parts? The effect of AI explanations on complementary team performance. In Proceedings of the 2021 CHI Conference on Human Factors in Computin...
-
[2]
Goemotions: A dataset of fine-grained emotions. arXiv preprint arXiv:2005.00547.
Norman K Denzin. 1984. On understanding emotion. Transaction Publishers.
Janis Fluri, Tomasz Kacprzak, Aurelien Lucchi, Aurel Schneider, Alexandre Refregier, and Thomas Hofmann. 2022. Full wCDM analysis of KiDS-1000 weak lensing maps using deep learning. Physical Review D, 1...
-
[3]
What is the role of large language models in the evolution of astronomy research? Preprint, arXiv:2409.20252.
M. Gatti, E. Sheldon, A. Amon, M. Becker, M. Troxel, A. Choi, C. Doux, N. MacCrann, A. Navarro-Alsina, I. Harrison, D. Gruen, G. Bernstein, M. Jarvis, L. F. Secco, A. Ferté, T. Shin, J. McCullough, R. P. Rollins, R. Chen, and 85 others. 2021. Dark ...
-
[4]
Building knowledge-guided lexica to model cultural variation. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 211–226.
Shreya Havaldar, Matthew Pressimone, Eric Wong, and Lyle Ungar. 2023a. Comparing styles across languag...
-
[5]
Generate criteria: Use the deep research prompt template shown in Figure A4 to generate a list of expert alignment criteria for your domain. Optionally, have a domain expert vet the generated criteria.
-
[6]
Modify prompts: Modify the prompt templates outlined in Figure A1, Figure A2, and Figure A3 with your task description, few-shot examples, and generated expert criteria.
-
[7]
Run T-FIX: Plug in your prompts for each stage of the pipeline and run T-FIX on your dataset! We encourage you to contact the authors of this work if you need additional assistance setting up your custom domain.
B Prompts for T-FIX Pipeline
We show the prompts for Stage 1, 2, and 3 in Figure A1, Figure A2, and Figure A3, respectively. These prompts show ...
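The three stages referred to here appear elsewhere on this page as claim decomposition, relevance filtering, and expert alignment scoring. A rough sketch of how such a pipeline could be wired together follows. Everything in it is an assumption for illustration: the function names, the callables standing in for the LLM-backed stages, and the choice to aggregate per-claim ratings by their mean are not taken from the paper or the FIX-2 repository.

```python
from statistics import mean
from typing import Callable, List

def run_pipeline(
    explanation: str,
    decompose: Callable[[str], List[str]],
    is_relevant: Callable[[str], bool],
    alignment: Callable[[str], float],
) -> float:
    """Stage 1: split the explanation into claims.
    Stage 2: drop claims that are not relevant.
    Stage 3: rate each remaining claim in [0, 1].
    Averaging the ratings is an assumed aggregation, not the paper's."""
    claims = decompose(explanation)
    relevant = [c for c in claims if is_relevant(c)]
    if not relevant:
        return 0.0
    return mean(alignment(c) for c in relevant)

# Toy stand-ins for the three stages (hypothetical, keyword-based).
toy_decompose = lambda text: [s.strip() for s in text.split(".") if s.strip()]
toy_relevant = lambda claim: "peak" in claim or "void" in claim
toy_alignment = lambda claim: 0.9 if "peak" in claim else 0.5

score = run_pipeline(
    "The map shows many peaks. The image is blue. There are large voids.",
    toy_decompose, toy_relevant, toy_alignment,
)
```

In a real setup each callable would wrap one of the stage prompts; passing them in as arguments keeps the control flow testable without any model calls.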
-
[8]
Lensing Peak (Cluster) Abundance: High peak count → higher σ8; clumpy halos more common
-
[9]
Void Size and Frequency: Large, frequent voids → lower Ωm; less overall matter
-
[10]
Filament Thickness and Sharpness: Thick, sharp filaments track higher σ8; thin indicates lower
-
[11]
Fine-Scale Clumpiness: Fine graininess signifies high σ8; smooth map implies lower
-
[12]
Connectivity of the Cosmic Web: Interconnected web suggests higher Ωm; isolated clumps imply lower
-
[13]
Density Contrast Extremes: Strong density contrast denotes high σ8; muted contrast lower.
D.2 Supernova
Task. The objective is to classify astrophysical objects using time-series data comprising observation times (Modified Julian Dates), wavelengths (filters), flux values, and corresponding flux uncertainties. We use data from the PLAsTiCC challenge (...
-
[14]
Contiguous non-zero flux: Contiguous non-zero flux segments confirm genuine astrophysical activity and ...

We report the mean accuracy for each stage of the pipeline and annotator agreement – Cohen’s κ
Domain | Task | N generated claims | N aligned claims | Claim Decomposition Accuracy | Relevance Filtering Accuracy | Expert Alignment Accuracy | Cohen’s κ
Cosmology | Mass Maps | 66 | 48 | 0.900 | 0.826 | 0.979 | 0.4059
Cosmology | Supernova | 74 | 62 | 0.950 | 0.892 | 0.903 | 0.4946
Psychology | Politeness | 72 | 58 | 0...
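Annotator agreement in this table is reported as Cohen's κ, defined as (p_o − p_e) / (1 − p_e), where p_o is the observed agreement rate and p_e the agreement expected by chance from each annotator's label frequencies. A small self-contained implementation is sketched below; the example labels are hypothetical and unrelated to the paper's annotations.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: share of items where both annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)

# Hypothetical binary "aligned / not aligned" judgments from two annotators.
kappa = cohen_kappa([1, 1, 0, 0], [1, 0, 0, 0])
```

Because κ discounts chance agreement, it can sit well below the raw agreement rate when one label dominates.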
-
[15]
Rise–decline rates: Characteristic rise-and-decline rates, such as the fast-rise/slow-fade morphology of many supernovae, encode energy-release physics and serve as strong class discriminators
-
[16]
Photometric amplitude: Peak-to-trough photometric amplitude separates high-energy explosive events (multi-magnitude outbursts) from low-amplitude periodic or stochastic variables
-
[17]
Event duration: Total event duration, measured from first detection to return to baseline, distinguishes short-lived kilonovae and superluminous SNe from longer plateau or AGN variability phases
-
[18]
Periodic light curves: Periodic light curves with stable periods and distinctive Fourier amplitude- and phase-ratios flag pulsators and eclipsing binaries rather than one-off transients
-
[19]
Secondary maxima: Filter-specific secondary maxima or shoulders in red/near-IR bands, prominent in SNe Ia, are morphological features absent in most core-collapse SNe
-
[20]
Monotonic flux trends: Locally smooth, monotonic flux trends across one or multiple bands (plateaus, linear decays) capture physical evolution stages and help distinguish SN II-P, SN II-L, and related classes.
D.3 Politeness
Task. Understanding how linguistic styles, like politeness, vary across cultures is necessary for building better communication, trans...
-
[21]
Honorifics and Formal Address: The presence of respectful or formal address forms (e.g., “sir,” “usted”) signals politeness by expressing deference to the hearer’s status or social distance
-
[22]
Courteous Politeness Markers: Words such as “please,” “kindly,” or their multilingual variants soften requests and reflect courteous intent
-
[23]
Gratitude Expressions: Use of expressions like “thank you,” “thanks,” or “I appreciate it” signals recognition of the other’s contribution and positive face
-
[24]
Apologies and Acknowledgment of Fault: Phrases such as “sorry” or “I apologize” express humility and repair social breaches, marking a clear politeness strategy
-
[25]
Indirect and Modal Requests: Requests using modal verbs (“could you,” “would you”) or softening cues like “by the way” reduce imposition and signal respect for the hearer’s autonomy
- [26]
-
[27]
Inclusive Pronouns and Group-Oriented Phrasing: Use of “we,” “our,” or “together” expresses solidarity and reduces hierarchical distance in requests or critiques
-
[28]
Greeting and Interaction Initiation: Opening with a salutation (“hi,” “hello”) creates a cooperative tone and frames the conversation positively
-
[29]
Compliments and Praise: Positive evaluations (“great,” “awesome,” “neat”) attend to the hearer’s positive face and foster a friendly environment
-
[30]
Softened Disagreement or Face-Saving Critique: When disagreeing, the use of softeners, partial agreements, or concern for clarity preserves the hearer’s dignity
-
[31]
Urgency or Immediacy of Language: Utterances emphasizing emergency or speed (“asap,” “immediately”) can heighten perceived imposition and reduce politeness if not softened
-
[32]
Avoidance of Profanity or Negative Emotion: The presence of strong negative words or swearing is a key indicator of rudeness and face threat
-
[33]
Bluntness and Direct Commands: Requests lacking modal verbs or mitigation (“Do this”) are perceived as less polite due to their imperative structure
-
[34]
Empathy or Emotional Support: Recognizing the hearer’s emotional context or challenges is a politeness strategy of concern and goodwill
- [35]
-
[36]
Second Person Responsibility or Engagement: Sentences starting with “you” or directly addressing the hearer can either signal engagement or come across as accusatory, depending on context and tone
-
[37]
Questions as Indirect Strategies: Questions (“what do you think?” or “could you clarify?”) reduce imposition by inviting rather than demanding input
-
[38]
Discourse Management with Markers: Use of discourse markers like “so,” “then,” “but” organizes conver-
Prompt
You will be given <task description + expert categories description>
Your task is as follows:
-
[39]
Determine which expert category is most aligned with the claim
-
[40]
Rate how strongly the category aligns with the claim on a scale of 0-1 (0 being lowest, 1 being highest. Use increments of 0.1).
Return your answer as:
Category: <category>
Category Alignment Rating: <rating>
Reasoning: <A brief explanation of why you selected the chosen category and why you judged the alignment rating as you did.>
-----
Expert catego...
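A reply that follows the answer format requested in this prompt (Category, Category Alignment Rating, Reasoning) can be parsed with a small regular expression. The sketch below is illustrative only: the function name and the example reply are hypothetical, and it checks the 0-1 range but does not enforce the 0.1 increments.

```python
import re

# Matches the three labelled fields in the order the prompt requests them.
_PATTERN = re.compile(
    r"Category:\s*(?P<category>.+?)\s*"
    r"Category Alignment Rating:\s*(?P<rating>[01](?:\.\d+)?)\s*"
    r"Reasoning:\s*(?P<reasoning>.*)",
    re.DOTALL,
)

def parse_alignment_reply(reply: str):
    """Return (category, rating, reasoning) or raise ValueError."""
    match = _PATTERN.search(reply)
    if match is None:
        raise ValueError("reply does not follow the expected format")
    rating = float(match.group("rating"))
    if not 0.0 <= rating <= 1.0:
        raise ValueError(f"rating {rating} is outside the 0-1 scale")
    return match.group("category"), rating, match.group("reasoning").strip()

# Hypothetical model reply.
reply = (
    "Category: Gratitude Expressions\n"
    "Category Alignment Rating: 0.8\n"
    "Reasoning: The claim cites the phrase 'thank you'."
)
category, rating, reasoning = parse_alignment_reply(reply)
```

Raising on malformed replies, instead of returning a default rating, keeps silent parsing failures from being counted as low alignment.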
-
[41]
Ingroup Language and Informality: Use of group-identifying slang or casual expressions (“mate,” “dude,” “bro”) may foster solidarity or seem disrespectful, depending on relational norms.
D.4 Emotion
Task. Understanding and classifying emotion is important for tasks like therapy, mental health diagnoses, etc. (Denzin, 1984). Emotion is often expressed i...
-
[42]
Valence: Decide if the overall tone is pleasant or unpleasant; positive tones suggest joy or admiration, negative tones suggest sadness or anger
-
[43]
Arousal: Gauge how energized the wording is: calm phrasing implies low arousal emotions, intense phrasing implies high arousal emotions
-
[44]
Emotion Words & Emojis: Look for direct emotion terms or emoticons that explicitly name the feeling
-
[45]
Expressive Punctuation: Multiple exclamation marks, ALL-CAPS, or stretched spellings signal higher emotional intensity
-
[46]
Humor/Laughter Markers: Tokens like “haha,” “lol,” or laughing emojis reliably indicate amusement
-
[47]
Confusion Phrases: Statements such as “I don’t get it” clearly mark confusion
- [48]
-
[49]
Surprise Exclamations: Reactions of astonishment (“No way!”, “I can’t believe it!”) denote surprise
-
[50]
Threat/Worry Language: References to danger or fear (“I’m scared,” “terrifying”) signal fear or nervousness
-
[51]
Loss or Let-Down Words: Mentions of loss or disappointment cue sadness, disappointment, or grief
-
[52]
Other-Blame Statements: Assigning fault to someone else for a bad outcome suggests anger or disapproval
- [53]
-
[54]
Aversion Terms: Words like “gross,” “nasty,” or “disgusting” point to disgust.
14. Praise & Compliments: Positive evaluations of someone’s actions show admiration or approval.
Prompt
You are an expert in <domain name>. You have a deep understanding of this subject. Your task is to behave like an <domain expert> and identify which criteria are important t...
-
[55]
Gratitude Expressions: Phrases such as “thanks” or “much appreciated” indicate gratitude
-
[56]
Affection & Care Words: Loving or nurturing language (“love this,” “sending hugs”) signals love or caring
-
[57]
Self-Credit Statements: Boasting about one’s own success (“I nailed it”) signals pride
-
[58]
Relief Indicators: Release phrases like “phew,” “finally over,” or “what a relief” mark relief after stress ends.
D.5 Laparoscopic Cholecystectomy Surgery
Task. The task is to identify the safe and unsafe regions for incision. We used the open-source subset of data from (Madani et al., 2022), which consists of surgeon-annotated images taken from video...
-
[59]
Calot’s triangle cleared - Hepatocystic triangle must be fully cleared of fat/fibrosis so that its boundaries are unmistakable
-
[60]
Cystic plate exposed - The lower third of the gallbladder must be dissected off the liver to reveal the shiny cystic plate and ensure the correct dissection plane
-
[61]
Only two structures visible - Only the cystic duct and cystic artery should be seen entering the gallbladder before any clipping or cutting
-
[62]
Above the R4U line - Dissection must remain cephalad to an imaginary line from Rouviere’s sulcus to liver segment IV to avoid the common bile duct
-
[63]
Safe distance from common bile duct - There should be sufficient distance between the common bile duct and the gallbladder wall to ensure safe dissection
-
[64]
Infundibulum start point - Dissection should begin at the gallbladder infundibulum-cystic duct junction to stay in safe tissue planes
-
[65]
Subserosal plane stay - When separating the gallbladder from the liver, stay in the avascular subserosal cleavage plane under the serosal fat layer
-
[66]
Cystic lymph node guide - Identify the cystic lymph node and clip the artery on the gallbladder side of the node to avoid injuring the hepatic artery
-
[67]
No division without ID - Never divide any duct or vessel until it is unequivocally identified as the cystic structure entering the gallbladder
-
[68]
Inflammation bailout - If dense scarring or distorted anatomy obscures Calot’s triangle, convert to a subtotal "fundus-first" approach rather than blind cutting
-
[69]
Aberrant artery caution - Preserve any large or tortuous artery (e.g., a Moynihan’s hump) that might be mistaken for the cystic artery.
D.6 Cardiac Arrest
Task. The objective is to predict whether an ICU patient will experience cardiac arrest within the next 5 minutes, using the patient’s demographic and clinical background (age, gender, race, reason for...
-
[70]
A detailed explanation of where it is safe and unsafe to cut in the image
-
[71]
A list of grid positions (as integers) corresponding to safe regions
-
[72]
A list of grid positions (as integers) corresponding to unsafe regions
The image is discretized into a 9x16 grid (height x width), where each grid position can be represented as a single integer from 0 to 143 (9*16 - 1). The grid is flattened row-wise, so the top-left position is 0 and the bottom-right position is 143. Your response will help train su...
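The row-wise flattening described in this prompt means a grid position p maps to row p // 16 and column p % 16, and back via row * 16 + col. A short sketch of both directions follows; the helper names are hypothetical and only the 9x16 shape and the 0-143 range come from the prompt text.

```python
GRID_HEIGHT, GRID_WIDTH = 9, 16  # height x width, as stated in the prompt

def to_row_col(position: int):
    """Convert a flattened position (0..143) to zero-based (row, col)."""
    if not 0 <= position < GRID_HEIGHT * GRID_WIDTH:
        raise ValueError(f"position {position} is outside the grid")
    return divmod(position, GRID_WIDTH)

def to_position(row: int, col: int) -> int:
    """Convert zero-based (row, col) back to the flattened position."""
    if not (0 <= row < GRID_HEIGHT and 0 <= col < GRID_WIDTH):
        raise ValueError(f"cell ({row}, {col}) is outside the grid")
    return row * GRID_WIDTH + col

top_left = to_row_col(0)        # (0, 0)
bottom_right = to_row_col(143)  # (8, 15)
```

Validating the range on both helpers catches model outputs that list positions beyond 143 before they are mapped onto the image.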
-
[73]
Ventricular Tachyarrhythmias – Rapid ventricular rhythms that can quickly lead to cardiac arrest
-
[74]
Ventricular Ectopy/NSVT – Frequent abnormal ventricular beats signaling high arrest risk
-
[75]
Bradycardia or Heart-Rate Drop – Sudden or severe slowing of heart rate preceding arrest
-
[76]
Dynamic ST-Segment Changes – ST shifts suggesting acute myocardial injury and impending arrest
-
[77]
Prolonged QT Interval – Long QTc increasing risk for torsades and sudden arrhythmia
-
[78]
Severe Hyperkalemia Signs – ECG changes from high potassium predicting arrest, especially among patients on dialysis / end-stage renal disease
-
[79]
Advanced Age – Older age strongly correlates with higher arrest likelihood
-
[80]
Male Sex – Males have a higher overall risk of cardiac arrest.
Prompt
You are a medical expert specializing in cardiac arrest prediction.
You will be given some basic background information about an ICU patient, including their age, gender, race, and primary reason for ICU admittance. You will also be provided with time-series Electrocardiogram (ECG) d...