SummEval: Re-evaluating Summarization Evaluation
5 Pith papers cite this work. Polarity classification is still indexing.
Verdicts: UNVERDICTED 5
Citation roles: method 1
Citation polarities: use method 1
citing papers explorer
- Faithful or Fabricated? A Causal Framework for Rationalization Bias in LLM Judges
  Presents cue interventions and tie-aware metrics to detect rationalization bias in LLM judges and demonstrates that PROOF-BEFORE-PREFERENCE reduces cue anchoring compared to baselines.
- Redact or Keep? A Fully Local AI Cascade for Educational Dialogue De-Identification
  A local cascade framework for educational dialogue de-identification reaches 0.958 macro F1 on math tutoring transcripts, outperforming same-family LLM-only and commercial baselines while remaining fully on-device.
- Anthropogenic Regional Adaptation in Multimodal Vision-Language Model
  Anthropogenic Regional Adaptation with GG-EZ improves cultural relevance in multimodal vision-language models for Southeast Asia by 5-15% while retaining over 98% of global performance.
- Less LLM, More Documents: Searching for Improved RAG
  Corpus scaling in RAG frequently matches the accuracy gains from larger LLMs on open-domain QA tasks, with mid-sized models benefiting most due to better passage coverage.
- Overview of HIPE-2026: Person-Place Relation Extraction from Multilingual Historical Texts
  HIPE-2026 is an evaluation campaign with 17 teams testing relation extraction for person presence at locations in 19th-20th century newspapers across French, German, and English plus a literary generalization set.