AgentProp-Bench shows substring judging agrees with humans at kappa=0.049, LLM ensemble at 0.432, bad-parameter injection propagates with ~0.62 probability, rejection and recovery are independent, and a runtime fix cuts hallucinations 23pp on GPT-4o-mini but not Gemini-2.0-Flash.
ISBN 9798400704314
8 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
years
2026 8roles
background 3representative citing papers
LLM-based dense retrievers generalize better when instruction-tuned but pay a specialization tax when optimized for reasoning; they resist typos and corpus poisoning better than encoder-only baselines yet remain vulnerable to semantic perturbations, with larger models and certain embedding geometry,
Relevance Context Learning generates explicit relevance narratives from judged examples to guide LLM assessors, outperforming zero-shot and standard in-context learning for IR relevance judgments.
Synthetically formalizing information needs into topics with descriptions and narratives improves LLM relevance assessor agreement with humans and reduces over-labeling of relevant documents on TREC Deep Learning and Robust04.
LLMs consistently overrate relevance of inadequate passages in IR evaluations due to biases toward length and lexical features rather than true content match.
A query-agnostic black-box attack uses zero-shot surrogate LLMs and adversarial learning on learnable queries to create transferable injection tokens that alter LLM retriever rankings.
LLMs judge document relevance at a level comparable to humans but frequently highlight different passages, indicating they are often not right for the right reasons and cannot fully replace human assessors.
RAG-based LLM extraction reaches 89% accuracy on clinical trial protocols versus 62.6% for standalone models and cuts simulated workflow time by 40%.
citing papers explorer
-
Evaluating Tool-Using Language Agents: Judge Reliability, Propagation Cascades, and Runtime Mitigation in AgentProp-Bench
AgentProp-Bench shows substring judging agrees with humans at kappa=0.049, LLM ensemble at 0.432, bad-parameter injection propagates with ~0.62 probability, rejection and recovery are independent, and a runtime fix cuts hallucinations 23pp on GPT-4o-mini but not Gemini-2.0-Flash.
-
On the Robustness of LLM-Based Dense Retrievers: A Systematic Analysis of Generalizability and Stability
LLM-based dense retrievers generalize better when instruction-tuned but pay a specialization tax when optimized for reasoning; they resist typos and corpus poisoning better than encoder-only baselines yet remain vulnerable to semantic perturbations, with larger models and certain embedding geometry,
-
Hybrid Pooling with LLMs via Relevance Context Learning
Relevance Context Learning generates explicit relevance narratives from judged examples to guide LLM assessors, outperforming zero-shot and standard in-context learning for IR relevance judgments.
-
Formalized Information Needs Improve Large-Language-Model Relevance Judgments
Synthetically formalizing information needs into topics with descriptions and narratives improves LLM relevance assessor agreement with humans and reduces over-labeling of relevant documents on TREC Deep Learning and Robust04.
-
When LLM Judges Inflate Scores: Exploring Overrating in Relevance Assessment
LLMs consistently overrate relevance of inadequate passages in IR evaluations due to biases toward length and lexical features rather than true content match.
-
"Someone Hid It": Query-Agnostic Black-Box Attacks on LLM-Based Retrieval
A query-agnostic black-box attack uses zero-shot surrogate LLMs and adversarial learning on learnable queries to create transferable injection tokens that alter LLM retriever rankings.
-
LLMs as Assessors: Right for the Right Reason?
LLMs judge document relevance at a level comparable to humans but frequently highlight different passages, indicating they are often not right for the right reasons and cannot fully replace human assessors.
-
AI-assisted Protocol Information Extraction For Improved Accuracy and Efficiency in Clinical Trial Workflows
RAG-based LLM extraction reaches 89% accuracy on clinical trial protocols versus 62.6% for standalone models and cuts simulated workflow time by 40%.