ISBN 9798400704314

Association for Computing Machinery · 2024 · arXiv 6772.365770

8 Pith papers cite this work. Polarity classification is still indexing.

8 Pith papers citing it

read on arXiv browse 8 citing papers

citation-role summary

background 3

citation-polarity summary

background 2 unclear 1

representative citing papers

Evaluating Tool-Using Language Agents: Judge Reliability, Propagation Cascades, and Runtime Mitigation in AgentProp-Bench

cs.AI · 2026-04-17 · conditional · novelty 7.0

AgentProp-Bench shows substring judging agrees with humans at kappa=0.049, LLM ensemble at 0.432, bad-parameter injection propagates with ~0.62 probability, rejection and recovery are independent, and a runtime fix cuts hallucinations 23pp on GPT-4o-mini but not Gemini-2.0-Flash.

On the Robustness of LLM-Based Dense Retrievers: A Systematic Analysis of Generalizability and Stability

cs.IR · 2026-04-17 · unverdicted · novelty 7.0

LLM-based dense retrievers generalize better when instruction-tuned but pay a specialization tax when optimized for reasoning; they resist typos and corpus poisoning better than encoder-only baselines yet remain vulnerable to semantic perturbations, with larger models and certain embedding geometry,

Hybrid Pooling with LLMs via Relevance Context Learning

cs.IR · 2026-02-09 · unverdicted · novelty 7.0

Relevance Context Learning generates explicit relevance narratives from judged examples to guide LLM assessors, outperforming zero-shot and standard in-context learning for IR relevance judgments.

Formalized Information Needs Improve Large-Language-Model Relevance Judgments

cs.IR · 2026-04-05 · conditional · novelty 6.0

Synthetically formalizing information needs into topics with descriptions and narratives improves LLM relevance assessor agreement with humans and reduces over-labeling of relevant documents on TREC Deep Learning and Robust04.

When LLM Judges Inflate Scores: Exploring Overrating in Relevance Assessment

cs.IR · 2026-02-19 · unverdicted · novelty 6.0

LLMs consistently overrate relevance of inadequate passages in IR evaluations due to biases toward length and lexical features rather than true content match.

"Someone Hid It": Query-Agnostic Black-Box Attacks on LLM-Based Retrieval

cs.CR · 2026-01-30 · unverdicted · novelty 6.0

A query-agnostic black-box attack uses zero-shot surrogate LLMs and adversarial learning on learnable queries to create transferable injection tokens that alter LLM retriever rankings.

LLMs as Assessors: Right for the Right Reason?

cs.IR · 2026-01-13 · unverdicted · novelty 5.0

LLMs judge document relevance at a level comparable to humans but frequently highlight different passages, indicating they are often not right for the right reasons and cannot fully replace human assessors.

AI-assisted Protocol Information Extraction For Improved Accuracy and Efficiency in Clinical Trial Workflows

cs.IR · 2026-01-19 · unverdicted · novelty 4.0

RAG-based LLM extraction reaches 89% accuracy on clinical trial protocols versus 62.6% for standalone models and cuts simulated workflow time by 40%.

citing papers explorer

Showing 8 of 8 citing papers.

Evaluating Tool-Using Language Agents: Judge Reliability, Propagation Cascades, and Runtime Mitigation in AgentProp-Bench cs.AI · 2026-04-17 · conditional · none · ref 23
AgentProp-Bench shows substring judging agrees with humans at kappa=0.049, LLM ensemble at 0.432, bad-parameter injection propagates with ~0.62 probability, rejection and recovery are independent, and a runtime fix cuts hallucinations 23pp on GPT-4o-mini but not Gemini-2.0-Flash.
On the Robustness of LLM-Based Dense Retrievers: A Systematic Analysis of Generalizability and Stability cs.IR · 2026-04-17 · unverdicted · none · ref 44
LLM-based dense retrievers generalize better when instruction-tuned but pay a specialization tax when optimized for reasoning; they resist typos and corpus poisoning better than encoder-only baselines yet remain vulnerable to semantic perturbations, with larger models and certain embedding geometry,
Hybrid Pooling with LLMs via Relevance Context Learning cs.IR · 2026-02-09 · unverdicted · none · ref 39
Relevance Context Learning generates explicit relevance narratives from judged examples to guide LLM assessors, outperforming zero-shot and standard in-context learning for IR relevance judgments.
Formalized Information Needs Improve Large-Language-Model Relevance Judgments cs.IR · 2026-04-05 · conditional · none · ref 40
Synthetically formalizing information needs into topics with descriptions and narratives improves LLM relevance assessor agreement with humans and reduces over-labeling of relevant documents on TREC Deep Learning and Robust04.
When LLM Judges Inflate Scores: Exploring Overrating in Relevance Assessment cs.IR · 2026-02-19 · unverdicted · none · ref 25
LLMs consistently overrate relevance of inadequate passages in IR evaluations due to biases toward length and lexical features rather than true content match.
"Someone Hid It": Query-Agnostic Black-Box Attacks on LLM-Based Retrieval cs.CR · 2026-01-30 · unverdicted · none · ref 8
A query-agnostic black-box attack uses zero-shot surrogate LLMs and adversarial learning on learnable queries to create transferable injection tokens that alter LLM retriever rankings.
LLMs as Assessors: Right for the Right Reason? cs.IR · 2026-01-13 · unverdicted · none · ref 30
LLMs judge document relevance at a level comparable to humans but frequently highlight different passages, indicating they are often not right for the right reasons and cannot fully replace human assessors.
AI-assisted Protocol Information Extraction For Improved Accuracy and Efficiency in Clinical Trial Workflows cs.IR · 2026-01-19 · unverdicted · none · ref 29
RAG-based LLM extraction reaches 89% accuracy on clinical trial protocols versus 62.6% for standalone models and cuts simulated workflow time by 40%.

ISBN 9798400704314

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer