Relevant Is Not Warranted: Evidence-Force Calibration for Cited RAG

Junxian You; Pin Qian; Qiaolin Yu; Shuhuai Lin; Sipeng Zhang; Su Wang; Wenxuan Xu; Xiaoyuan Wang; Xinpeng Wei; Yihang Chen

arxiv: 2605.28044 · v1 · pith:WNJJLRRDnew · submitted 2026-05-27 · 💻 cs.AI

keywords: citation, cited, claim, across, calibration, evaluation, evidence-calibrated, evidence-force

Cited RAG evaluation often treats visible sources as a grounding signal, but a real, topically relevant citation can still under-warrant the attached wording. We study this diagnostic failure as citation laundering: a related source is presented as warrant for an over-strong claim. We introduce FORCEBENCH, a contrastive stress test for evidence-force calibration. Each item holds a cited passage fixed and pairs an evidence-calibrated claim with a localized force-raised variant across five operational axes: relation, modality, scope, temporal validity, and numeric specificity. A calibrated evaluator should score the evidence-calibrated claim higher. Headline experiments use a fixed, locality-filtered 198-pair evaluation set. A citation-presence sanity check is uninformative by design; token and entity overlap still violate monotonicity on 32.8% to 36.4% of pairs. Across four reported model judges, standard generic support prompting is insufficient for this force-calibration stress test (aggregate monotonicity violation rate, MVR, of 47.2%), while explicit warrant-strength prompting lowers MVR to 24.5% but remains imperfect. We release the benchmark, prompts, outputs, and plug-in pipeline so citation evaluators can report MVR and force sensitivity alongside conventional support metrics.
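
The pairwise check described in the abstract can be illustrated with a minimal sketch. It is not the released FORCEBENCH pipeline. The `Pair` type, the `token_overlap` scorer, the toy pairs, and the `ties_violate` flag are all hypothetical names introduced here. In particular, the abstract does not say how tied scores are counted, so treating a tie as a violation is an assumption exposed as a parameter.

```python
from typing import Callable, Iterable, NamedTuple


class Pair(NamedTuple):
    """One contrastive item (hypothetical layout, not the released schema)."""
    passage: str       # cited passage, held fixed across the pair
    calibrated: str    # evidence-calibrated claim
    force_raised: str  # localized force-raised variant


def token_overlap(passage: str, claim: str) -> float:
    """Toy baseline: fraction of claim tokens that also occur in the passage."""
    claim_tokens = claim.lower().split()
    if not claim_tokens:
        return 0.0
    passage_tokens = set(passage.lower().split())
    return sum(t in passage_tokens for t in claim_tokens) / len(claim_tokens)


def monotonicity_violation_rate(
    pairs: Iterable[Pair],
    scorer: Callable[[str, str], float],
    ties_violate: bool = True,  # assumption: the abstract does not specify ties
) -> float:
    """Share of pairs where the calibrated claim is NOT scored higher."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one pair")
    violations = 0
    for p in pairs:
        s_cal = scorer(p.passage, p.calibrated)
        s_raised = scorer(p.passage, p.force_raised)
        if s_raised > s_cal or (ties_violate and s_raised == s_cal):
            violations += 1
    return violations / len(pairs)


# Two made-up pairs for illustration only (not drawn from the benchmark).
pairs = [
    # Modality axis: "may" raised to "will"; overlap happens to prefer the calibrated claim.
    Pair("the drug may reduce symptoms in some adults",
         "the drug may reduce symptoms",
         "the drug will reduce symptoms"),
    # Temporal axis: the raised claim reuses passage tokens, so overlap prefers it.
    Pair("sales rose in 2019 and fell in 2020",
         "sales increased in 2019",
         "sales rose in 2020"),
]

mvr = monotonicity_violation_rate(pairs, token_overlap)
print(f"MVR = {mvr:.1%}")  # MVR = 50.0% on these two toy pairs
```

The second toy pair shows why a lexical-overlap scorer can violate monotonicity: the force-raised claim shares more surface tokens with the passage even though the passage does not warrant it. Any scorer with the same `(passage, claim) -> float` signature, such as a model judge wrapped in a function, could be passed in place of `token_overlap`.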

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

When AUC 0.998 Is Not Enough: A Candidate Evaluation Protocol for Hidden-State Probes of Indirect Prompt Injection in Multimodal Computer-Use Agents
cs.LG 2026-06 unverdicted novelty 7.0

High AUC from linear probes on model activations for indirect prompt injection does not license an unqualified claim of malicious-content detection, per a Qwen2.5-VL-7B case study with text and visual controls.

Chains That See, Answers That Don't: A Multi-Aspect Evaluation Recipe for Forced Chain-of-Thought on Video-MME
cs.CV 2026-06 conditional novelty 6.0

Forced CoT produces video-dependent reasoning chains but does not improve MCQ accuracy on Qwen2.5-VL with Video-MME and causes a small drop on the 7B variant.

Beyond Accuracy: Measuring Bias Acknowledgment in Chain-of-Thought Reasoning for Responsible AI Evaluation
cs.LG 2026-06 unverdicted novelty 6.0

GPT-4o and Claude Sonnet 4 show similar susceptibility to bias on GSM8K (1.3% vs 1.2%) but differ sharply in acknowledgment rates (13% vs 75%) under a rubric-defined metric.

Energy-Efficient On-Device RAG on a Mobile NPU: System Design and Benchmark on Snapdragon X Elite
cs.CL 2026-06 unverdicted novelty 6.0

First end-to-end RAG on mobile NPU delivers 18.1x faster prefilling, 4x lower latency and energy than CPU on Snapdragon X Elite with equivalent quality.