Relevant Is Not Warranted: Evidence-Force Calibration for Cited RAG

2026 · cs.AI · arXiv 2605.28044

4 Pith papers cite this work. Polarity classification is still indexing.

abstract

Cited RAG evaluation often treats visible sources as a grounding signal, but a real, topically relevant citation can still under-warrant the attached wording. We study this diagnostic failure as citation laundering: a related source is presented as warrant for an over-strong claim. We introduce FORCEBENCH, a contrastive stress test for evidence-force calibration. Each item holds a cited passage fixed and pairs an evidence-calibrated claim with a localized force-raised variant across five operational axes: relation, modality, scope, temporal validity, and numeric specificity. A calibrated evaluator should score the evidence-calibrated claim higher. Headline experiments use a fixed, locality-filtered 198-pair evaluation set. A citation-presence sanity check is uninformative by design; token and entity overlap still violate monotonicity on 32.8–36.4% of pairs. Across four reported model judges, standard generic support prompting is insufficient for this force-calibration stress test (aggregate MVR 47.2%), while explicit warrant-strength prompting lowers MVR to 24.5% but remains imperfect. We release the benchmark, prompts, outputs, and plug-in pipeline so citation evaluators can report monotonicity violation rate and force sensitivity alongside conventional support metrics.
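
The monotonicity violation rate (MVR) named in the abstract can be sketched as follows. This is a minimal illustration and not the released FORCEBENCH pipeline: the `Pair` type, the `token_overlap` scorer, the example pairs, and the choice to count ties as violations are all assumptions made here, since the abstract does not specify them.

```python
from typing import Callable, Iterable, NamedTuple


class Pair(NamedTuple):
    """One contrastive item (hypothetical layout, not the benchmark schema)."""
    passage: str       # cited passage, held fixed for both claims
    calibrated: str    # evidence-calibrated claim
    force_raised: str  # localized force-raised variant


def token_overlap(passage: str, claim: str) -> float:
    """Toy lexical scorer: fraction of claim tokens found in the passage."""
    claim_tokens = claim.lower().split()
    if not claim_tokens:
        return 0.0
    passage_tokens = set(passage.lower().split())
    return sum(t in passage_tokens for t in claim_tokens) / len(claim_tokens)


def monotonicity_violation_rate(
    pairs: Iterable[Pair],
    score: Callable[[str, str], float],
    ties_violate: bool = True,  # assumption: a tie fails to prefer the calibrated claim
) -> float:
    """Share of pairs where the calibrated claim is NOT scored strictly higher."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one pair")
    violations = 0
    for p in pairs:
        s_cal = score(p.passage, p.calibrated)
        s_raised = score(p.passage, p.force_raised)
        if s_cal < s_raised or (ties_violate and s_cal == s_raised):
            violations += 1
    return violations / len(pairs)


# Invented example pairs, for illustration only.
EXAMPLE_PAIRS = [
    # Scope axis: "some" vs. "all".
    Pair(
        passage="the drug reduced symptoms in 40 of 100 adult patients",
        calibrated="the drug reduced symptoms in some adult patients",
        force_raised="the drug reduced symptoms in all patients",
    ),
    # Relation axis: temporal sequence vs. causation.
    Pair(
        passage="sales rose after the campaign launched possibly because of the holiday season",
        calibrated="sales rose following the campaign",
        force_raised="sales rose because of the campaign",
    ),
]

mvr = monotonicity_violation_rate(EXAMPLE_PAIRS, token_overlap)
```

In the second invented pair every token of the force-raised claim appears in the passage, so the overlap scorer prefers the over-strong wording. A scorer that returns the same value for both claims of a pair (as any check that looks only at whether a citation is present would) gets an MVR of 1.0 or 0.0 depending solely on how ties are counted.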

representative citing papers

Energy-Efficient On-Device RAG on a Mobile NPU: System Design and Benchmark on Snapdragon X Elite

cs.CL · 2026-06-09 · unverdicted · novelty 6.0

First end-to-end RAG on mobile NPU delivers 18.1x faster prefilling, 4x lower latency and energy than CPU on Snapdragon X Elite with equivalent quality.

Cornerstones or Stumbling Blocks? Deciphering the Rock Tokens in On-Policy Distillation

cs.CL · 2026-05-10 · unverdicted · novelty 6.0 · 2 refs

Rock Tokens in on-policy distillation persist at high loss, account for up to 18% of outputs, absorb large gradient norms, but add negligible value to reasoning performance.

MOSAIC: Orchestrating Collaborative Knowledge Tracing with Hierarchical Semantic Alignment

cs.LG · 2026-06-27 · unverdicted · novelty 5.0

MOSAIC combines frozen-LLM semantic embeddings with hierarchical consistency objectives to report up to 3.4% AUC gains on knowledge-tracing benchmarks including a new MOOC dataset.

DriftGuard: Safety-Aware Multi-Monitor Detection and Selective Adaptation for Evolving Toxicity Moderation

cs.CL · 2026-06-27 · unverdicted · novelty 5.0

DriftGuard introduces multi-monitor safety-aware drift detection paired with hard-mix selective adaptation, reporting toxic recall gains to 0.8777 on Civil Comments and 0.8523 on DynaHate under temporal and cross-dataset shifts.

citing papers explorer

Showing 3 of 3 citing papers after filters.

Energy-Efficient On-Device RAG on a Mobile NPU: System Design and Benchmark on Snapdragon X Elite

cs.CL · 2026-06-09 · unverdicted · none · ref 29 · internal anchor

First end-to-end RAG on mobile NPU delivers 18.1x faster prefilling, 4x lower latency and energy than CPU on Snapdragon X Elite with equivalent quality.

Cornerstones or Stumbling Blocks? Deciphering the Rock Tokens in On-Policy Distillation

cs.CL · 2026-05-10 · unverdicted · none · ref 66 · 2 links · internal anchor

Rock Tokens in on-policy distillation persist at high loss, account for up to 18% of outputs, absorb large gradient norms, but add negligible value to reasoning performance.

DriftGuard: Safety-Aware Multi-Monitor Detection and Selective Adaptation for Evolving Toxicity Moderation

cs.CL · 2026-06-27 · unverdicted · none · ref 12 · internal anchor

DriftGuard introduces multi-monitor safety-aware drift detection paired with hard-mix selective adaptation, reporting toxic recall gains to 0.8777 on Civil Comments and 0.8523 on DynaHate under temporal and cross-dataset shifts.