MiCP is the first conformal prediction method for multi-turn LLM pipelines that allocates per-turn error budgets to enable adaptive stopping with an overall coverage guarantee, shown to reduce turns and cost on RAG and ReAct benchmarks.
Canonical reference
InProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Canonical reference. 100% of citing Pith papers cite this work as background.
citation-role summary
citation-polarity summary
verdicts
UNVERDICTED 15roles
background 5polarities
background 5representative citing papers
Controlled experiments across six benchmarks and four models show RAG context enrichment with metadata, structure, or strategies mostly lowers accuracy, with model-context alignment as the determining factor.
Defines cost-aware RAG with evidence cost tiers and shows static selectors are brittle while agentic LLM-based selection is promising but model-dependent.
PowerCodeBench and a boundary-aware intervention raise LLM accuracy on power-system code generation by 32-56 points across ten open-weight models and four commercial APIs on a 2,000-task benchmark.
RISC reformulates self-consistency answer selection as a ranking task solved by a lightweight LambdaRank model with five hand-designed features, yielding better accuracy-efficiency trade-offs than majority voting on QA benchmarks.
RASER routers built on one-shot RAG features selectively escalate retrieval, matching SOTA F1 scores on multi-hop QA while using 41-49% of the tokens required by always-prune across six LLMs and three benchmarks.
Mem-π is a framework using a dedicated model and decision-content decoupled RL to generate context-specific guidance on demand for LLM agents, outperforming retrieval baselines by over 30% on web navigation.
Introduces predictive prefetching for RAG that anticipates retrieval needs several tokens ahead via three components, reporting up to 43.5% latency reduction and 62.4% TTFT improvement while preserving answer quality.
NeocorRAG uses Evidence Chains to achieve SOTA retrieval quality in RAG on HotpotQA, 2WikiMultiHopQA, MuSiQue, and NQ for 3B and 70B models while using under 20% of the tokens of comparable methods.
R³AG routes queries to retrievers by decomposing capabilities into retrieval quality and generation utility, trained via contrastive learning on document assessments and downstream answer correctness to outperform static methods.
Marketplace Evaluation uses repeated-interaction simulations to assess information access systems with marketplace-level metrics such as retention and market share that complement traditional accuracy measures.
KnowSA_CKP uses comparative knowledge probing to selectively augment LLM prompts for items with knowledge gaps, improving recommendation accuracy and context efficiency.
A single hub text can unreasonably match many images in CLIP-based similarity, exposing vulnerabilities in cross-modal encoders for caption evaluation and retrieval.
LTRR learns to rank a pool of retrievers by their expected contribution to RAG answer correctness and shows that query-dependent selection beats the best single retriever on QA benchmarks.
An adaptive thresholding mechanism combined with sliding-window reranking retrieves a query-dependent number of tables from large corpora, improving retrieval and downstream text-to-SQL performance on Spider, BIRD, and Spider 2.0.
citing papers explorer
-
Adaptive Stopping for Multi-Turn LLM Reasoning
MiCP is the first conformal prediction method for multi-turn LLM pipelines that allocates per-turn error budgets to enable adaptive stopping with an overall coverage guarantee, shown to reduce turns and cost on RAG and ReAct benchmarks.
-
Metadata, Structure, or Strategy? A Decomposition of RAG Context Enrichment
Controlled experiments across six benchmarks and four models show RAG context enrichment with metadata, structure, or strategies mostly lowers accuracy, with model-context alignment as the determining factor.
-
When Knowledge Is Not Free: Cost-Aware Evidence Selection in Retrieval-Augmented Generation
Defines cost-aware RAG with evidence cost tiers and shows static selectors are brittle while agentic LLM-based selection is promising but model-dependent.
-
Knowledge Boundary Probing and Demand-Guided Intervention for LLM-Based Power System Code Generation
PowerCodeBench and a boundary-aware intervention raise LLM accuracy on power-system code generation by 32-56 points across ten open-weight models and four commercial APIs on a 2,000-task benchmark.
-
Boosting Self-Consistency with Ranking
RISC reformulates self-consistency answer selection as a ranking task solved by a lightweight LambdaRank model with five hand-designed features, yielding better accuracy-efficiency trade-offs than majority voting on QA benchmarks.
-
RASER: Recoverability-Aware Selective Escalation Router for Multi-Hop Question Answering
RASER routers built on one-shot RAG features selectively escalate retrieval, matching SOTA F1 scores on multi-hop QA while using 41-49% of the tokens required by always-prune across six LLMs and three benchmarks.
-
Mem-$\pi$: Adaptive Memory through Learning When and What to Generate
Mem-π is a framework using a dedicated model and decision-content decoupled RL to generate context-specific guidance on demand for LLM agents, outperforming retrieval baselines by over 30% on web navigation.
-
Predictive Prefetching for Retrieval-Augmented Generation
Introduces predictive prefetching for RAG that anticipates retrieval needs several tokens ahead via three components, reporting up to 43.5% latency reduction and 62.4% TTFT improvement while preserving answer quality.
-
NeocorRAG: Less Irrelevant Information, More Explicit Evidence, and More Effective Recall via Evidence Chains
NeocorRAG uses Evidence Chains to achieve SOTA retrieval quality in RAG on HotpotQA, 2WikiMultiHopQA, MuSiQue, and NQ for 3B and 70B models while using under 20% of the tokens of comparable methods.
-
R$^3$AG: Retriever Routing for Retrieval-Augmented Generation
R³AG routes queries to retrievers by decomposing capabilities into retrieval quality and generation utility, trained via contrastive learning on document assessments and downstream answer correctness to outperform static methods.
-
Evaluation of Agents under Simulated AI Marketplace Dynamics
Marketplace Evaluation uses repeated-interaction simulations to assess information access systems with marketplace-level metrics such as retention and market share that complement traditional accuracy measures.
-
Filling the Gaps: Selective Knowledge Augmentation for LLM Recommenders
KnowSA_CKP uses comparative knowledge probing to selectively augment LLM prompts for items with knowledge gaps, improving recommendation accuracy and context efficiency.
-
One Single Hub Text Breaks CLIP: Identifying Vulnerabilities in Cross-Modal Encoders via Hubness
A single hub text can unreasonably match many images in CLIP-based similarity, exposing vulnerabilities in cross-modal encoders for caption evaluation and retrieval.
-
LTRR: Learning To Rank Retrievers for LLMs
LTRR learns to rank a pool of retrievers by their expected contribution to RAG answer correctness and shows that query-dependent selection beats the best single retriever on QA benchmarks.
-
Retrieve Only Relevant Tables Whether Few or Many: Adaptive Table Retrieval Method
An adaptive thresholding mechanism combined with sliding-window reranking retrieves a query-dependent number of tables from large corpora, improving retrieval and downstream text-to-SQL performance on Spider, BIRD, and Spider 2.0.