Chain-of-verification reduces hallucination in large language models
13 Pith papers cite this work. Polarity classification is still indexing.
verdicts: UNVERDICTED 13
roles: background 3
citing papers explorer
- Trust but Verify: Prover-Verifier Deliberation for Selective LLM Prediction
  Prover-verifier deliberation yields a high-confidence subset of LLM answers with ~30pp higher precision than the complement on GPQA Diamond by using defender-challenger dialogues.
- KG-Guard: Graph-Based Hallucination Detection for Knowledge Base Question Answering
  KG-Guard augments knowledge graphs with a virtual question node and uses a graph encoder plus MLP to classify LLM-proposed answers as hallucinations or not, reporting superior F1 scores and downstream improvements on three benchmarks.
- Argus: Evidence Assembly for Scalable Deep Research Agents
  Argus coordinates a Navigator and multiple Searchers via an evidence graph for deep research, reporting average gains of 5.5 points with one Searcher and 12.7 points with eight parallel Searchers across eight benchmarks, reaching 86.2 on BrowseComp with 64 Searchers.
- Weighted Rules under the Stable Model Semantics
  Weighted rules extend stable model semantics to support probabilistic reasoning, model ranking, and statistical inference in answer set programs.
- Logical Consistency as a Bridge: Improving LLM Hallucination Detection via Label Constraint Modeling between Responses and Self-Judgments
  LaaB improves LLM hallucination detection by mapping self-judgment labels back into neural feature space and using mutual learning under logical consistency constraints between responses and meta-judgments.
- Think Through Uncertainty: Improving Long-Form Generation Factuality via Reasoning Calibration
  CURE trains LLMs to reason about uncertainty at the claim level via a structured protocol and multi-stage calibration, improving factual accuracy by up to 39.9% on biography generation while boosting calibration metrics.
- Narrix: Remixing Narrative Strategies from Examples for Story Writing
  Narrix helps novices identify and reuse narrative strategies from examples through visualization and strategy-steered generation, improving retention, confidence, and adaptation over chat interfaces in a 12-person study.
- Corrective Retrieval Augmented Generation
  CRAG improves RAG robustness via a retrieval quality evaluator that triggers web augmentation and a decompose-recompose filter to focus on relevant information, yielding better results on short- and long-form generation tasks.
- ConsisGuard: Aligning Safety Deliberation with Policy Enforcement in LLM Guardrails
  ConsisGuard is a consistency-aware framework that applies Policy-to-Decision Trajectory Distillation and Functional Coupling Alignment to improve policy execution consistency in reasoning-based LLM guardrails on harmfulness detection tasks.
- LCC-LLM: Leveraging Code-Centric Large Language Models for Malware Attribution
  LCC-LLM creates a code-centric dataset and RAG-based LLM framework that reaches 0.634 average semantic similarity on 43 malware tasks and 10/10 pass rate in real-world case studies.
- Learning from AVA: Early Lessons from a Curated and Trustworthy Generative AI for Policy and Development Research
  AVA is a specialized GenAI platform for development policy research that provides verifiable syntheses from World Bank reports and is associated with 2.4-3.9 hours of weekly time savings in a large-scale user evaluation.
- Align Documents to Questions: Question-Oriented Document Rewriting for Retrieval-Augmented Generation
  QREAM rewrites documents to question-focused style using iterative ICL and distilled FT models, boosting RAG performance by up to 8% relative improvement.
- Grounding Multi-Hop Reasoning in Structural Causal Models via Group Relative Policy Optimization
  An SCM-GRPO framework grounds multi-hop reasoning in structural dependency graphs and optimizes chain length via rule-based RL, outperforming baselines on HoVer and EX-FEVER.