Audit trails for accountability in large language models
7 Pith papers cite this work. Polarity classification is still indexing.
Years: 2026 (7)
Roles: background (2)
Representative citing papers
Analysis of Canada's Federal AI Register reveals it frames AI as reliable internal tooling by obscuring sociotechnical elements like human discretion, turning transparency into performative compliance.
No agent system can be accountable without auditability, which requires five dimensions (action recoverability, lifecycle coverage, policy checkability, responsibility attribution, evidence integrity) and mechanisms for detection, enforcement, and recovery.
This survey defines execution provenance as a typed graph of agent execution and evidence tracing as its projection onto evidence-support relations, then reviews methods, taxonomy, benchmarks, and challenges for auditable LLM agents.
Argues that trustworthiness in Agent-to-Agent networks requires a new conceptual framework with four design pillars baked in from the beginning, as retrofitting existing single-agent methods is insufficient.
Explicit provenance across the full agentic AI lifecycle is the necessary condition for making responsibility computable and actionable.
A statistical survey of RLHF for LLM alignment that connects preference learning and policy optimization to models like Bradley-Terry-Luce while reviewing methods, extensions, and open challenges.
citing papers explorer
- As It Was: Aligning LLM Search Evaluation with Historical User Preferences
Augmenting LLM search judges with historical QRI cards improves Spearman correlation with user preferences by ~5% overall (91% relative on disagreements) and 15% in multilingual settings, with better alignment to live A/B test outcomes.