WildHallucinations: Evaluating Long-form Factuality in LLMs with Real-World Entity Queries
4 Pith papers cite this work. Polarity classification is still indexing.
verdicts
UNVERDICTED 4
representative citing papers
RISC reformulates self-consistency answer selection as a ranking task solved by a lightweight LambdaRank model with five hand-designed features, yielding better accuracy-efficiency trade-offs than majority voting on QA benchmarks.
LoVeC uses RL to train LLMs to output verbalized numerical confidence scores for statements in long-form text, achieving better calibration than self-consistency baselines on QA datasets while being 20x faster.
Per-Entity Bias Mapping claims aggregate visibility metrics fail because large brands exhibit higher fabricated citation rates than smaller ones in AI responses, attributed to the Brand Hallucination Paradox.
citing papers explorer
To Call or Not to Call: A Framework to Assess and Optimize LLM Tool Calling
A normative-descriptive framework shows LLMs' tool-calling perceptions misalign with true need/utility for web search, and hidden-state estimators improve decisions over self-perceived baselines.