On hallucination and predictive uncertainty in conditional language generation
4 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
verdicts: UNVERDICTED 4
roles: background 1
polarities: background 1
representative citing papers
Adapts multi-layer token-level Mahalanobis distance with supervised linear regression to yield improved uncertainty scores for LLM truthfulness tasks.
SelfCheckGPT detects hallucinations by checking consistency across multiple sampled responses from black-box LLMs on WikiBio biography generation tasks.
A conformal prediction filter for retrieval chunks plus an attention-based factuality classifier can raise RAG answer quality by up to 6% and detect inconsistent generations up to 77% of the time.
citing papers explorer
- CyberCertBench: Evaluating LLMs in Cybersecurity Certification Knowledge
  CyberCertBench shows frontier LLMs reach human-expert performance on general IT and networking security but drop on vendor-specific and formal standards questions such as IEC 62443, with a new framework for producing interpretable explanations.
- Token-Level Density-Based Uncertainty Quantification Methods for Eliciting Truthfulness of Large Language Models
  Adapts multi-layer token-level Mahalanobis distance with supervised linear regression to yield improved uncertainty scores for LLM truthfulness tasks.
- SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models
  SelfCheckGPT detects hallucinations by checking consistency across multiple sampled responses from black-box LLMs on WikiBio biography generation tasks.
- Towards Dependable Retrieval-Augmented Generation Using Factual Confidence Prediction
  A conformal prediction filter for retrieval chunks plus an attention-based factuality classifier can raise RAG answer quality by up to 6% and detect inconsistent generations up to 77% of the time.