Evaluating superhuman models with consistency checks

Lukas Fluri, Daniel Paleka · 2023 · arXiv 2306.09983

2 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Correcting Influence: Unboxing LLM Outputs with Orthogonal Latent Spaces

cs.LG · 2026-05-12 · unverdicted · novelty 6.0

A latent mediation framework with sparse autoencoders enables non-additive token-level influence attribution in LLMs by learning orthogonal features and back-propagating attributions.

Argumentative Large Language Models for Explainable and Contestable Claim Verification

cs.CL · 2024-05-03 · unverdicted · novelty 6.0

ArgLLMs build argumentation frameworks from LLMs to support explainable and contestable formal reasoning for claim verification.

citing papers explorer

Showing 2 of 2 citing papers.

Correcting Influence: Unboxing LLM Outputs with Orthogonal Latent Spaces

cs.LG · 2026-05-12 · unverdicted · none · ref 213

Argumentative Large Language Models for Explainable and Contestable Claim Verification

cs.CL · 2024-05-03 · unverdicted · none · ref 18