pith. sign in

hub

Large language models are not robust multiple choice selectors.arXiv preprint arXiv:2309.03882

16 Pith papers cite this work. Polarity classification is still indexing.

16 Pith papers citing it

hub tools

citation-role summary

background 1 method 1 other 1

citation-polarity summary

years

2026 14 2025 2

representative citing papers

Large Language Models Lack Temporal Awareness of Medical Knowledge

cs.LG · 2026-05-13 · unverdicted · novelty 8.0

LLMs lack temporal awareness of medical knowledge, showing gradual performance decline on up-to-date facts, much lower accuracy on historical knowledge (25-54% relative), and inconsistent year-to-year predictions.

AI Achieves a Perfect LSAT Score

cs.AI · 2026-04-11 · unverdicted · novelty 7.0

Language models achieve a perfect LSAT score, with experiments showing that internal thinking phases and a fine-tuned process reward model are key to high performance on logical reasoning questions.

MANTA: Multi-turn Assessment for Nonhuman Thinking & Alignment

cs.CY · 2026-04-18 · unverdicted · novelty 6.0

MANTA is a new multi-turn dynamic benchmark that stress-tests frontier LLMs on animal welfare alignment by generating targeted adversarial follow-ups and scoring across 13 dimensions, with preliminary results showing variance in later turns and format bias in LLM judges.

Position: AI Evaluations Should be Grounded on a Theory of Capability

cs.AI · 2025-09-23 · conditional · novelty 5.0

AI evaluations should be reframed as inference tasks grounded in an explicit theory of capability, with an empirical demonstration that results depend on modeling assumptions and a proposed Evaluation Card for transparency.

citing papers explorer

Showing 16 of 16 citing papers.