pith. sign in

hub

Large language models are not robust multiple choice selectors

16 Pith papers cite this work. Polarity classification is still indexing.

16 Pith papers citing it

hub tools

citation-role summary

background 1 method 1 other 1

citation-polarity summary

years

2026 14 2025 2

clear filters

representative citing papers

Large Language Models Lack Temporal Awareness of Medical Knowledge

cs.LG · 2026-05-13 · unverdicted · novelty 8.0

LLMs lack temporal awareness of medical knowledge, showing gradual performance decline on up-to-date facts, much lower accuracy on historical knowledge (25-54% relative), and inconsistent year-to-year predictions.

AI Achieves a Perfect LSAT Score

cs.AI · 2026-04-11 · unverdicted · novelty 7.0

Language models achieve a perfect LSAT score, with experiments showing that internal thinking phases and a fine-tuned process reward model are key to high performance on logical reasoning questions.

MANTA: Multi-turn Assessment for Nonhuman Thinking & Alignment

cs.CY · 2026-04-18 · unverdicted · novelty 6.0

MANTA is a new multi-turn dynamic benchmark that stress-tests frontier LLMs on animal welfare alignment by generating targeted adversarial follow-ups and scoring across 13 dimensions, with preliminary results showing variance in later turns and format bias in LLM judges.

Position: AI Evaluations Should be Grounded on a Theory of Capability

cs.AI · 2025-09-23 · conditional · novelty 5.0

AI evaluations should be reframed as inference tasks grounded in an explicit theory of capability, with an empirical demonstration that results depend on modeling assumptions and a proposed Evaluation Card for transparency.

citing papers explorer

Showing 11 of 11 citing papers after filters.