LLMs lack temporal awareness of medical knowledge, showing gradual performance decline on up-to-date facts, much lower accuracy on historical knowledge (25-54% relative), and inconsistent year-to-year predictions.
Large language models are not robust multiple choice selectors
22 Pith papers cite this work.
representative citing papers
Sci-Rho is a dynamic multilingual visually-grounded symbolic benchmark for STEM problems that reveals robustness gaps in current VLMs between average and worst-case performance.
DecisionBench supplies a fixed task suite, model pool, delegation interface, and multi-axis metrics to evaluate emergent delegation, showing similar quality across awareness conditions but 15-31 point headroom under perfect delegation.
MHGraphBench is a new PrimeKG-derived benchmark that exposes a recognition-to-judgment gap in 15 LLMs on mental health tasks while stressing that results measure KG agreement under constrained interfaces, not clinical capability.
DiagnosticIQ benchmark shows frontier LLMs perform similarly on standard rule-to-action tasks but lose substantial accuracy under distractor expansion and condition inversion, pointing to calibration as the key deployment issue.
Complex adversarial instructions induce positional collapse in LLMs, with extreme cases showing 99.9% concentration on a single response position and zero content sensitivity.
TRIP-Evaluate is a new open multimodal benchmark with 837 text, image, and point-cloud items organized by a role-task-knowledge taxonomy to evaluate large models on transportation workflows.
Language models achieve a perfect LSAT score, with experiments showing that internal thinking phases and a fine-tuned process reward model are key to high performance on logical reasoning questions.
YOMI-Bench is a new benchmark of four tasks for kanji reading and phonological understanding in LLMs, showing low performance even for Japanese-specific and commercial models.
PRIME is a new evaluation framework that creates calibrated conflicts in LLM prompts and finds conflict type affects model behavior more than scale.
Configuration choices alone flip pairwise safety verdicts on every tested alignment benchmark, isolated via a finite-envelope proposition linking disagreement rate to strict ordering reversal.
Presents the TS-Skill benchmark and the SKEvol construction framework to diagnose three composable analytical skills in time-series QA across LLMs and TSLMs.
A training-free attention-guided debiasing framework mitigates position bias in MLLM multi-image retrieval by exploiting the observed mismatch between biased logits and aligned attention maps, yielding over 40% accuracy gains on MS-COCO benchmarks.
LLM safety judges flip verdicts on equivalent policy rewrites up to 9.1% of the time and cannot distinguish meaningful from meaningless changes, requiring new invariance-based reliability metrics.
LLMs implicitly plan answer positions during MCQ generation, as shown by predictive signals in hidden representations and controllable shifts via activation steering.
MANTA is a new multi-turn dynamic benchmark that stress-tests frontier LLMs on animal welfare alignment by generating targeted adversarial follow-ups and scoring across 13 dimensions, with preliminary results showing variance in later turns and format bias in LLM judges.
Omni-modal LLMs exhibit a visual preference that emerges in mid-to-late layers, enabling hallucination detection without task-specific training.
A controlled user study and qualitative survey find that AI assistance raises formalization accuracy for math proofs, with users flexibly combining multiple tools while retaining oversight.
AI evaluations should be reframed as inference tasks grounded in an explicit theory of capability, with an empirical demonstration that results depend on modeling assumptions and a proposed Evaluation Card for transparency.
Domain-adapted LLMs and SLMs do not consistently outperform general models on STRIDE threat classification for 5G, with decoding strategies and model scale affecting validity but gains remaining insufficient for reliable use.