hub

Large language models are not robust multiple choice selectors

· 2024 · arXiv 2309.03882

22 Pith papers cite this work. Polarity classification is still indexing.

22 Pith papers citing it

read on arXiv browse 22 citing papers

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 1 method 1 other 1

citation-polarity summary

background 1 unclear 1 use method 1

representative citing papers

Large Language Models Lack Temporal Awareness of Medical Knowledge

cs.LG · 2026-05-13 · unverdicted · novelty 8.0

LLMs lack temporal awareness of medical knowledge, showing gradual performance decline on up-to-date facts, much lower accuracy on historical knowledge (25-54% relative), and inconsistent year-to-year predictions.

Sci-Rho: A Multilingual Visually-Grounded Symbolic Benchmark for STEM Problems

cs.CV · 2026-06-06 · unverdicted · novelty 7.0

Sci-Rho is a dynamic multilingual visually-grounded symbolic benchmark for STEM problems that reveals robustness gaps in current VLMs between average and worst-case performance.

DecisionBench: A Benchmark for Emergent Delegation in Long-Horizon Agentic Workflows

cs.AI · 2026-05-18 · unverdicted · novelty 7.0

DecisionBench supplies a fixed task suite, model pool, delegation interface, and multi-axis metrics to evaluate emergent delegation, showing similar quality across awareness conditions but 15-31 point headroom under perfect delegation.

MHGraphBench: Knowledge Graph-Grounded Benchmarking of Mental Health Knowledge in Large Language Models

cs.CL · 2026-05-15 · conditional · novelty 7.0

MHGraphBench is a new PrimeKG-derived benchmark that exposes a recognition-to-judgment gap in 15 LLMs on mental health tasks while stressing that results measure KG agreement under constrained interfaces, not clinical capability.

DiagnosticIQ: A Benchmark for LLM-Based Industrial Maintenance Action Recommendation from Symbolic Rules

cs.AI · 2026-05-09 · unverdicted · novelty 7.0

DiagnosticIQ benchmark shows frontier LLMs perform similarly on standard rule-to-action tasks but lose substantial accuracy under distractor expansion and condition inversion, pointing to calibration as the key deployment issue.

Instruction Complexity Induces Positional Collapse in Adversarial LLM Evaluation

cs.CL · 2026-04-29 · unverdicted · novelty 7.0

Complex adversarial instructions induce positional collapse in LLMs, with extreme cases showing 99.9% concentration on a single response position and zero content sensitivity.

TRIP-Evaluate: An Open Multimodal Benchmark for Evaluating Large Models in Transportation

cs.CV · 2026-04-29 · accept · novelty 7.0

TRIP-Evaluate is a new open multimodal benchmark with 837 text, image, and point-cloud items organized by a role-task-knowledge taxonomy to evaluate large models on transportation workflows.

AI Achieves a Perfect LSAT Score

cs.AI · 2026-04-11 · unverdicted · novelty 7.0

Language models achieve a perfect LSAT score, with experiments showing that internal thinking phases and a fine-tuned process reward model are key to high performance on logical reasoning questions.

YOMI-Bench: A Benchmark for Evaluating Kanji Reading and Phonological Understanding of LLMs for Japanese

cs.CL · 2026-07-01 · unverdicted · novelty 6.0

YOMI-Bench is a new benchmark of four tasks for kanji reading and phonological understanding in LLMs, showing low performance even for Japanese-specific and commercial models.

PRIME: Evaluating Prompt Resolution Under Incompatible Instructions in LLMs

cs.AI · 2026-06-21 · unverdicted · novelty 6.0

PRIME is a new evaluation framework that creates calibrated conflicts in LLM prompts and finds conflict type affects model behavior more than scale.

SafetyRepro: Configuration-Conditional Rank Instability on Alignment Benchmarks

cs.LG · 2026-05-25 · unverdicted · novelty 6.0

Configuration choices alone flip pairwise safety verdicts on every tested alignment benchmark, isolated via a finite-envelope proposition linking disagreement rate to strict ordering reversal.

TS-Skill: A Benchmark for Evaluating Analytical Skills in Time-Series Question Answering

cs.CL · 2026-05-23 · unverdicted · novelty 6.0

Presents TS-Skill benchmark and SKEvol construction framework to diagnose three composable analytical skills in time-series QA across LLMs and TSLMs.

Logit-Attention Divergence: Mitigating Position Bias in Multi-Image Retrieval via Attention-Guided Calibration

cs.CV · 2026-05-12 · unverdicted · novelty 6.0

A training-free attention-guided debiasing framework mitigates position bias in MLLM multi-image retrieval by exploiting the observed mismatch between biased logits and aligned attention maps, yielding over 40% accuracy gains on MS-COCO benchmarks.

Beyond Accuracy: Policy Invariance as a Reliability Test for LLM Safety Judges

cs.AI · 2026-05-07 · unverdicted · novelty 6.0

LLM safety judges flip verdicts on equivalent policy rewrites up to 9.1% of the time and cannot distinguish meaningful from meaningless changes, requiring new invariance-based reliability metrics.

Do Large Language Models Plan Answer Positions? Position Bias in Multiple-Choice Question Generation

cs.CL · 2026-05-03 · unverdicted · novelty 6.0

LLMs implicitly plan answer positions during MCQ generation, as shown by predictive signals in hidden representations and controllable shifts via activation steering.

MANTA: Multi-turn Assessment for Nonhuman Thinking & Alignment

cs.CY · 2026-04-18 · unverdicted · novelty 6.0

MANTA is a new multi-turn dynamic benchmark that stress-tests frontier LLMs on animal welfare alignment by generating targeted adversarial follow-ups and scoring across 13 dimensions, with preliminary results showing variance in later turns and format bias in LLM judges.

Beyond Text-Dominance: Understanding Modality Preference of Omni-modal Large Language Models

cs.AI · 2026-04-18 · unverdicted · novelty 6.0

Omni-modal LLMs exhibit visual preference that emerges in mid-to-late layers, enabling hallucination detection without task-specific training.

Characterizing initial human-AI proof formalization workflows

cs.AI · 2026-06-02 · unverdicted · novelty 5.0

A controlled user study and qualitative survey find that AI assistance raises formalization accuracy for math proofs, with users flexibly combining multiple tools while retaining oversight.

Position: AI Evaluations Should be Grounded on a Theory of Capability

cs.AI · 2025-09-23 · conditional · novelty 5.0

AI evaluations should be reframed as inference tasks grounded in an explicit theory of capability, with an empirical demonstration that results depend on modeling assumptions and a proposed Evaluation Card for transparency.

Threat Modelling using Domain-Adapted Language Models: Empirical Evaluation and Insights

cs.CR · 2026-05-11 · unverdicted · novelty 3.0

Domain-adapted LLMs and SLMs do not consistently outperform general models on STRIDE threat classification for 5G, with decoding strategies and model scale affecting validity but gains remaining insufficient for reliable use.

SciHorizon-GENE: Benchmarking LLM for Life Sciences Inference from Gene Knowledge to Functional Understanding

q-bio.GN · 2026-01-19

MVI-Bench: A Comprehensive Benchmark for Evaluating Robustness to Misleading Visual Inputs in LVLMs

cs.CV · 2025-11-18

citing papers explorer

Showing 2 of 2 citing papers after filters.

SciHorizon-GENE: Benchmarking LLM for Life Sciences Inference from Gene Knowledge to Functional Understanding q-bio.GN · 2026-01-19 · unreviewed · ref 70
MVI-Bench: A Comprehensive Benchmark for Evaluating Robustness to Misleading Visual Inputs in LVLMs cs.CV · 2025-11-18 · unreviewed · ref 67

Large language models are not robust multiple choice selectors

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer