Questioning the Survey Responses of Large Language Models
4 Pith papers cite this work.
citing papers explorer
- Do Large Language Models Plan Answer Positions? Position Bias in Multiple-Choice Question Generation
  LLMs implicitly plan answer positions during MCQ generation, as shown by predictive signals in hidden representations and controllable shifts via activation steering.
- Graph-Based Alternatives to LLMs for Human Simulation
  GEMS formulates close-ended human-behavior simulation as link prediction on a heterogeneous graph and matches or exceeds LLM performance with three orders of magnitude fewer parameters across three datasets and three evaluation settings.
- AI-Augmented Surveys: Leveraging Large Language Models and Surveys for Opinion Prediction
  LLM embeddings enable strong retrodiction of masked GSS opinions via cross-validation and external validation but only modest performance on entirely unasked opinions.
- Position: Stop Evaluating AI with Human Tests, Develop Principled, AI-specific Tests instead
  Human tests should not be applied to AI to measure traits like intelligence due to calibration, validity, contamination, and prompt sensitivity issues; develop AI-specific evaluation frameworks instead.