Findings of the Association for Computational Linguistics: NAACL 2024

Pezeshkpour, P · 2024 · DOI 10.18653/v1/2024.findings-naacl.130

8 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 1

citation-polarity summary

unclear 1

representative citing papers

Evaluating Commercial AI Chatbots as News Intermediaries

cs.CL · 2026-05-21 · conditional · novelty 7.0

Commercial AI chatbots reach over 90% multiple-choice accuracy on recent news facts but lose 11-17% in free response and drop to 19-70% on subtle false-premise questions, with retrieval failures causing most errors and clear Anglophone bias.

TRIP-Evaluate: An Open Multimodal Benchmark for Evaluating Large Models in Transportation

cs.CV · 2026-04-29 · accept · novelty 7.0

TRIP-Evaluate is a new open multimodal benchmark with 837 text, image, and point-cloud items organized by a role-task-knowledge taxonomy to evaluate large models on transportation workflows.

More Thinking, More Bias: Length-Driven Position Bias in Reasoning Models

cs.AI · 2026-04-21 · unverdicted · novelty 7.0

Position bias scales positively with reasoning trajectory length in CoT models, shown by partial correlations and truncation interventions across multiple benchmarks and model scales.

MixRea: Benchmarking Explicit-Implicit Reasoning in Large Language Models

cs.CL · 2026-05-19 · unverdicted · novelty 6.0

MixRea benchmark reveals LLMs achieve at most 42.8% consistency on explicit-implicit reasoning tasks, with PRCP prompting proposed to recover overlooked relations.

Stop Drawing Scientific Claims from LLM Social Simulations Without Robustness Audits

physics.soc-ph · 2026-05-17 · accept · novelty 6.0

Minor perturbations in persona format, instruction framing, and network structure shift cooperation by up to 76 percentage points and polarization metrics consistently, showing that LLM social simulations require per-claim robustness audits via the new TRAILS taxonomy.

LPDS: Evaluating LLM Robustness Through Logic-Preserving Difficulty Scaling

cs.LG · 2026-05-14 · conditional · novelty 6.0

LPDS quantifies difficulty of logic-preserving problem variations and searches for the hardest ones, producing up to 5x larger performance drops than random sampling and better robustness gains from fine-tuning on difficult examples.

Do Large Language Models Plan Answer Positions? Position Bias in Multiple-Choice Question Generation

cs.CL · 2026-05-03 · unverdicted · novelty 6.0

LLMs implicitly plan answer positions during MCQ generation, as shown by predictive signals in hidden representations and controllable shifts via activation steering.

On the Shelf Life of Fine-Tuned LLM-Judges: Future-Proofing, Backward-Compatibility, and Question Generalization

cs.CL · 2025-09-28 · unverdicted · novelty 6.0

Fine-tuned LLM judges struggle with future-proofing to newer generators but maintain backward-compatibility more easily; DPO training and continual learning improve adaptation while all models degrade on unseen questions.

citing papers explorer

Showing 8 of 8 citing papers.

Evaluating Commercial AI Chatbots as News Intermediaries · cs.CL · 2026-05-21 · conditional · none · ref 40
TRIP-Evaluate: An Open Multimodal Benchmark for Evaluating Large Models in Transportation · cs.CV · 2026-04-29 · accept · none · ref 32
More Thinking, More Bias: Length-Driven Position Bias in Reasoning Models · cs.AI · 2026-04-21 · unverdicted · none · ref 16
MixRea: Benchmarking Explicit-Implicit Reasoning in Large Language Models · cs.CL · 2026-05-19 · unverdicted · none · ref 46
Stop Drawing Scientific Claims from LLM Social Simulations Without Robustness Audits · physics.soc-ph · 2026-05-17 · accept · none · ref 47
LPDS: Evaluating LLM Robustness Through Logic-Preserving Difficulty Scaling · cs.LG · 2026-05-14 · conditional · none · ref 51
Do Large Language Models Plan Answer Positions? Position Bias in Multiple-Choice Question Generation · cs.CL · 2026-05-03 · unverdicted · none · ref 20
On the Shelf Life of Fine-Tuned LLM-Judges: Future-Proofing, Backward-Compatibility, and Question Generalization · cs.CL · 2025-09-28 · unverdicted · none · ref 28
