ActuBench is a multi-agent LLM pipeline for generating and evaluating actuarial reasoning tasks, with evaluations of 50 models showing effective verification, competitive local open-weights models, and differing rankings between MCQ and LLM-judge scoring.
Nlp evaluation in trouble: On the need to measure llm data contamination for each benchmark
5 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
roles
background 1polarities
background 1representative citing papers
LLMs show mixed results on authorship verification, post generation, and attribute inference from Twitter data, with new frameworks and user studies establishing benchmarks for these analytics tasks.
LLMs achieve higher accuracy than humans on compositional imagery tasks previously argued to require pictorial representations, supporting emergent propositional mental imagery in AI.
League of LLMs organizes LLMs into a self-governed mutual evaluation league using dynamic, transparent, objective, and professional criteria to distinguish model capabilities with 70.7% top-k ranking stability.
citing papers explorer
-
Assessing Capabilities of Large Language Models in Social Media Analytics: A Multi-task Quest
LLMs show mixed results on authorship verification, post generation, and attribute inference from Twitter data, with new frameworks and user studies establishing benchmarks for these analytics tasks.