FLIPS identifies LLM instances with 96% closed-set and 90% open-set accuracy by exploiting biases in generated binary random sequences across 237 instances.
Optimizing temperature for language models with multi-sample inference. arXiv preprint arXiv:2502.05234
5 Pith papers cite this work.
verdicts: UNVERDICTED 5
roles: background 1
polarities: background 1
representative citing papers
SCOPE-RL adds a regularization term built from high-temperature positive samples to quantitatively control entropy dynamics and maintain exploration in RL post-training of reasoning LLMs.
LLMs can generate coherent multimodal behaviors for SIAs that align with intended ability and benevolence levels as confirmed by user perceptions, while also reproducing gender stereotypes.
The paper unifies perspectives on Long CoT in reasoning LLMs by introducing a taxonomy, detailing characteristics of deep reasoning and reflection, and discussing emergence phenomena and future directions.
Zero-shot prompting reaches 59% accuracy at moderate temperatures while chain-of-thought prompting excels at temperature extremes on Olympiad-level math problems, with extended reasoning gains scaling to 14.3x at high temperature.
citing papers explorer
- Towards Trust Calibration in Socially Interactive Agents: Investigating Gendered Multimodal Behaviors Generation with LLMs
LLMs can generate coherent multimodal behaviors for SIAs that align with intended ability and benevolence levels as confirmed by user perceptions, while also reproducing gender stereotypes.
- Temperature-Dependent Performance of Prompting Strategies in Extended Reasoning Large Language Models
Zero-shot prompting reaches 59% accuracy at moderate temperatures while chain-of-thought prompting excels at temperature extremes on Olympiad-level math problems, with extended reasoning gains scaling to 14.3x at high temperature.