Large language model hacking: Quantifying the hidden risks of using LLMs for text annotation
5 Pith papers cite this work. Polarity classification is still indexing.
Citation roles: background (1)
Citation polarities: background (1)
citing papers explorer
- Agentic-imodels: Evolving agentic interpretability tools via autoresearch
Agentic-imodels evolves scikit-learn regressors via an autoresearch loop to jointly boost predictive performance and LLM-simulatability, improving downstream agentic data science tasks by up to 73% on the BLADE benchmark.
- Navigating the Conceptual Multiverse
The conceptual multiverse system with a verification framework for decision structures helps users in philosophy, AI alignment, and poetry build clearer working maps of open-ended problems by making implicit LLM choices explicit and changeable.
- Safe for Whom? Rethinking How We Evaluate the Safety of LLMs for Real Users
LLM safety evaluations for personal advice must test responses against diverse user vulnerability profiles, since context-blind ratings overestimate safety and realistic prompt context does not fix the problem.
- Researchers waste 80% of LLM annotation costs by classifying one text at a time
Batching texts and stacking variables in LLM prompts reduces annotation costs by over 80% while maintaining accuracy within 2 percentage points of single-item baselines for most models, with errors smaller than human inter-coder disagreement.
- Making Uncertainty Visible: Multiverse Analysis for Robust Computational Social Science
Multiverse analysis of three published computational social science (CSS) studies reveals substantial variation in findings across combinations of methodological decisions and identifies cases of computational failure not reported in the originals.