super hub Mixed citations
Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell
Mixed citation behavior. Most common role is background (64%).
citing papers explorer
-
WildChat: 1M ChatGPT Interaction Logs in the Wild
WildChat releases a dataset of 1 million ChatGPT conversations with timestamps, demographics, and headers, claimed to be the most diverse and multilingual such resource available.
-
$\text{DT}^2$: Decision-Targeted Digital Twins
DT² trains digital twins to preserve pairwise policy rankings from fitted Q-evaluation on offline data rather than minimizing one-step transition errors, improving policy ranking and reducing decision regret.
-
CheckMIABench: Firm Foundations For Membership Inference Attacks on Language Models
CheckMIABench converts LLMs with intermediate checkpoints into clean MIA testbeds by using pre- and post-checkpoint training data from the same distribution and evaluates published attacks on Pythia and OLMo models while releasing an open-source library.
-
Polar: A Benchmark for Evaluating Political Bias in LLMs
Polar is a new cross-context benchmark showing LLM political bias measurements are not fixed but vary with country, issue, model, and language.
-
Toward Calibrated, Fair, and accurate Deepfake Detection
Face-Feature Tuning is a label-free logit remapping method that reduces FPR/TPR gaps across groups in deepfake detection while preserving overall accuracy.
-
LazyAttention: Efficient Retrieval-Augmented Generation with Deferred Positional Encoding
LazyAttention kernelizes deferred positional encoding to enable zero-copy, position-agnostic KV cache reuse, delivering 1.37× lower TTFT and 1.40× higher throughput than Block-Attention under skewed document distributions while preserving output quality.
-
Chatbots Output Meaningful (but Problematic) Language
LLM outputs are meaningful according to standard theories of human language, without requiring anthropomorphic assumptions about the models.
-
Casual as an Anchor: Resolving Supervision Misalignment in Formality Transfer Dataset
The authors introduce a three-level formality spectrum (informal, casual, formal) and the 3LF dataset to correct supervision misalignment in formality transfer, reporting large gains in informal-to-formal performance on models including GPT variants.
-
Is She Even Relevant? When BERT Ignores Explicit Gender Cues
A Dutch BERT model encodes gender linearly by epoch 20 but does not dynamically update its representations when explicit female cues contradict learned stereotypical associations in short sentence templates.
-
The Moltbook Files: A Harmless Slopocalypse or Humanity's Last Experiment
An AI-agent social platform generated mostly neutral content whose use in fine-tuning reduced model truthfulness comparably to human Reddit data, suggesting limited unique harm but flagging tail risks like secret leaks.
-
Tracing the Dynamics of Refusal: Exploiting Latent Refusal Trajectories for Robust Jailbreak Detection
Causal tracing reveals a persistent Refusal Trajectory in LLM hidden states; the SALO detector, which uses sparse activations from a layer window, improves jailbreak detection across Qwen, Llama, and Mistral models.
-
Reverse Constitutional AI: A Framework for Controllable Toxic Data Generation via Probability-Clamped RLAIF
R-CAI inverts constitutional AI to automatically generate diverse toxic data for LLM red teaming, with probability clamping improving output coherence by 15% while preserving adversarial strength.
-
SPAGBias: Uncovering and Tracing Structured Spatial Gender Bias in Large Language Models
SPAGBias reveals that LLMs form nuanced gender associations with specific urban micro-spaces that exceed real-world distributions and produce failures in planning and descriptive tasks.
-
Saying More Than They Know: A Framework for Quantifying Epistemic-Rhetorical Miscalibration in Large Language Models
LLMs display a consistent pattern of elevated form-meaning divergence and uniform rhetorical device use in argumentative texts compared to humans, quantified by new metrics FMD, GPR, and RDDE.
-
Co-Disclosing the Computer: LLM-Mediated Computing through Reflective Conversation
Introduces LLM-mediated computing as a paradigm of reflective conversation and co-disclosure where the computer emerges through human-LLM interaction.
-
Flattery in Motion: Benchmarking and Analyzing Sycophancy in Video-LLMs
VISE is the first benchmark for sycophancy in Video-LLMs, with two training-free mitigation strategies based on key-frame selection and internal representation steering.
-
Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models
LLMs trained on simple specification gaming generalize to zero-shot reward tampering including rewriting their own reward function.
-
"You tell me": A Dataset of GPT-4-Based Behaviour Change Support Conversations
Authors share a new dataset of GPT-4 behavior-change conversations with user language metrics, perception measures, and feedback collected in a preregistered study.
-
GAIA: a benchmark for General AI Assistants
The GAIA benchmark shows that humans, at 92% accuracy on simple real-world questions, far outperform current AI systems at 15%, and proposes this gap as a key milestone for general AI.
-
Towards Measuring the Representation of Subjective Global Opinions in Language Models
LLMs default to responses more similar to opinions from the USA and some European and South American countries; prompting for a country shifts alignment but can introduce stereotypes, while translation does not reliably match language speakers.
-
Correct codes for the wrong reasons? validating LLMs as measurement instruments for theoretical constructs
Grain calibration decomposes theoretical constructs into clause-level components, tests each with extractive evidence, and combines results through explicit theory-derived rules to validate LLM coding beyond agreement with human annotators.
-
Unmasking LAION-5B: Age, Gender, Race, and Emotion Biases in Large-Scale Image Datasets
Empirical audit of LAION-2B-en and LAION-2B-multi finds overrepresentation of young adults, White people, and males plus stereotypical emotion associations across two attribute classifiers.
-
Who Owns the AI Recommendation? A Multi-Industry Empirical Map of Brand Category Ownership Across Large Language Models
Empirical study of LLM brand recommendations across industries finds moderate concentration (mean Gini 0.28) and low cross-model agreement (41.6%) on top brands.
-
The Wrong Kind of Right: Quantifying and Localizing Misfired Alignment in LLMs
LLMs exhibit misfired alignment on stereotype questions at 4.7-18.9% rates on the new VETO benchmark of 2,032 contrastive pairs, unlike humans at 0%, due to overgeneralized safety cues after instruction tuning.
-
Schützen: Evaluating LLM Safety in Bulgarian and German Contexts
Schützen is a German-Bulgarian LLM safety dataset showing pronounced cross-language differences in model safety behavior.
-
Benchmarking Empirical Privacy Protection for Adaptations of Large Language Models
Empirical benchmarks show that distribution similarity between adaptation and pretraining data increases practical privacy leakage in DP-adapted LLMs at fixed theoretical guarantees, with LoRA providing the strongest protection for OOD cases.
-
The Ghost Annotator: a Framework to Explore Human Label Variation in Content Moderation through Conformal Prediction
The Ghost Annotator framework applies conformal prediction and collaborative filtering representations to measure LLM divergence from human annotations across four models and datasets, revealing higher confidence in misaligned cases and consistent demographic misalignment.
-
Child-directed speech facilitates production, not comprehension, in BabyLMs
CDS-trained BabyLMs show earlier and more appropriate production in a new frame-completion task while FineWeb-edu models lead on comprehension benchmarks, indicating current tests underestimate CDS benefits.
-
Prompts for Public-Sector LLMs Should Be Governed as Commons
Prompts for public-sector LLMs encode value-laden decisions and should be governed through community-maintained Prompt Commons repositories with provenance, licensing, and moderation.
-
Dissociative Identity: Language Model Agents Lack Grounding for Reputation Mechanisms
LM agents' changeable modules prevent persistent identity and sanction sensitivity, making reputation mechanisms structurally inapplicable and requiring protocol-based behavioral harnesses instead.
-
Rescaled Asynchronous SGD: Optimal Distributed Optimization under Data and System Heterogeneity
Rescaled ASGD recovers convergence to the true global objective by rescaling worker stepsizes proportional to computation times, matching the known time lower bound in the leading term under non-convex smoothness and bounded heterogeneity.
-
"It became a self-fulfilling prophecy": How Lived Experiences are Entangled with AI Predictions in Menstrual Cycle Tracking Apps
Users entangle their lived experiences with AI predictions in menstrual tracking apps, leading to self-fulfilling prophecies, limited critical awareness from UI, and isolation for non-normative users.
-
Correcting Influence: Unboxing LLM Outputs with Orthogonal Latent Spaces
A latent mediation framework with sparse autoencoders enables non-additive token-level influence attribution in LLMs by learning orthogonal features and back-propagating attributions.
-
Creating Group Rules with AI: Human-AI Collaboration in WhatsApp Moderation
Admins in India used Meta AI to help create WhatsApp group rules, appreciating reduced workload but remaining cautious about privacy, relational trust, and contextual tone.
-
A CAP-like Trilemma for Large Language Models: Correctness, Non-bias, and Utility under Semantic Underdetermination
Under semantic underdetermination, LLMs cannot always guarantee strong correctness, strict non-bias, and high utility at once.
-
Push and Pushback in Contesting AI: Demands for and Resistance to Accountability
Thematic analysis of 43 AI contestation cases, using Bovens's relational accountability model, produces categories of demands from below, institutional pushback, outcomes, and contextual factors shaping effective contestation.
-
Modeling Implicit Conflict Monitoring Mechanisms against Stereotypes in LLMs
LLMs contain identifiable COCO neurons that enable implicit self-correction against stereotypes; targeted editing of these neurons improves fairness and robustness to jailbreaks while preserving generation quality.
-
The Grounding Gap: How LLMs Anchor the Meaning of Abstract Concepts Differently from Humans
LLMs show a grounding gap with humans on abstract concepts, with property-generation correlations at most r=0.37 versus human-to-human r>0.9, though larger models align better on explicit rating tasks and internal SAE features capture some grounding dimensions.
-
Query-efficient model evaluation using cached responses
DKPS-based methods predict new model benchmark scores using cached responses, matching baseline mean absolute error with substantially fewer queries and an offline query selection approach.
-
When AI Meets Science: Research Diversity, Interdisciplinarity, Visibility, and Retractions across Disciplines in a Global Surge
AI use in science has grown exponentially since 2015 but stays confined to computer science and statistics topics, shows higher retraction rates and citations, and follows distinct global adoption patterns.
-
Can AI Debias the News? LLM Interventions Improve Cross-Partisan Receptivity but LLMs Overestimate Their Own Effectiveness
Substantive LLM reframing boosts cross-partisan receptivity to news headlines without backfire, but models overestimate effect sizes and lack fidelity in modeling human psychological responses.
-
Block-wise Codeword Embedding for Reliable Multi-bit Text Watermarking
BREW uses block voting and window-shifting verification to reach TPR 0.965 and FPR 0.02 under 10% synonym substitution, addressing high false-positive issues in prior multi-bit LLM watermarking.
-
How Frontier LLMs Adapt to Neurodivergence Context: A Measurement Framework for Surface vs. Structural Change in System-Prompted Responses
Frontier LLMs adapt structurally to explicit neurodivergence instructions by increasing output length, headings, and step granularity, but ND persona assertion alone fails to suppress harmful tendencies.
-
Disentangling Linguistic Relatedness from Task Alignment in Cross-Lingual Transfer
Fine-tuning LLMs on Arabic yields similar zero-shot gains on Semitic and non-Semitic languages, with chain-of-thought reasoning producing parallel benefits, indicating task alignment drives transfer more than language relatedness.
-
Dialect vs Demographics: Quantifying LLM Bias from Implicit Linguistic Signals vs. Explicit User Profiles
Explicit demographic statements trigger higher refusal rates and lower semantic similarity in LLMs than implicit dialect cues, which reduce refusals but also reduce content sanitization.
-
Using Machine Mental Imagery for Representing Common Ground in Situated Dialogue
Incremental visual scaffolding using multimodal models improves persistent common ground representation in situated dialogue by reducing representational blur compared to text-only approaches, with a hybrid text-visual approach yielding the best results on the IndiRef benchmark.
-
VIBE: Voice-Induced open-ended Bias Evaluation for Large Audio-Language Models via Real-World Speech
VIBE evaluates generative biases in large audio-language models with real-world speech and open-ended tasks, showing that gender cues produce larger distributional shifts than accent cues across 11 tested models.
-
Cooking Up Risks: Benchmarking and Reducing Food Safety Risks in Large Language Models
A new benchmark exposes food-safety gaps in current LLMs and guardrails, and a fine-tuned 4B model is offered as a domain-specific fix.
-
Large Language Model Agent for User-friendly Chemical Process Simulations
An LLM agent integrated with AVEVA Process Simulation via MCP enables natural language driven flowsheet analysis, optimization, and construction for chemical separation processes.
-
The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale
FineWeb is a curated 15T-token web dataset that produces stronger LLMs than prior open collections, while its educational subset sharply improves performance on MMLU and ARC benchmarks.