ELDR reduces median TPOT by 5.9-13.9% in PD-disaggregated MoE serving via expert signatures from prefill, K-means partitioning, and locality-band routing with KV-co-indexed signature cache.
hub Baseline reference
WildChat: 1M ChatGPT Interaction Logs in the Wild
Baseline reference. 60% of citing Pith papers use this work as a benchmark or comparison.
abstract
Chatbots such as GPT-4 and ChatGPT are now serving millions of users. Despite their widespread use, there remains a lack of public datasets showcasing how these tools are used by a population of users in practice. To bridge this gap, we offered free access to ChatGPT for online users in exchange for their affirmative, consensual opt-in to anonymously collect their chat transcripts and request headers. From this, we compiled WildChat, a corpus of 1 million user-ChatGPT conversations, which consists of over 2.5 million interaction turns. We compare WildChat with other popular user-chatbot interaction datasets, and find that our dataset offers the most diverse user prompts, contains the largest number of languages, and presents the richest variety of potentially toxic use-cases for researchers to study. In addition to timestamped chat transcripts, we enrich the dataset with demographic data, including state, country, and hashed IP addresses, alongside request headers. This augmentation allows for more detailed analysis of user behaviors across different geographical regions and temporal dimensions. Finally, because it captures a broad range of use cases, we demonstrate the dataset's potential utility in fine-tuning instruction-following models. WildChat is released at https://wildchat.allen.ai under AI2 ImpACT Licenses.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
Analysis of 500k ChatGPT logs shows over one-third of conversations generate fiction, dominated by power users with repetitive and niche patterns.
Introduces Situated Interaction Auditing (SIA) to examine how user sociodemographic signals affect LLM response quality, content, and tone in personal interactions.
RealClawBench turns 281 real OpenClaw sessions into reproducible tasks that preserve the original distribution and shows the best of 14 models solves only 65.8 percent.
Instruction tuning makes late-layer computation depend more on the model's own post-trained upstream state than on base-model upstream state, producing a consistent +1.68 logit interaction effect across five model families.
An identification theorem shows that a randomized experiment and simulator together recover causal model values from confounded logs, with logs used only afterward to reduce estimation error.
CacheFlow cuts TTFT by 10-62% in batched LLM serving via 3D-parallel KV cache restoration and a two-pointer scheduler that overlaps recompute and I/O.
TOMPA performs black-box adversarial optimization in token space to discover non-linguistic patterns that nearly double the reward scores of GPT-5 answers on Skywork-Reward-V2 while producing gibberish text.
PERMA is a new benchmark using temporally ordered events, text variability, and linguistic alignment to evaluate LLM memory agents on persona consistency beyond simple retrieval.
A renewal-reward analysis yields a closed-form mean-field rule for the optimal Attention/FFN provisioning ratio in disaggregated LLM serving that accounts for stochastic KV-cache growth and matches simulation optima within 10%.
LCLMs are scaled 0.6B-encoder 4B-decoder compressors pre-trained on over 350B tokens that improve the Pareto frontier for general-task performance, compression speed, and peak memory in long-context language model inference.
CPT is introduced as a pairwise reasoning-trace comparison stage that improves the reasoning-metacognition trade-off over standard SFT+RL pipelines across model scales.
A hybrid LLM-symbolic verifier maintains a dependency graph over conversation turns classified into eight formal update operations, enabling linear-time groundedness checks and precise retraction propagation with a conflict-free guarantee.
Linear probes on residual-stream activations identify a shared preference vector in LLMs that tracks choices across prompts and causally steers decisions even for anti-correlated personas.
DMI-Lib delivers 0.4-6.8% overhead for offline batch LLM inference and ~6% for moderate online serving while exposing rich internal signals across backends, cutting latency overhead 2-15x versus prior observability baselines.
Annotation-anchored training reduces semantic diversity collapse in post-trained language models by a factor of six compared to standard supervised fine-tuning while preserving instruction-following and improving with scale.
DisAAD trains a 1%-sized proxy model via adversarial distillation to quantify uncertainty in black-box LLMs by aligning with their output distributions.
Reasoning traces in large reasoning models expose safety failures missed by final-answer checks, and adaptive multi-principle steering reduces unsafe content in both traces and answers while preserving task performance.
A methodological framework and browser system BITE for collecting evolving user preferences on LLM outputs through context-triggered reflections and privacy-preserving data over time.
Switchless topologies such as 3D full-mesh are 20.6-56.2% more cost-effective than scale-up networks for MoE LLM serving, with current link bandwidths over-provisioned by up to 27%.
LenVM models token-level remaining generation length as a bounded discounted value function derived from constant negative per-token rewards, providing a scalable proxy for generation horizon.
Fluent AI users adopt an active, iterative collaboration mode that produces more visible failures but better recovery and success on hard tasks, whereas novices experience more invisible failures from passive use.
ChatGPT expands the diversity of user questions (80% non-searchable) but delivers less diverse responses than Google for comparable queries, creating a feedback loop that may constrain information exposure.
LLMs diverge from human goal selection in self-directed learning by exploiting single solutions with low variability across instances.
citing papers explorer
-
Beyond Third-Person Audits: Situated Interaction Auditing for User-Centered LLM Bias Research
Introduces Situated Interaction Auditing (SIA) to examine how user sociodemographic signals affect LLM response quality, content, and tone in personal interactions.