A new sequential interaction framework lets LLMs propose questions to forums, with simulations on real Stack Exchange data showing players can reach roughly half the utility of an ideal full-information scenario despite incentive misalignment.
Demystifying prompts in language models via perplexity estimation
6 Pith papers cite this work. Polarity classification is still indexing.
representative citing papers
A pipeline derives continuous signed edges from LLM stance scores on text and links discourse signals such as toxicity and extreme claims to changes in structural polarization measured by spectral and frustration scores on Reddit Brexit data.
SnapAudit decomposes DP-ICL into a deterministic snapshot stage and a stochastic noise stage, using bootstrap simulation to achieve 80-200x faster auditing and exposing privacy bound violations in existing Gaussian and embedding mechanisms.
Optimus mitigates toxicity during LLM fine-tuning by combining repurposed LLM safety alignments for detection with synthetic data and DPO alignment, remaining effective even with highly biased classifiers and against attacks.
LLMs are highly sensitive to prompt formatting in few-shot settings, with accuracy varying by up to 76 points across formats; FormatSpread samples formats to report performance intervals without model weights.
Introduces NPAS and AV Filter using LLM attention weights to defend RAG against poisoning, reporting up to 20% accuracy gains while adaptive attacks reach 35% success.
citing papers explorer
-
From Competition to Collaboration: Designing Sustainable Mechanisms Between LLMs and Online Forums
A new sequential interaction framework lets LLMs propose questions to forums, with simulations on real Stack Exchange data showing players can reach roughly half the utility of an ideal full-information scenario despite incentive misalignment.
-
Linking Extreme Discourse to Structural Polarization in Signed Interaction Networks
A pipeline derives continuous signed edges from LLM stance scores on text and links discourse signals such as toxicity and extreme claims to changes in structural polarization measured by spectral and frustration scores on Reddit Brexit data.
-
SnapAudit: Active Auditing of Differentially Private In-Context Learning via Snapshot-Based Simulation
SnapAudit decomposes DP-ICL into a deterministic snapshot stage and a stochastic noise stage, using bootstrap simulation to achieve 80-200x faster auditing and exposing privacy bound violations in existing Gaussian and embedding mechanisms.
-
Optimus: A Robust Defense Framework for Mitigating Toxicity while Fine-Tuning Conversational AI
Optimus mitigates toxicity during LLM fine-tuning by combining repurposed LLM safety alignments for detection with synthetic data and DPO alignment, remaining effective even with highly biased classifiers and against attacks.
-
Quantifying Language Models' Sensitivity to Spurious Features in Prompt Design or: How I learned to start worrying about prompt formatting
LLMs are highly sensitive to prompt formatting in few-shot settings, with accuracy varying by up to 76 points across formats; FormatSpread samples formats to report performance intervals without model weights.
-
Through the Stealth Lens: Attention-Aware Defenses Against Poisoning in RAG
Introduces NPAS and AV Filter using LLM attention weights to defend RAG against poisoning, reporting up to 20% accuracy gains while adaptive attacks reach 35% success.