PERSUASIONTRACE introduces a Bayesian-network simulated target for multi-turn persuasion that matches human belief dynamics (81 vs 80) better than LLM baselines (64) and enables process-level evaluation.
ISBN 979-8-89176-251-0
25 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
roles
background 2
representative citing papers
Re-ranking retrieval candidates via a cross-encoder trained on continuous perturbation-based attribution scores improves citation faithfulness and gold-answer alignment in legal QA over semantic similarity.
Audio LLMs fail to use paralinguistic audio information and default to transcript content; a new adversarial benchmark plus PCLM and DPO training raise accuracy on VoxParadox from 17.4% to 65.2%.
EDEN adaptively sets branching factor proportional to next-token entropy, achieving better accuracy per expansion than fixed beam search while providing a proof that monotone entropy-based branching outperforms any fixed budget allocation.
DialectLLM generates parallel multi-dialect dialog data and a 50k-dialog benchmark showing frontier LLMs achieve under 70% accuracy on dialect tasks while the generated data can improve post-training.
SOLAR aligns soft-token probability mixtures across languages in embedding space during SFT and raises multilingual reasoning accuracy by up to 17.7 points over the base model.
AttriCoT is a black-box algorithm that attributes causal importance to units in a specific CoT trace via a structural causal model estimated with linear forward passes.
AdaMEM proposes hybrid long-term and short-term memory for test-time adaptation in language agents, reporting relative gains of up to 13% on ALFWorld and 11% on WebShop over static baselines.
D-Judge applies semantics-preserving output rewriting, trained via SFT and DPO on paired responses that differ in judge scores, to disrupt multi-turn jailbreak refinement loops and reduce attack success on HarmBench.
JuICE is a new multilingual benchmark dataset showing top LLM judges reach only F1 0.52 on span-level cultural error detection and miss errors locals readily spot.
Priority ranking offers a low-cost direct evaluation for harness optimizers that correlates with their real multi-step optimization performance, supported by the Shor dataset of 182 scenarios.
Proposes forward replay of target hidden states from the first editing layer instead of backward spreading, claiming equivalent complexity but higher accuracy for LLM parameter editing.
ConfLayers dynamically skips LLM layers based on confidence scores to create adaptive draft models for self-speculative decoding, reporting up to 1.4x speedup over standard generation.
Long-context LLMs refuse explicit harmful requests but often comply when the same harmful goals must be inferred from distributed fragments in long contexts.
WorldCup is a new multi-bit LLM watermarking framework that models token sampling as a communication channel and uses hierarchical competition with entropy-aware modulation for robust message embedding and recovery.
RE-TAB uses a deterministic LCS-based table-state reward for stepwise guidance and test-time scaling, raising LLM table-reasoning accuracy by 26.7 pp on average across six backbones and three benchmarks.
SenSE adds language-model semantic guidance to flow-matching generative speech enhancement via a dual-path masked conditioning strategy and reports SOTA results on distorted speech.
Survey that defines agentic RL for LLMs via POMDPs, introduces a taxonomy of planning/tool-use/memory/reasoning capabilities and domains, and compiles open environments from over 500 papers.
Cultural zones explain variance in safety ratings beyond demographics across six datasets, with roughly 10% of items identified as culturally sensitive.
ANCHOR uses hierarchical factor construction and causal Bayesian networks to reduce unknown predictions and improve reliability of LLM-based probability inference over prior Naive Bayes approaches.
Fisher information from the target data distribution supplies a task-dependent criterion for selecting LoRA directions that outperforms weight-magnitude heuristics.
A survey synthesizing LLM methods for peer review generation, post-review tasks like rebuttals and meta-reviews, evaluation approaches, datasets, and future directions in AI-assisted academic publishing.
citing papers explorer
- Do Audio LLMs Listen or Read? Analyzing and Mitigating Paralinguistic Failures with VoxParadox